📊 ArXiv 研究报告 (2026-04-07)

生成时间: 2026-04-07 09:15:14 数据源: ArXiv

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

关键词	权重	类型
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	主要
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	主要
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	主要
“Scaling Laws” AND “Data Quality”	1.0	主要
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	主要
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	主要
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	主要
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	主要
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	主要
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	主要
“Context Window Extension” OR “Long Context LLMs”	1.0	主要
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	主要
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	主要
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	主要
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	主要
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	主要
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	主要
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	主要
“Multi-agent Systems” OR “Agent Coordination”	1.0	主要
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	主要
“Speculative Decoding” OR “Inference Acceleration”	1.0	主要
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	主要
“Mechanistic Interpretability” OR “Explainable AI”	1.0	主要
“World Models” AND “General World Models”	1.0	主要
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	主要
“In-context Learning” OR “Many-shot Learning”	1.0	主要
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	主要

评分设置

每个关键词最大分: 15
及格分公式: 5.0 + 0.8 × 总权重
当前及格分: 26.6

📈 论文统计

总抓取: 242 篇
及格论文: 9 篇 (3.7%)

⭐ 及格论文详细分析

1. AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

作者: Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02617v1

评分: 57.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

评分理由: 论文提出AutoVerifier，一个基于LLM的智能体框架，用于自动化验证技术声明，因此与"Large Language Models"和"LLM Agents"高度相关（10分）。其结构化推理过程涉及多步分析和证据追踪，与"Chain of Thought"和"System 2 Thinking"相关（8分）。框架旨在提高事实性和减少错误，与"Hallucination Mitigation"相关（8分）。应用场景为科学和技术情报分析，属于"AI for Science"范畴（8分）。框架涉及检索和生成知识图谱，与"Retrieval-Augmented Generation"有一定关联（5分）。其他关键词如MoE、量化、训练方法等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了AutoVerifier，一个基于大语言模型的智能体框架，用于自动化验证复杂技术声明，通过结构化推理和证据追踪，成功在量子计算案例中识别了过度声明、指标不一致和未披露的利益冲突，从而将原始技术文档转化为可追溯的、有证据支持的情报评估。

摘要翻译

科技情报分析需要在快速增长的文献中验证复杂的技术主张，而现有方法难以弥合表层准确性与深层方法有效性之间的验证鸿沟。我们提出AutoVerifier——一个基于大语言模型的智能体框架，无需领域专业知识即可实现技术主张的端到端自动化验证。该框架将每个技术论断分解为（主体，谓词，客体）形式的结构化主张三元组，通过构建知识图谱实现跨六个渐进增强层的结构化推理：文献库构建与导入、实体与主张提取、文档内验证、跨源验证、外部信号佐证以及最终假设矩阵生成。我们以一项存在争议的量子计算主张为例展示AutoVerifier的效能：在操作者不具备量子专业知识的情况下，该框架自动识别出目标论文中的过度宣称与度量标准不一致问题，追溯跨文献矛盾，发现未披露的商业利益冲突，并生成最终评估报告。这些结果表明，结构化的大语言模型验证能够可靠评估新兴技术的有效性与成熟度，将原始技术文档转化为可追溯、有证据支撑的情报评估。

摘要 (Abstract)

Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.

关键词: AutoVerifier, LLM-based agentic framework, technical claim verification, structured reasoning, knowledge graphs, Scientific and Technical Intelligence, quantum computing, evidence-backed assessment

2. Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Traini

作者: Qihui Fan, Min Ge, Chenyan Jia, Weiyan Shi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02637v1

评分: 56.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究LLM训练流程（预训练、SFT、RLHF）对AI素养和说服力的影响，与LLM、SFT、RLHF高度相关（10分），与预训练、事实性/真实性缓解相关（8分），与对齐、可解释AI有一定关联（5分），其他关键词未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了通过角色扮演LLM训练流程（预训练、SFT、RLHF）的交互式AI素养教程LLMimic如何提高参与者的AI素养，减少AI说服成功率，并增强真实性和社会责任感。

摘要翻译

随着大型语言模型（LLM）的说服力日益增强，人们担忧其可能在不同情境下大规模影响公众的观点与决策。现有的缓解措施（如AI检测器和免责声明）大多将人类视为AI生成信息的被动接收者。为更主动地干预具有说服力的AI，我们提出了$\textbf{LLMimic}$——一种基于角色扮演、交互式、游戏化的AI素养教程。在该教程中，参与者扮演LLM的角色，并依次经历模型训练流程的三个关键阶段：预训练、监督微调（SFT）和基于人类反馈的强化学习（RLHF）。我们开展了一项$2 \times 3$的组间实验（$N = 274$），参与者被分为两组：一组观看AI发展史视频（对照组），另一组与LLMimic进行交互（实验组），随后参与三种现实AI说服场景中的一种：（a）慈善捐款劝说，（b）恶意资金索取，或（c）酒店推荐。研究结果表明，LLMimic显著提升了参与者的AI素养（$p < .001$），降低了各场景中的说服成功率（$p < .05$），并在酒店推荐场景中提高了回答的真实性与社会责任水平（$p<0.01$）。这些发现表明，LLMimic提供了一种可扩展、以人为本的方法，能够有效提升AI素养，并支持人们与具有说服力的AI进行更明智的互动。

摘要 (Abstract)

As large language models (LLMs) become increasingly persuasive, there is concern that people’s opinions and decisions may be influenced across various contexts at scale. Prior mitigation (e.g., AI detectors and disclaimers) largely treats people as passive recipients of AI-generated information. To provide a more proactive intervention against persuasive AI, we introduce $\textbf{LLMimic}$, a role-play-based, interactive, gamified AI literacy tutorial, where participants assume the role of an LLM and progress through three key stages of the training pipeline (pretraining, SFT, and RLHF). We conducted a $2 \times 3$ between-subjects study ($N = 274$) where participants either (1) watched an AI history video (control) or (2) interacted with LLMimic (treatment), and then engaged in one of three realistic AI persuasion scenarios: (a) charity donation persuasion, (b) malicious money solicitation, or (c) hotel recommendation. Our results show that LLMimic significantly improved participants’ AI literacy ($p < .001$), reduced persuasion success across scenarios ($p < .05$), and enhanced truthfulness and social responsibility levels ($p<0.01$) in the hotel scenario. These findings suggest that LLMimic offers a scalable, human-centered approach to improving AI literacy and supporting more informed interactions with persuasive AI.

关键词: large language models, AI literacy, persuasion, SFT, RLHF, pretraining, truthfulness, human-centered AI

3. R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

作者: Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu, Xuanyu Lei, Ming Yan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03004v1

评分: 55.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文R2-Write专注于通过反思和修订模式提升大语言模型在开放式写作任务中的深度推理能力。核心相关关键词包括：LLMs（论文明确研究大语言模型在写作中的应用）、Chain of Thought/CoT Reasoning（论文系统研究长思维链在开放式任务中的有效性）、System 2 Thinking/In-depth Reasoning（论文旨在解锁深度推理能力）、Self-Correction/Self-Reflection（论文核心框架基于反思和修订模式）。部分相关关键词：RLHF（论文使用强化学习进行过程奖励监督，但非核心对齐方法）、LLM Agents/Multi-agent Systems（论文框架涉及writer-judge交互，可视为简单多智能体协调，但非主要焦点）。其余关键词与论文的技术方法（如MoE、量化、RAG等）或应用领域（如科学AI）无直接关联。

!!! tip deepseek-chat TL;DR

该论文针对现有大语言模型在开放式写作任务中缺乏深度反思和修订模式导致改进有限的问题，提出了R2-Write框架，通过迭代的writer-judge交互合成高质量的思维轨迹，并设计过程奖励机制监督反思质量，从而显著提升了模型在创意写作和深度研究基准上的性能。

摘要翻译

尽管长链思维深度推理已在数学等可验证领域显著提升了大语言模型的性能，其在开放式任务（如写作）中的有效性仍未被探索。本文通过系统性研究发现，现有主流推理模型在开放式写作任务上取得的提升有限。进一步分析表明，这些模型在开放式写作中缺乏深度反思与修订模式，导致其改进幅度远小于数学推理任务。为应对这一局限，我们提出了R2-Write：一种通过迭代式“作者-评判者”交互生成高质量思维轨迹的自动化框架，该轨迹融合了显性的反思与修订模式。为避免冗余反思，我们设计了过程奖励机制，在强化学习过程中监督反思质量，从而提升性能并优化计算效率。在多项创意写作与深度研究基准上的广泛实验证明了该方法的显著改进，验证了显式融入反思与修订模式能够为开放式写作任务解锁深度推理能力。

摘要 (Abstract)

While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

关键词: Large Language Models, Open-ended Writing, Chain-of-Thought, Reflection, Revision, Deep Reasoning, Reinforcement Learning, Writer-Judge Interaction

4. Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attribute

作者: Zhennan Lin, Shuai Wang, Zhaokai Sun, Pengyuan Xie, Chuan Xie, Jie Liu, Qiang Zhang, Lei Xie 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03074v1

评分: 54.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文提出Speaker-Reasoner，一个用于多说话人自动语音识别（ASR）的端到端语音大语言模型（Speech LLM），核心创新在于采用代理式多轮时序推理（agentic multi-turn temporal reasoning）来处理重叠语音、快速话轮转换等挑战。因此，与"Large Language Models"（10分）和"LLM Agents"（10分）高度相关，因为模型被描述为具有代理推理能力的Speech LLM。与"Context Window Extension"（8分）相关，因为提到了speaker-aware cache以处理超出训练上下文窗口的音频。与"Chain of Thought"和"System 2 Thinking"（各8分）相关，因为模型采用迭代、多步的推理过程（分析全局结构、预测边界、细粒度分析）。与"Pre-training"和"Post-training"（各5分）有一定关联，因为提到了三阶段渐进训练策略，可能涉及预训练和微调。其他关键词如MoE、SLMs、Scaling Laws、Alignment、RAG、Quantization等，论文未直接涉及或不是核心，故给0分。

!!! tip deepseek-chat TL;DR

该论文针对多说话人场景下语音识别、说话人归属和时间戳定位的挑战，提出了Speaker-Reasoner——一种采用代理式多轮时序推理的端到端语音大语言模型，通过迭代分析和说话人感知缓存，在AliMeeting和AISHELL-4数据集上实现了对重叠语音和复杂话轮转换的持续改进。

摘要翻译

转录和理解多人对话需要完成语音识别、说话人归属和时间戳定位三项任务。虽然语音大语言模型在单人任务中表现出色，但在多人场景中，由于重叠语音、反馈性发声、快速话轮转换以及上下文窗口限制，该任务仍具挑战性。我们提出了Speaker-Reasoner，一种具备智能多轮时序推理能力的端到端语音大语言模型。该模型摒弃单次推理模式，转而迭代分析全局音频结构，自主预测时间边界，并进行细粒度片段分析，从而联合建模说话人身份、性别、时间戳和转录文本。一个说话人感知缓存机制进一步将处理能力扩展至超出训练上下文窗口的音频。通过采用三阶段渐进式策略进行训练，Speaker-Reasoner在AliMeeting和AISHELL-4数据集上相较于强基线模型取得了持续的性能提升，尤其在处理重叠语音和复杂话轮转换方面表现突出。

摘要 (Abstract)

Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.

关键词: Speech LLM, multi-speaker ASR, agentic reasoning, temporal reasoning, speaker attribution, timestamp localization, overlapping speech, context window extension

5. FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

作者: Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, Guojie Song 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02967v1

评分: 48.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究大型推理模型（LRMs）在复杂推理任务中的表现，核心发现是“第一个解决方案最优”现象，并提出了一个名为RED的自引导高效推理框架。该研究与以下关键词高度相关：1) “Large Language Models” OR “LLMs” OR “Foundation Models”（10分）：论文明确研究大型推理模型，如DeepSeek-R1，属于大模型范畴。2) “Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”（10分）：论文专注于复杂推理任务，涉及多步推理和思维链。3) “System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”（10分）：论文探讨深度推理和错误分析，与系统2思维相关。4) “Self-Correction” OR “Self-Improvement” OR “Self-Reflection”（10分）：RED框架包含自我引导的推理和错误抑制，涉及自我校正。5) “Scaling Laws” AND “Data Quality”（8分）：论文挑战测试时缩放定律，并分析错误缩放，与缩放定律有一定关联。其他关键词如MoE、SLMs、训练方法、RAG、量化等未在论文中提及，因此评分为0。

!!! tip deepseek-chat TL;DR

该论文研究发现，在大型推理模型中，第一个解决方案往往最优，后续方案可能因错误累积而变差，并基于此提出了RED框架，通过优化第一个方案和修剪后续方案，在提升性能的同时显著减少计算开销。

摘要翻译

近期的大型推理模型（Large Reasoning Models, LRMs），如DeepSeek-R1，在复杂推理任务中展现出显著的成功，其探索多种替代解决方案的模式呈现出类人的特点。然而，通过更细致的观察，我们发现了一个令人惊讶的现象：“首选即最佳”，即替代方案不仅并非更优，反而可能具有损害性。这一观察挑战了被广泛接受的测试时缩放定律，促使我们提出假设：推理路径中的错误会随着测试时间同步增加。通过全面的实证分析，我们将错误表征为一种森林结构的错误森林（Forest of Errors, FoE），并得出结论：错误森林使得“首选即最佳”现象成立，这一结论得到了严格理论分析的支持。基于这些洞见，我们提出了RED——一个自引导的高效推理框架，包含两个组成部分：I) 精炼首选，旨在抑制首个解决方案中错误森林的生长；以及II) 舍弃后续，通过双重一致性对后续的错误森林进行剪枝。在五个基准测试和六个骨干模型上进行的大量实验表明，RED优于八个竞争性基线方法，实现了高达19.0%的性能提升，同时将令牌消耗降低了37.7% ~ 70.4%。此外，在错误森林指标上的对比实验揭示了RED实现有效性的内在机制。

摘要 (Abstract)

Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.

关键词: Large Reasoning Models, Forest of Errors, First Solution, Self-guided Reasoning, Error Scaling, Reasoning Efficiency, Token Consumption Reduction, Dual-consistency Pruning

6. Overcoming the “Impracticality” of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagn

作者: Kenichirou Narita, Siqi Peng, Taku Fukui, Moyuru Yamada, Satoshi Munakata, Satoru Takahashi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02640v1

评分: 41.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究RAG系统在真实企业环境中的评估问题，与关键词"Retrieval-Augmented Generation"高度相关（10分）。论文涉及LLMs作为RAG的基础组件（8分），并强调企业部署中的可解释性要求，与"Explainable AI"相关（8分）。论文提到评估需考虑推理复杂性，与"Chain of Thought"和"System 2 Thinking"有一定关联（各5分）。论文关注RAG系统在实际部署中的可靠性，与缓解幻觉、提升事实性有一定关联（5分）。其他关键词如MoE、SFT、量化等与论文内容无直接关系，评0分。

!!! tip deepseek-chat TL;DR

该论文针对现有检索增强生成系统在企业环境中评估不足的问题，提出了一个多维诊断框架和真实世界基准，以系统性地诊断RAG系统在推理复杂性、检索难度、文档结构和可解释性等方面的潜在弱点。

摘要翻译

企业环境中检索增强生成（RAG）系统的性能评估受多维复合因素支配，其范围远超简单的最终准确性检验。这些因素包括推理复杂性、检索难度、文档的多样化结构以及对操作可解释性的严格要求。现有学术基准未能系统性地诊断这些相互关联的挑战，导致一个关键差距：在学术评估中获得高分的模型在实际部署中往往无法达到预期的可靠性。为弥合这一差异，本研究提出一个多维诊断框架，通过定义四轴难度分类法并将其整合到企业RAG基准中，以诊断系统的潜在弱点。

摘要 (Abstract)

Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.

关键词: Retrieval-Augmented Generation, RAG, enterprise benchmark, diagnostic framework, reasoning complexity, retrieval difficulty, explainability, system reliability

7. Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

作者: Guoling Zhou, Wenpei Han, Fengqin Yang, Li Wang, Yingcong Zhou, Zhiguo Fu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02770v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	5.0/10	5.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究LLM驱动的多智能体系统中的角色一致性问题，与"LLM Agents"和"Multi-agent Systems"高度相关（10分）。论文明确使用LLMs（10分），并采用轻量级微调方法，与"Post-training/SFT"和"PEFT/LoRA"有一定关联（5分）。其他关键词如MoE、Scaling Laws、RAG、CoT等未在摘要中提及或与核心问题无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对LLM驱动的多智能体系统中智能体不遵守角色规范的问题，提出了一种基于定量角色清晰度的正则化方法，通过轻量级微调显著提高了角色一致性和任务成功率。

摘要翻译

在大语言模型（LLM）驱动的多智能体系统中，角色规范违背（即未能遵循所分配角色的既定职责与约束，可能导致智能体行为类似于其他角色）是一种主要的故障模式。为应对此问题，本文提出一种定量的角色清晰度指标以提升角色一致性。首先，我们构建一个角色分配矩阵 $S(φ)=[s_{ij}(φ)]$，其中 $s_{ij}(φ)$ 表示第 $i$ 个智能体的行为轨迹与第 $j$ 个智能体的角色描述之间的语义相似度。随后，我们将角色清晰度矩阵 $M(φ)$ 定义为 $\text{softmax}(S(φ))-I$，其中 $\text{softmax}(S(φ))$ 是 $S(φ)$ 按行计算的 softmax 结果，$I$ 为单位矩阵。$M(φ)$ 的 Frobenius 范数可量化智能体角色描述与其行为轨迹之间的对齐程度。此外，我们在轻量级微调过程中将角色清晰度矩阵作为正则化项，以提升角色一致性，从而改善端到端任务性能。在 ChatDev 多智能体系统上的实验表明，我们的方法显著提升了角色一致性与任务性能：使用 Qwen 和 Llama 模型时，角色越界率分别从 $46.4%$ 降至 $8.4%$ 和从 $43.4%$ 降至 $0.2%$，角色清晰度得分分别从 $0.5328$ 提升至 $0.9097$ 和从 $0.5007$ 提升至 $0.8530$，任务成功率分别从 $0.6769$ 提升至 $0.6909$ 和从 $0.6174$ 提升至 $0.6763$。

摘要 (Abstract)

In large language model (LLM)-driven multi-agent systems, disobey role specification (failure to adhere to the defined responsibilities and constraints of an assigned role, potentially leading to an agent behaving like another) is a major failure mode \cite{DBLP:journals/corr/abs-2503-13657}. To address this issue, in the present paper, we propose a quantitative role clarity to improve role consistency. Firstly, we construct a role assignment matrix $S(φ)=[s_{ij}(φ)]$, where $s_{ij}(φ)$ is the semantic similarity between the $i$-th agent’s behavior trajectory and the $j$-th agent’s role description. Then we define role clarity matrix $M(φ)$ as $\text{softmax}(S(φ))-I$, where $\text{softmax}(S(φ))$ is a row-wise softmax of $S(φ)$ and $I$ is the identity matrix. The Frobenius norm of $M(φ)$ quantifies the alignment between agents’ role descriptions and their behaviors trajectory. Moreover, we employ the role clarity matrix as a regularizer during lightweight fine-tuning to improve role consistency, thereby improving end-to-end task performance. Experiments on the ChatDev multi-agent system show that our method substantially improves role consistency and task performance: with Qwen and Llama, the role overstepping rate decreases from $46.4%$ to $8.4%$ and from $43.4%$ to $0.2%$, respectively, and the role clarity score increases from $0.5328$ to $0.9097$ and from $0.5007$ to $0.8530$, respectively, the task success rate increases from $0.6769$ to $0.6909$ and from $0.6174$ to $0.6763$, respectively.

关键词: Large Language Models, Multi-agent Systems, Role Consistency, Role Clarity, Lightweight Fine-tuning, Agent Collaboration, Semantic Similarity, Task Performance

8. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

作者: Yubin Qu, Yi Liu, Tongcheng Geng, Gelei Deng, Yuekang Li, Leo Yu Zhang, Ying Zhang, Lei Ma 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03081v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究LLM编码代理的安全漏洞，核心涉及LLM代理（LLM Agents）和供应链攻击，与"Large Language Models"和"LLM Agents"高度相关（10分）。论文提到对齐（alignment）检测和工具使用（如文件写入、shell命令），与"Instruction Tuning"和"Tool Use"有一定关联（5分）。其他关键词如MoE、SLMs、训练方法、推理技术、科学AI应用等均未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了针对LLM编码代理技能生态系统的供应链攻击，提出了一种通过文档嵌入恶意负载的方法（DDIPE），在多个框架和模型中实现了11.6%至33.5%的绕过率，并发现了现有防御措施的漏洞。

摘要翻译

基于大语言模型的编码智能体通过开放市场中分发的第三方智能体技能扩展其能力，这些技能无需经过强制性安全审查。与传统软件包不同，这些技能以具备系统级权限的操作指令形式执行，因此单个恶意技能即可危害宿主系统。尽管现有防护机制存在，先前研究尚未探讨供应链攻击是否能够直接劫持智能体的操作空间（例如文件写入、shell命令和网络请求）。我们提出文档驱动的隐式载荷执行（Document-Driven Implicit Payload Execution, DDIPE）攻击方法，将恶意逻辑嵌入技能文档内的代码示例和配置模板中。由于智能体在常规任务中会复用这些示例，载荷得以在无需显式指令触发的情况下执行。通过采用大语言模型驱动的流程，我们从81个种子技能中生成覆盖15个MITRE ATTACK类别的1,070个对抗性技能。在四个框架和五个模型的测试中，DDIPE实现了11.6%至33.5%的绕过率，而显式指令攻击在强防御机制下的成功率则为0%。静态分析可检测大多数案例，但仍有2.5%的案例能同时规避检测和对齐机制。通过负责任披露，我们已确认四个相关漏洞并推动完成两项修复。

摘要 (Abstract)

LLM-based coding agents extend their capabilities via third-party agent skills distributed through open marketplaces without mandatory security review. Unlike traditional packages, these skills are executed as operational directives with system-level privileges, so a single malicious skill can compromise the host. Prior work has not examined whether supply-chain attacks can directly hijack an agent’s action space, such as file writes, shell commands, and network requests, despite existing safeguards. We introduce Document-Driven Implicit Payload Execution (DDIPE), which embeds malicious logic in code examples and configuration templates within skill documentation. Because agents reuse these examples during normal tasks, the payload executes without explicit prompts. Using an LLM-driven pipeline, we generate 1,070 adversarial skills from 81 seeds across 15 MITRE ATTACK categories. Across four frameworks and five models, DDIPE achieves 11.6% to 33.5% bypass rates, while explicit instruction attacks achieve 0% under strong defenses. Static analysis detects most cases, but 2.5% evade both detection and alignment. Responsible disclosure led to four confirmed vulnerabilities and two fixes.

关键词: LLM-based coding agents, supply-chain attacks, agent skills, Document-Driven Implicit Payload Execution, security vulnerabilities, adversarial skills, bypass rates, static analysis

9. Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following

作者: Tianze Xu, Yanzhao Zheng, Pengrui Lu, Lyumanshan Ye, Yong Wu, Zhentao Zhang, Yuanqiang Yu, Chao Ma, Jihuai Zhu, Pengfei Liu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02795v1

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）在指令跟随任务中的对齐问题，提出了一种基于准则的强化学习框架RTT，直接涉及"Large Language Models"、“Instruction Tuning/Alignment"和"RLHF/RLAIF/DPO"等关键词，并高度相关（10分）。其他关键词如MoE、SLMs、Scaling Laws、PEFT、RAG、推理方法、代理、压缩、科学AI等均未在摘要中提及，与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在复杂开放域指令跟随任务中，基于准则的强化学习方法存在的奖励稀疏性和模糊性问题，提出了一种名为Rubrics to Tokens（RTT）的新框架，通过引入令牌级相关性判别器和RTT-GRPO优化方法，将粗粒度响应级评分与细粒度令牌级信用分配相结合，实验表明RTT在指令和准则级别准确性上均优于基线方法。

摘要翻译

基于量规的强化学习已成为使大语言模型与复杂开放域指令跟随任务对齐的有效方法。然而，现有方法主要依赖响应级奖励，导致严重的奖励稀疏性与奖励模糊性问题。为解决这些问题，我们提出“量规到令牌”方法，这是一种新颖的基于量规的强化学习框架，旨在连接粗粒度的响应级评分与细粒度的令牌级信用分配。该方法引入了一个令牌级相关性判别器，用于预测响应中哪些令牌应对特定约束负责，并通过RTT-GRPO算法优化策略模型，该算法将响应级与令牌级优势集成在统一框架内。此外，当基于令牌级量规的强化学习从一维结果级奖励转向三维奖励空间时，我们提出了一种新颖的组归一化方法——样本内令牌组归一化，以适应这一转变。大量实验与基准测试表明，在不同模型上，该方法在指令级与量规级准确率方面均持续优于其他基线模型。

摘要 (Abstract)

Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.

关键词: Rubric-based Reinforcement Learning, Large Language Models, Instruction Following, Reward Sparsity, Token-level Credit Assignment, RTT-GRPO, Alignment, Benchmark Evaluation

📋 所有论文列表

1. ✅ AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

作者: Yuntao Du, Minh Dinh, Kaiyuan Zhang, Ninghui Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02617v1

评分: 57.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	5.0/10	5.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

!!! tip deepseek-chat TL;DR

该论文提出了AutoVerifier，一个基于大语言模型的智能体框架，用于自动化验证复杂技术声明，通过结构化推理和证据追踪，成功在量子计算案例中识别了过度声明、指标不一致和未披露的利益冲突，从而将原始技术文档转化为可追溯的、有证据支持的情报评估。

摘要翻译

科技情报分析需要在快速增长的文献中验证复杂的技术主张，而现有方法难以弥合表层准确性与深层方法有效性之间的验证鸿沟。我们提出AutoVerifier——一个基于大语言模型的智能体框架，无需领域专业知识即可实现技术主张的端到端自动化验证。该框架将每个技术论断分解为（主体，谓词，客体）形式的结构化主张三元组，通过构建知识图谱实现跨六个渐进增强层的结构化推理：文献库构建与导入、实体与主张提取、文档内验证、跨源验证、外部信号佐证以及最终假设矩阵生成。我们以一项存在争议的量子计算主张为例展示AutoVerifier的效能：在操作者不具备量子专业知识的情况下，该框架自动识别出目标论文中的过度宣称与度量标准不一致问题，追溯跨文献矛盾，发现未披露的商业利益冲突，并生成最终评估报告。这些结果表明，结构化的大语言模型验证能够可靠评估新兴技术的有效性与成熟度，将原始技术文档转化为可追溯、有证据支撑的情报评估。

摘要 (Abstract)

Scientific and Technical Intelligence (S&TI) analysis requires verifying complex technical claims across rapidly growing literature, where existing approaches fail to bridge the verification gap between surface-level accuracy and deeper methodological validity. We present AutoVerifier, an LLM-based agentic framework that automates end-to-end verification of technical claims without requiring domain expertise. AutoVerifier decomposes every technical assertion into structured claim triples of the form (Subject, Predicate, Object), constructing knowledge graphs that enable structured reasoning across six progressively enriching layers: corpus construction and ingestion, entity and claim extraction, intra-document verification, cross-source verification, external signal corroboration, and final hypothesis matrix generation. We demonstrate AutoVerifier on a contested quantum computing claim, where the framework, operated by analysts with no quantum expertise, automatically identified overclaims and metric inconsistencies within the target paper, traced cross-source contradictions, uncovered undisclosed commercial conflicts of interest, and produced a final assessment. These results show that structured LLM verification can reliably evaluate the validity and maturity of emerging technologies, turning raw technical documents into traceable, evidence-backed intelligence assessments.

2. ✅ Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training

作者: Qihui Fan, Min Ge, Chenyan Jia, Weiyan Shi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02637v1

评分: 56.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	8.0/10	8.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文研究了通过角色扮演LLM训练流程（预训练、SFT、RLHF）的交互式AI素养教程LLMimic如何提高参与者的AI素养，减少AI说服成功率，并增强真实性和社会责任感。

摘要翻译

随着大型语言模型（LLM）的说服力日益增强，人们担忧其可能在不同情境下大规模影响公众的观点与决策。现有的缓解措施（如AI检测器和免责声明）大多将人类视为AI生成信息的被动接收者。为更主动地干预具有说服力的AI，我们提出了$\textbf{LLMimic}$——一种基于角色扮演、交互式、游戏化的AI素养教程。在该教程中，参与者扮演LLM的角色，并依次经历模型训练流程的三个关键阶段：预训练、监督微调（SFT）和基于人类反馈的强化学习（RLHF）。我们开展了一项$2 \times 3$的组间实验（$N = 274$），参与者被分为两组：一组观看AI发展史视频（对照组），另一组与LLMimic进行交互（实验组），随后参与三种现实AI说服场景中的一种：（a）慈善捐款劝说，（b）恶意资金索取，或（c）酒店推荐。研究结果表明，LLMimic显著提升了参与者的AI素养（$p < .001$），降低了各场景中的说服成功率（$p < .05$），并在酒店推荐场景中提高了回答的真实性与社会责任水平（$p<0.01$）。这些发现表明，LLMimic提供了一种可扩展、以人为本的方法，能够有效提升AI素养，并支持人们与具有说服力的AI进行更明智的互动。

摘要 (Abstract)

As large language models (LLMs) become increasingly persuasive, there is concern that people’s opinions and decisions may be influenced across various contexts at scale. Prior mitigation (e.g., AI detectors and disclaimers) largely treats people as passive recipients of AI-generated information. To provide a more proactive intervention against persuasive AI, we introduce $\textbf{LLMimic}$, a role-play-based, interactive, gamified AI literacy tutorial, where participants assume the role of an LLM and progress through three key stages of the training pipeline (pretraining, SFT, and RLHF). We conducted a $2 \times 3$ between-subjects study ($N = 274$) where participants either (1) watched an AI history video (control) or (2) interacted with LLMimic (treatment), and then engaged in one of three realistic AI persuasion scenarios: (a) charity donation persuasion, (b) malicious money solicitation, or (c) hotel recommendation. Our results show that LLMimic significantly improved participants’ AI literacy ($p < .001$), reduced persuasion success across scenarios ($p < .05$), and enhanced truthfulness and social responsibility levels ($p<0.01$) in the hotel scenario. These findings suggest that LLMimic offers a scalable, human-centered approach to improving AI literacy and supporting more informed interactions with persuasive AI.

关键词: large language models, AI literacy, persuasion, SFT, RLHF, pretraining, truthfulness, human-centered AI

3. ✅ R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

作者: Wanlong Liu, Bo Zhang, Chenliang Li, Shaopeng Lai, Yuning Wu, Xuanyu Lei, Ming Yan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03004v1

评分: 55.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对现有大语言模型在开放式写作任务中缺乏深度反思和修订模式导致改进有限的问题，提出了R2-Write框架，通过迭代的writer-judge交互合成高质量的思维轨迹，并设计过程奖励机制监督反思质量，从而显著提升了模型在创意写作和深度研究基准上的性能。

摘要翻译

尽管长链思维深度推理已在数学等可验证领域显著提升了大语言模型的性能，其在开放式任务（如写作）中的有效性仍未被探索。本文通过系统性研究发现，现有主流推理模型在开放式写作任务上取得的提升有限。进一步分析表明，这些模型在开放式写作中缺乏深度反思与修订模式，导致其改进幅度远小于数学推理任务。为应对这一局限，我们提出了R2-Write：一种通过迭代式“作者-评判者”交互生成高质量思维轨迹的自动化框架，该轨迹融合了显性的反思与修订模式。为避免冗余反思，我们设计了过程奖励机制，在强化学习过程中监督反思质量，从而提升性能并优化计算效率。在多项创意写作与深度研究基准上的广泛实验证明了该方法的显著改进，验证了显式融入反思与修订模式能够为开放式写作任务解锁深度推理能力。

摘要 (Abstract)

While deep reasoning with long chain-of-thought has dramatically improved large language models in verifiable domains like mathematics, its effectiveness for open-ended tasks such as writing remains unexplored. In this paper, we conduct a systematic investigation revealing that existing mainstream reasoning models achieve limited gains on open-ended writing tasks. Our further analysis shows that these models lack deep reflection and revision patterns in open-ended writing, resulting in substantially smaller improvements compared to mathematical reasoning tasks. To address this limitation, we introduce R2-Write: an automated framework that synthesizes high-quality thinking trajectories enriched with explicit reflection and revision patterns through iterative writer-judge interaction. To prevent redundant reflections, we design a process reward mechanism that supervises reflection quality during reinforcement learning, improving both performance and token efficiency. Extensive experiments across multiple creative writing and deep-research benchmarks demonstrate significant improvements, validating that explicitly incorporating reflection and revision patterns unlocks deep reasoning capabilities for open-ended writing tasks.

关键词: Large Language Models, Open-ended Writing, Chain-of-Thought, Reflection, Revision, Deep Reasoning, Reinforcement Learning, Writer-Judge Interaction

4. ✅ Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

评分: 54.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文提出Speaker-Reasoner，一个用于多说话人自动语音识别（ASR）的端到端语音大语言模型（Speech LLM），核心创新在于采用代理式多轮时序推理（agentic multi-turn temporal reasoning）来处理重叠语音、快速话轮转换等挑战。因此，与"Large Language Models”（10分）和"LLM Agents"（10分）高度相关，因为模型被描述为具有代理推理能力的Speech LLM。与"Context Window Extension"（8分）相关，因为提到了speaker-aware cache以处理超出训练上下文窗口的音频。与"Chain of Thought"和"System 2 Thinking"（各8分）相关，因为模型采用迭代、多步的推理过程（分析全局结构、预测边界、细粒度分析）。与"Pre-training"和"Post-training"（各5分）有一定关联，因为提到了三阶段渐进训练策略，可能涉及预训练和微调。其他关键词如MoE、SLMs、Scaling Laws、Alignment、RAG、Quantization等，论文未直接涉及或不是核心，故给0分。

!!! tip deepseek-chat TL;DR

该论文针对多说话人场景下语音识别、说话人归属和时间戳定位的挑战，提出了Speaker-Reasoner——一种采用代理式多轮时序推理的端到端语音大语言模型，通过迭代分析和说话人感知缓存，在AliMeeting和AISHELL-4数据集上实现了对重叠语音和复杂话轮转换的持续改进。

摘要翻译

转录和理解多人对话需要完成语音识别、说话人归属和时间戳定位三项任务。虽然语音大语言模型在单人任务中表现出色，但在多人场景中，由于重叠语音、反馈性发声、快速话轮转换以及上下文窗口限制，该任务仍具挑战性。我们提出了Speaker-Reasoner，一种具备智能多轮时序推理能力的端到端语音大语言模型。该模型摒弃单次推理模式，转而迭代分析全局音频结构，自主预测时间边界，并进行细粒度片段分析，从而联合建模说话人身份、性别、时间戳和转录文本。一个说话人感知缓存机制进一步将处理能力扩展至超出训练上下文窗口的音频。通过采用三阶段渐进式策略进行训练，Speaker-Reasoner在AliMeeting和AISHELL-4数据集上相较于强基线模型取得了持续的性能提升，尤其在处理重叠语音和复杂话轮转换方面表现突出。

摘要 (Abstract)

Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.

关键词: Speech LLM, multi-speaker ASR, agentic reasoning, temporal reasoning, speaker attribution, timestamp localization, overlapping speech, context window extension

5. ✅ FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

作者: Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, Guojie Song 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02967v1

评分: 48.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文研究发现，在大型推理模型中，第一个解决方案往往最优，后续方案可能因错误累积而变差，并基于此提出了RED框架，通过优化第一个方案和修剪后续方案，在提升性能的同时显著减少计算开销。

摘要翻译

近期的大型推理模型（Large Reasoning Models, LRMs），如DeepSeek-R1，在复杂推理任务中展现出显著的成功，其探索多种替代解决方案的模式呈现出类人的特点。然而，通过更细致的观察，我们发现了一个令人惊讶的现象：“首选即最佳”，即替代方案不仅并非更优，反而可能具有损害性。这一观察挑战了被广泛接受的测试时缩放定律，促使我们提出假设：推理路径中的错误会随着测试时间同步增加。通过全面的实证分析，我们将错误表征为一种森林结构的错误森林（Forest of Errors, FoE），并得出结论：错误森林使得“首选即最佳”现象成立，这一结论得到了严格理论分析的支持。基于这些洞见，我们提出了RED——一个自引导的高效推理框架，包含两个组成部分：I) 精炼首选，旨在抑制首个解决方案中错误森林的生长；以及II) 舍弃后续，通过双重一致性对后续的错误森林进行剪枝。在五个基准测试和六个骨干模型上进行的大量实验表明，RED优于八个竞争性基线方法，实现了高达19.0%的性能提升，同时将令牌消耗降低了37.7% ~ 70.4%。此外，在错误森林指标上的对比实验揭示了RED实现有效性的内在机制。

摘要 (Abstract)

Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.

关键词: Large Reasoning Models, Forest of Errors, First Solution, Self-guided Reasoning, Error Scaling, Reasoning Efficiency, Token Consumption Reduction, Dual-consistency Pruning

6. ✅ Overcoming the “Impracticality” of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

评分: 41.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对现有检索增强生成系统在企业环境中评估不足的问题，提出了一个多维诊断框架和真实世界基准，以系统性地诊断RAG系统在推理复杂性、检索难度、文档结构和可解释性等方面的潜在弱点。

摘要翻译

企业环境中检索增强生成（RAG）系统的性能评估受多维复合因素支配，其范围远超简单的最终准确性检验。这些因素包括推理复杂性、检索难度、文档的多样化结构以及对操作可解释性的严格要求。现有学术基准未能系统性地诊断这些相互关联的挑战，导致一个关键差距：在学术评估中获得高分的模型在实际部署中往往无法达到预期的可靠性。为弥合这一差异，本研究提出一个多维诊断框架，通过定义四轴难度分类法并将其整合到企业RAG基准中，以诊断系统的潜在弱点。

摘要 (Abstract)

Performance evaluation of Retrieval-Augmented Generation (RAG) systems within enterprise environments is governed by multi-dimensional and composite factors extending far beyond simple final accuracy checks. These factors include reasoning complexity, retrieval difficulty, the diverse structure of documents, and stringent requirements for operational explainability. Existing academic benchmarks fail to systematically diagnose these interlocking challenges, resulting in a critical gap where models achieving high performance scores fail to meet the expected reliability in practical deployment. To bridge this discrepancy, this research proposes a multi-dimensional diagnostic framework by defining a four-axis difficulty taxonomy and integrating it into an enterprise RAG benchmark to diagnose potential system weaknesses.

关键词: Retrieval-Augmented Generation, RAG, enterprise benchmark, diagnostic framework, reasoning complexity, retrieval difficulty, explainability, system reliability

7. ✅ Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

作者: Guoling Zhou, Wenpei Han, Fengqin Yang, Li Wang, Yingcong Zhou, Zhiguo Fu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02770v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	5.0/10	5.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对LLM驱动的多智能体系统中智能体不遵守角色规范的问题，提出了一种基于定量角色清晰度的正则化方法，通过轻量级微调显著提高了角色一致性和任务成功率。

摘要翻译

在大语言模型（LLM）驱动的多智能体系统中，角色规范违背（即未能遵循所分配角色的既定职责与约束，可能导致智能体行为类似于其他角色）是一种主要的故障模式。为应对此问题，本文提出一种定量的角色清晰度指标以提升角色一致性。首先，我们构建一个角色分配矩阵 $S(φ)=[s_{ij}(φ)]$，其中 $s_{ij}(φ)$ 表示第 $i$ 个智能体的行为轨迹与第 $j$ 个智能体的角色描述之间的语义相似度。随后，我们将角色清晰度矩阵 $M(φ)$ 定义为 $\text{softmax}(S(φ))-I$，其中 $\text{softmax}(S(φ))$ 是 $S(φ)$ 按行计算的 softmax 结果，$I$ 为单位矩阵。$M(φ)$ 的 Frobenius 范数可量化智能体角色描述与其行为轨迹之间的对齐程度。此外，我们在轻量级微调过程中将角色清晰度矩阵作为正则化项，以提升角色一致性，从而改善端到端任务性能。在 ChatDev 多智能体系统上的实验表明，我们的方法显著提升了角色一致性与任务性能：使用 Qwen 和 Llama 模型时，角色越界率分别从 $46.4%$ 降至 $8.4%$ 和从 $43.4%$ 降至 $0.2%$，角色清晰度得分分别从 $0.5328$ 提升至 $0.9097$ 和从 $0.5007$ 提升至 $0.8530$，任务成功率分别从 $0.6769$ 提升至 $0.6909$ 和从 $0.6174$ 提升至 $0.6763$。

摘要 (Abstract)

In large language model (LLM)-driven multi-agent systems, disobey role specification (failure to adhere to the defined responsibilities and constraints of an assigned role, potentially leading to an agent behaving like another) is a major failure mode \cite{DBLP:journals/corr/abs-2503-13657}. To address this issue, in the present paper, we propose a quantitative role clarity to improve role consistency. Firstly, we construct a role assignment matrix $S(φ)=[s_{ij}(φ)]$, where $s_{ij}(φ)$ is the semantic similarity between the $i$-th agent’s behavior trajectory and the $j$-th agent’s role description. Then we define role clarity matrix $M(φ)$ as $\text{softmax}(S(φ))-I$, where $\text{softmax}(S(φ))$ is a row-wise softmax of $S(φ)$ and $I$ is the identity matrix. The Frobenius norm of $M(φ)$ quantifies the alignment between agents’ role descriptions and their behaviors trajectory. Moreover, we employ the role clarity matrix as a regularizer during lightweight fine-tuning to improve role consistency, thereby improving end-to-end task performance. Experiments on the ChatDev multi-agent system show that our method substantially improves role consistency and task performance: with Qwen and Llama, the role overstepping rate decreases from $46.4%$ to $8.4%$ and from $43.4%$ to $0.2%$, respectively, and the role clarity score increases from $0.5328$ to $0.9097$ and from $0.5007$ to $0.8530$, respectively, the task success rate increases from $0.6769$ to $0.6909$ and from $0.6174$ to $0.6763$, respectively.

关键词: Large Language Models, Multi-agent Systems, Role Consistency, Role Clarity, Lightweight Fine-tuning, Agent Collaboration, Semantic Similarity, Task Performance

8. ✅ Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文研究了针对LLM编码代理技能生态系统的供应链攻击，提出了一种通过文档嵌入恶意负载的方法（DDIPE），在多个框架和模型中实现了11.6%至33.5%的绕过率，并发现了现有防御措施的漏洞。

摘要翻译

基于大语言模型的编码智能体通过开放市场中分发的第三方智能体技能扩展其能力，这些技能无需经过强制性安全审查。与传统软件包不同，这些技能以具备系统级权限的操作指令形式执行，因此单个恶意技能即可危害宿主系统。尽管现有防护机制存在，先前研究尚未探讨供应链攻击是否能够直接劫持智能体的操作空间（例如文件写入、shell命令和网络请求）。我们提出文档驱动的隐式载荷执行（Document-Driven Implicit Payload Execution, DDIPE）攻击方法，将恶意逻辑嵌入技能文档内的代码示例和配置模板中。由于智能体在常规任务中会复用这些示例，载荷得以在无需显式指令触发的情况下执行。通过采用大语言模型驱动的流程，我们从81个种子技能中生成覆盖15个MITRE ATTACK类别的1,070个对抗性技能。在四个框架和五个模型的测试中，DDIPE实现了11.6%至33.5%的绕过率，而显式指令攻击在强防御机制下的成功率则为0%。静态分析可检测大多数案例，但仍有2.5%的案例能同时规避检测和对齐机制。通过负责任披露，我们已确认四个相关漏洞并推动完成两项修复。

摘要 (Abstract)

LLM-based coding agents extend their capabilities via third-party agent skills distributed through open marketplaces without mandatory security review. Unlike traditional packages, these skills are executed as operational directives with system-level privileges, so a single malicious skill can compromise the host. Prior work has not examined whether supply-chain attacks can directly hijack an agent’s action space, such as file writes, shell commands, and network requests, despite existing safeguards. We introduce Document-Driven Implicit Payload Execution (DDIPE), which embeds malicious logic in code examples and configuration templates within skill documentation. Because agents reuse these examples during normal tasks, the payload executes without explicit prompts. Using an LLM-driven pipeline, we generate 1,070 adversarial skills from 81 seeds across 15 MITRE ATTACK categories. Across four frameworks and five models, DDIPE achieves 11.6% to 33.5% bypass rates, while explicit instruction attacks achieve 0% under strong defenses. Static analysis detects most cases, but 2.5% evade both detection and alignment. Responsible disclosure led to four confirmed vulnerabilities and two fixes.

关键词: LLM-based coding agents, supply-chain attacks, agent skills, Document-Driven Implicit Payload Execution, security vulnerabilities, adversarial skills, bypass rates, static analysis

9. ✅ Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

评分: 30.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在复杂开放域指令跟随任务中，基于准则的强化学习方法存在的奖励稀疏性和模糊性问题，提出了一种名为Rubrics to Tokens（RTT）的新框架，通过引入令牌级相关性判别器和RTT-GRPO优化方法，将粗粒度响应级评分与细粒度令牌级信用分配相结合，实验表明RTT在指令和准则级别准确性上均优于基线方法。

摘要翻译

基于量规的强化学习已成为使大语言模型与复杂开放域指令跟随任务对齐的有效方法。然而，现有方法主要依赖响应级奖励，导致严重的奖励稀疏性与奖励模糊性问题。为解决这些问题，我们提出“量规到令牌”方法，这是一种新颖的基于量规的强化学习框架，旨在连接粗粒度的响应级评分与细粒度的令牌级信用分配。该方法引入了一个令牌级相关性判别器，用于预测响应中哪些令牌应对特定约束负责，并通过RTT-GRPO算法优化策略模型，该算法将响应级与令牌级优势集成在统一框架内。此外，当基于令牌级量规的强化学习从一维结果级奖励转向三维奖励空间时，我们提出了一种新颖的组归一化方法——样本内令牌组归一化，以适应这一转变。大量实验与基准测试表明，在不同模型上，该方法在指令级与量规级准确率方面均持续优于其他基线模型。

摘要 (Abstract)

Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, when transitioning from one-dimensional, outcome-level reward to three-dimensional reward space in the token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.

关键词: Rubric-based Reinforcement Learning, Large Language Models, Instruction Following, Reward Sparsity, Token-level Credit Assignment, RTT-GRPO, Alignment, Benchmark Evaluation

10. ❌ Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

作者: Zhihao Chen, Ying Zhang, Yi Liu, Gelei Deng, Yuekang Li, Yanjun Zhang, Jianting Ning, Leo Yu Zhang, Lei Ma, Zhiqiang Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03070v1

评分: 25.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文研究LLM Agent技能中的凭证泄漏问题，属于LLM Agent安全研究领域。核心相关关键词是"Large Language Models"和"LLM Agents”，论文直接研究LLM Agent技能的安全漏洞，因此给10分。“Tool Use"相关关键词得5分，因为技能扩展涉及工具使用，但论文重点在安全而非工具使用机制。其他关键词如MoE、Scaling Laws、Fine-tuning、RAG、Reasoning、Compression等均未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文首次通过大规模实证研究揭示了LLM Agent第三方技能中存在的凭证泄漏风险，分析了17,022个技能，识别出520个易受攻击技能和10种泄漏模式，发现76.3%的泄漏需要代码和自然语言联合分析，并验证了泄漏凭证的可用性和持久性。

摘要翻译

第三方技能为大型语言模型（LLM）智能体扩展了强大能力，但常在特权环境中处理敏感凭证，其泄露风险尚未被充分认知。我们首次对此问题展开大规模实证研究，通过静态分析、沙箱测试和人工审查，分析了来自SkillsMP平台170,226项技能中抽样的17,022项。我们识别出520项存在漏洞的技能，共发现1,708个问题，并归纳出10类泄露模式（4类意外泄露与6类对抗性泄露）。研究发现：（1）泄露本质上是跨模态的：76.3%的案例需结合代码与自然语言进行联合分析，而3.1%完全源于提示词注入；（2）调试日志是主要泄露途径，因标准输出（stdout）暴露给LLM，print与console.log导致的泄露占比达73.5%；（3）泄露的凭证既具备可利用性（89.6%无需特权即可利用），又具有持续性——即使上游修复后，代码分支仍保留密钥。经披露后，所有恶意技能均被下架，91.6%的硬编码凭证得到修复。我们公开了数据集、分类体系与检测工具链，以支持后续研究。

摘要 (Abstract)

Third-party skills extend LLM agents with powerful capabilities but often handle sensitive credentials in privileged environments, making leakage risks poorly understood. We present the first large-scale empirical study of this problem, analyzing 17,022 skills (sampled from 170,226 on SkillsMP) using static analysis, sandbox testing, and manual inspection. We identify 520 vulnerable skills with 1,708 issues and derive a taxonomy of 10 leakage patterns (4 accidental and 6 adversarial). We find that (1) leakage is fundamentally cross-modal: 76.3% require joint analysis of code and natural language, while 3.1% arise purely from prompt injection; (2) debug logging is the primary vector, with print and console.log causing 73.5% of leaks due to stdout exposure to LLMs; and (3) leaked credentials are both exploitable (89.6% without privileges) and persistent, as forks retain secrets even after upstream fixes. After disclosure, all malicious skills were removed and 91.6% of hardcoded credentials were fixed. We release our dataset, taxonomy, and detection pipeline to support future research.

关键词: LLM Agents, Credential Leakage, Third-party Skills, Empirical Study, Security Vulnerability, Prompt Injection, Static Analysis, Sandbox Testing

11. ❌ Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

作者: Zhiyuan Li, Jingzheng Wu, Xiang Ling, Xing Cui, Tianyue Luo 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02837v1

评分: 25.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究Agent Skills框架的安全分析，该框架是用于LLM-based agents的模块化打包标准。因此，与"Large Language Models"和"LLM Agents"高度相关（10分），因为论文明确研究基于LLM的代理框架。与"Tool Use"有一定关联（5分），因为Agent Skills涉及代理获取领域特定能力，可视为一种工具使用形式。其他关键词如MoE、SFT、RAG等涉及具体模型架构或训练技术，论文未涉及，故为0分。

!!! tip deepseek-chat TL;DR

该论文首次对新兴的Agent Skills框架（一种用于LLM代理的模块化打包标准）进行了全面的安全分析，识别了其生命周期各阶段的结构性攻击面，构建了威胁分类法，并通过实际安全事件验证，揭示了最严重的威胁源于框架本身的结构特性。

摘要翻译

Agent Skills 是一种新兴的开放标准，它定义了一种基于文件系统的模块化封装格式，使基于大语言模型（LLM）的智能体能够按需获取领域专业知识。尽管该标准已在多个智能体平台中迅速获得采纳，并催生了大型社区市场，但 Agent Skills 的安全特性尚未得到系统性研究。本文首次对 Agent Skills 框架进行了全面的安全分析。我们定义了 Agent Skill 在四个阶段——创建（Creation）、分发（Distribution）、部署（Deployment）和执行（Execution）——的完整生命周期，并识别了每个阶段引入的结构性攻击面。基于此生命周期分析，我们构建了一个威胁分类体系，该体系包含七大类共十七种攻击场景，并组织在三个攻击层面中，其依据既包括架构分析，也包含现实世界证据。我们通过分析 Agent Skills 生态系统中五个已确认的安全事件验证了该分类体系。基于这些发现，我们针对每类威胁讨论了防御方向，指出了开放的研究挑战，并为相关利益方提供了可操作的建议。我们的分析表明，最严重的威胁源于框架本身的结构特性，包括缺乏数据与指令的边界、单一审批的持久信任模型以及强制性市场安全审查的缺失，这些问题无法仅通过渐进式缓解措施来解决。

摘要 (Abstract)

Agent Skills is an emerging open standard that defines a modular, filesystem-based packaging format enabling LLM-based agents to acquire domain-specific expertise on demand. Despite rapid adoption across multiple agentic platforms and the emergence of large community marketplaces, the security properties of Agent Skills have not been systematically studied. This paper presents the first comprehensive security analysis of the Agent Skills framework. We define the full lifecycle of an Agent Skill across four phases – Creation, Distribution, Deployment, and Execution – and identify the structural attack surface each phase introduces. Building on this lifecycle analysis, we construct a threat taxonomy comprising seven categories and seventeen scenarios organized across three attack layers, grounded in both architectural analysis and real-world evidence. We validate the taxonomy through analysis of five confirmed security incidents in the Agent Skills ecosystem. Based on these findings, we discuss defense directions for each threat category, identify open research challenges, and provide actionable recommendations for stakeholders. Our analysis reveals that the most severe threats arise from structural properties of the framework itself, including the absence of a data-instruction boundary, a single-approval persistent trust model, and the lack of mandatory marketplace security review, and cannot be addressed through incremental mitigations alone.

关键词: Agent Skills, LLM-based agents, security analysis, threat taxonomy, attack surface, agentic platforms, modular packaging, security incidents

12. ❌ LLM-based Atomic Propositions help weak extractors: Evaluation of a Propositioner for triplet extraction

作者: Luc Pommeret, Thomas Gerald, Patrick Paroubek, Sahar Ghannay, Christophe Servan, Sophie Rosset 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02866v1

评分: 20.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文核心研究LLM（Qwen3-32B和Qwen3-0.6B）在知识图谱构建中的应用，通过知识蒸馏训练小型多语言模型（MPropositionneur-V2）来生成原子命题以改进三元组抽取。因此，与"Large Language Models"和"Small Language Models"高度相关（10分），因为论文明确使用并比较了大型和小型语言模型。其他关键词如MoE、Scaling Laws、Pre-training、Alignment等未在摘要中提及或涉及，因此评分为0分。论文属于大模型在信息抽取领域的应用研究，符合研究背景要求。

!!! tip deepseek-chat TL;DR

该论文研究了通过将文本分解为原子命题来改进知识图谱三元组抽取的方法，并证明这种方法能有效提升较弱抽取器的性能，同时为强LLM提供互补策略。

摘要翻译

从自然语言构建知识图谱需要从复杂且信息密集的句子中提取结构化三元组。本文探讨了将文本分解为原子命题（最小、语义自洽的信息单元）是否能改进三元组提取。我们介绍了 MPropositionneur-V2，这是一个覆盖六种欧洲语言的小型多语言模型，它通过知识蒸馏从 Qwen3-32B 训练至 Qwen3-0.6B 架构，并评估了其与两种提取范式的集成：以实体为中心的方法（GLiREL）和生成式方法（Qwen3）。在 SMiLER、FewRel、DocRED 和 CaRB 数据集上的实验表明，原子命题有助于提升较弱提取器（GLiREL、CoreNLP、0.6B 模型）的性能，改善了关系召回率，并在多语言环境下提升了整体准确率。对于更强的 LLMs，一种后备组合策略能够弥补实体召回率的损失，同时保持关系提取方面的增益。这些结果表明，原子命题是一种可解释的中间数据结构，能够补充而非替代现有提取器。

摘要 (Abstract)

Knowledge Graph construction from natural language requires extracting structured triplets from complex, information-dense sentences. In this paper, we investigate if the decomposition of text into atomic propositions (minimal, semantically autonomous units of information) can improve the triplet extraction. We introduce MPropositionneur-V2, a small multilingual model covering six European languages trained by knowledge distillation from Qwen3-32B into a Qwen3-0.6B architecture, and we evaluate its integration into two extraction paradigms: entity-centric (GLiREL) and generative (Qwen3). Experiments on SMiLER, FewRel, DocRED and CaRB show that atomic propositions benefit weaker extractors (GLiREL, CoreNLP, 0.6B models), improving relation recall and, in the multilingual setting, overall accuracy. For stronger LLMs, a fallback combination strategy recovers entity recall losses while preserving the gains in relation extraction. These results show that atomic propositions are an interpretable intermediate data structure that complements extractors without replacing them.

关键词: Knowledge Graph construction, triplet extraction, atomic propositions, multilingual model, knowledge distillation, Qwen3, entity-centric extraction, generative extraction

13. ❌ PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

作者: Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03180v1

评分: 18.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	8.0/10	8.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文PRISM提出了一种结合LLM指导的语义聚类框架，核心是利用LLM提供稀疏标签来微调句子编码模型，属于大模型在文本分析领域的应用创新。与关键词高度相关的包括：1) “Large Language Models” (10分)：论文核心使用LLM作为教师模型提供监督信号；2) “Post-training” (8分)：涉及使用LLM标签对句子编码模型进行微调。其他关键词如MoE、SLMs、Scaling Laws、Instruction Tuning等与论文主题（主题建模、聚类）无直接关联，因此评分为0。

!!! tip deepseek-chat TL;DR

该论文提出了PRISM框架，通过利用大语言模型（LLM）提供的稀疏标签来微调句子编码模型并进行阈值聚类，从而在多个语料库上实现了比现有主题模型更好的主题分离性，同时只需少量LLM查询即可训练。

摘要翻译

本文提出了一种精确信息语义建模框架，该结构化主题建模框架融合了大型语言模型所捕获的丰富表征优势与潜在语义聚类方法的低成本、高可解释性特点。PRISM通过从目标语料库中采样，利用少量由大型语言模型生成的标注样本对句子编码模型进行微调。我们通过阈值聚类对该嵌入空间进行划分，从而在特定狭窄领域内分离出紧密相关的主题簇。在多个语料库上的实验表明，PRISM在主题分离性上超越了当前最先进的局部主题模型，甚至优于基于前沿大型嵌入模型的聚类方法，同时仅需少量大型语言模型查询即可完成训练。本研究通过以下三个方面推动了多个研究方向的发展：提出了一种师生蒸馏管道，将稀疏的大型语言模型监督知识提炼为轻量级的主题发现模型；分析了采样策略对改善局部几何结构以提升聚类分离性的效能；提供了一种有效的网络规模文本分析方法，使研究者和实践者能够通过可解释、可本地部署的框架在线追踪细微观点与子话题演变。

摘要 (Abstract)

In this paper, we propose Precision-Informed Semantic Modeling (PRISM), a structured topic modeling framework combining the benefits of rich representations captured by LLMs with the low cost and interpretability of latent semantic clustering methods. PRISM fine-tunes a sentence encoding model using a sparse set of LLM- provided labels on samples drawn from some corpus of interest. We segment this embedding space with thresholded clustering, yielding clusters that separate closely related topics within some narrow domain. Across multiple corpora, PRISM improves topic separability over state-of-the-art local topic models and even over clustering on large, frontier embedding models while requiring only a small number of LLM queries to train. This work contributes to several research streams by providing (i) a student-teacher pipeline to distill sparse LLM supervision into a lightweight model for topic discovery; (ii) an analysis of the efficacy of sampling strategies to improve local geometry for cluster separability; and (iii) an effective approach for web-scale text analysis, enabling researchers and practitioners to track nuanced claims and subtopics online with an interpretable, locally deployable framework.

关键词: PRISM, LLM-guided, semantic clustering, topic modeling, fine-tuning, sentence encoding, sparse supervision, interpretable framework

14. ❌ Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism

作者: Haowen Wan, Qianqian Yang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02691v1

评分: 10.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	10.0/10	10.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 该论文的核心创新点是提出了一种基于自适应MoE（Mixture of Experts）机制的无线图像语义通信系统。论文明确使用了MoE架构（“MoE-based semantic communication”），并提出了动态专家门控机制来联合评估信道状态和图像语义内容，实现自适应路由。因此，与"Mixture of Experts” OR “MoE” OR “Sparse Models"高度相关（10分）。论文属于深度学习在通信领域的应用，但未涉及大语言模型（LLMs）、科学AI应用（如生物信息学）或其他关键词所涵盖的具体技术（如RLHF、RAG、量化等），因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文针对无线图像传输中现有语义通信方案缺乏对多样化图像内容和动态信道条件适应性的问题，提出了一种基于自适应MoE Swin Transformer块的多阶段端到端系统，通过联合评估实时信道状态和图像语义内容的动态专家门控机制，显著提高了重建质量并保持了传输效率。

摘要翻译

基于深度学习的语义通信在无线图像传输领域取得了显著进展，但现有方案大多依赖固定模型，因而缺乏对多样化图像内容与动态信道条件的鲁棒性。为提升适应性，近期研究开发了可根据信源内容或信道状态调整传输策略或模型行为的自适应语义通信方法。最近，基于混合专家模型（MoE）的语义通信作为一种稀疏高效的自适应架构崭露头角，然而现有设计仍主要依赖单驱动路由机制。为突破这一局限，本文提出一种面向多输入多输出（MIMO）信道的新型多阶段端到端图像语义通信系统，其核心构建于自适应MoE Swin Transformer模块之上。具体而言，我们引入一种动态专家门控机制，该机制联合评估实时信道状态信息（CSI）与输入图像块的语义内容，以计算自适应路由概率。通过基于此联合条件选择性地仅激活特定专家子集，本方法打破了传统自适应方法的刚性耦合，克服了单驱动路由的瓶颈。仿真结果表明，在保持传输效率的同时，本系统在重建质量上较现有方法有显著提升。

摘要 (Abstract)

Deep learning based semantic communication has achieved significant progress in wireless image transmission, but most existing schemes rely on fixed models and thus lack robustness to diverse image contents and dynamic channel conditions. To improve adaptability, recent studies have developed adaptive semantic communication strategies that adjust transmission or model behavior according to either source content or channel state. More recently, MoE-based semantic communication has emerged as a sparse and efficient adaptive architecture, although existing designs still mainly rely on single-driven routing. To address this limitation, we propose a novel multi-stage end-to-end image semantic communication system for multi-input multi-output (MIMO) channels, built upon an adaptive MoE Swin Transformer block. Specifically, we introduce a dynamic expert gating mechanism that jointly evaluates both real-time CSI and the semantic content of input image patches to compute adaptive routing probabilities. By selectively activating only a specialized subset of experts based on this joint condition, our approach breaks the rigid coupling of traditional adaptive methods and overcomes the bottlenecks of single-driven routing. Simulation results indicate a significant improvement in reconstruction quality over existing methods while maintaining the transmission efficiency.

关键词: semantic communication, wireless image transmission, Mixture of Experts (MoE), adaptive routing, Swin Transformer, MIMO channels, dynamic expert gating, reconstruction quality

15. ❌ Speaking of Language: Reflections on Metalanguage Research in NLP

作者: Nathan Schneider, Antonios Anastasopoulos 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02645v1

评分: 5.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

评分理由: 论文聚焦于元语言（metalanguage）研究，仅与LLMs有一般性关联（摘要中提到’link it to NLP and LLMs’），但未深入探讨任何具体的大模型技术、应用或创新方法。其他关键词均未涉及。

!!! tip deepseek-chat TL;DR

该论文探讨了元语言在NLP和LLMs中的定义、任务维度及未来研究方向，但未提出具体的技术创新或应用。

摘要翻译

本研究旨在聚焦元语言这一主题。我们首先界定了元语言的定义，将其与自然语言处理及大语言模型建立关联，继而阐述我们两个实验室围绕元语言开展的核心研究工作。最后，我们从四个维度探讨元语言及元语言任务，并提出一系列尚未充分探索的未来研究方向。

摘要 (Abstract)

This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs’ metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.

关键词: metalanguage, NLP, LLMs, metalinguistic tasks, future research directions, language analysis, linguistic reflection

16. ❌ THOM: Generating Physically Plausible Hand-Object Meshes From Text

作者: Uyoung Jeong, Yihalem Yimolal Tiruneh, Hyung Jin Chang, Seungryul Baek, Kwang In Kim 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02736v1

评分: 5.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

评分理由: 论文THOM专注于从文本生成物理上合理的手-物体交互3D网格，属于计算机视觉和图形学领域。其核心方法涉及高斯生成、网格提取、物理优化和VLM引导的细化，并未直接涉及大模型技术原理、训练方法、推理优化、对齐、代理系统等关键词。唯一的相关点是"AI for Science”，因为该研究可视为AI在科学/工程应用（机器人、VR/AR）中的一种应用，但并非生物信息学或化学信息学等典型科学领域，因此给予5分（有一定关联）。其他关键词与论文内容无直接关联，均评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一个无需训练的框架THOM，用于从文本生成具有高视觉真实感和物理合理性的3D手-物体交互网格，通过两阶段流程（高斯生成和物理优化）以及新的网格提取方法，在文本对齐、视觉真实性和交互合理性方面超越了现有方法。

摘要翻译

从文本生成三维手物交互对于灵巧机器人抓取与VR/AR内容生成至关重要，其既要求高视觉真实感，又需满足物理合理性。然而，从文本生成的高斯模型中提取网格这一不适定问题，以及在误差网格上进行基于物理的优化均构成挑战。为解决这些问题，我们提出THOM——一个无需训练、无需模板物体网格即可生成高真实感且物理合理的三维手物交互网格的框架。THOM采用两阶段流程：首先生成手部与物体的高斯模型，随后进行基于物理的手物交互优化。我们提出的新网格提取方法与顶点-高斯映射技术，显式地将高斯元素分配给网格顶点，从而实现拓扑感知的正则化。此外，通过视觉语言模型引导的位移优化与接触感知优化，我们进一步提升了交互的物理合理性。综合实验表明，THOM在文本对齐度、视觉真实感与交互合理性方面均持续超越现有先进方法。

摘要 (Abstract)

The generation of 3D hand-object interactions (HOIs) from text is crucial for dexterous robotic grasping and VR/AR content generation, requiring both high visual fidelity and physical plausibility. Nevertheless, the ill-posed problem of mesh extraction from text-generated Gaussians, and physics-based optimization on the erroneous meshes pose challenges. To address these issues, we introduce THOM, a training-free framework that generates photorealistic, physically plausible 3D HOI meshes without the need for a template object mesh. THOM employs a two-stage pipeline, initially generating the hand and object Gaussians, followed by physics-based HOI optimization. Our new mesh extraction method and vertex-to-Gaussian mapping explicitly assign Gaussian elements to mesh vertices, allowing topology-aware regularization. Furthermore, we improve the physical plausibility of interactions by VLM-guided translation refinement and contact-aware optimization. Comprehensive experiments demonstrate that THOM consistently surpasses state-of-the-art methods in terms of text alignment, visual realism, and interaction plausibility.

关键词: 3D hand-object interaction, text-to-3D generation, physically plausible meshes, Gaussian splatting, physics-based optimization, training-free framework, mesh extraction, contact-aware optimization

17. ❌ A semicontinuous relaxation of Saito’s criterion and freeness as angular minimization

作者: Tomás S. R. Silva 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02995v1

评分: 5.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

评分理由: 该论文研究代数几何中线排列的自由性问题，提出了一种半连续松弛方法，并使用强化学习进行序列构造。论文内容与绝大多数关键词（涉及大模型技术、训练方法、推理优化、对齐、代理等）完全无关。唯一略有相关的是"AI for Science"，因为论文将强化学习应用于数学问题的计算探索，属于AI在科学领域的应用，但并非核心生物信息学或化学信息学，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于半连续松弛的几何方法来度量线排列与自由性之间的距离，并开发了使用强化学习的序列构造程序来最小化这种距离，从而为线排列自由性的计算探索提供了新途径。

摘要翻译

我们在 $\mathbb{P}^2$ 中的直线构型空间上引入一个非负泛函，该泛函恰好在对自由构型时为零，它是通过佐藤关于自由性的判据进行半连续松弛而得到的。给定一个包含 $n$ 条直线且候选指数为 $(d_1, d_2)$ 的构型 $\mathcal{A}$，我们通过相关导子矩阵的零空间参数化次数为 $d_1$ 和 $d_2$ 的对数导子空间，并将佐藤行列式表达为一个到 $n$ 次多项式空间的双线性映射。该泛函由此获得一个自然的几何解释：它度量了此双线性映射的像与系数空间中由定义多项式 $Q(\mathcal{A})$ 张成的方向之间夹角的正弦平方，且当且仅当其像包含 $Q(\mathcal{A})$ 张成的直线时取零值。这为衡量给定构型距离具有预期次数的自由对数导子基有多远提供了一个可计算的度量。
利用该泛函作为奖励信号，我们开发了一种顺序构造过程：在其中逐条添加直线，以最小化与自由性之间的角距离；该过程通过强化学习实现，并在构型规模和指数类型上采用自适应课程学习。
我们的结果表明，基于多项式系数空间几何的半连续松弛技术，为直线构型理论中自由性的计算探索提供了一条可行途径。

摘要 (Abstract)

We introduce a nonnegative functional on the space of line arrangements in $\mathbb{P}^2$ that vanishes precisely on free arrangements, obtained as a semicontinuous relaxation of Saito’s criterion for freeness. Given an arrangement $\mathcal{A}$ of $n$ lines with candidate exponents $(d_1, d_2)$, we parameterize the spaces of logarithmic derivations of degrees $d_1$ and $d_2$ via the null spaces of the associated derivation matrices and express the Saito determinant as a bilinear map into the space of degree $n$ polynomials. The functional then admits a natural geometric interpretation: it measures the squared sine of the angle between the image of this bilinear map and the direction of the defining polynomial $Q(\mathcal{A})$ in coefficient space, and equals zero if and only if its image contains the line spanned by $Q(\mathcal{A})$. This provides a computable measure of how far a given arrangement is from admitting a free basis of logarithmic derivations of the expected degrees. Using this functional as a reward signal, we develop a sequential construction procedure in which lines are added one at a time so as to minimize the angular distance to freeness, implemented via reinforcement learning with an adaptive curriculum over arrangement sizes and exponent types. Our results suggest that semicontinuous relaxation techniques, grounded in the geometry of polynomial coefficient spaces, offer a viable approach to the computational exploration of freeness in the theory of line arrangements.

关键词: line arrangements, freeness, Saito’s criterion, semicontinuous relaxation, logarithmic derivations, reinforcement learning, computational exploration, geometric interpretation

18. ❌ Enhancing Robustness of Federated Learning via Server Learning

作者: Van Sy Mai, Kushal Chakrabarti, Richard J. La, Dipankar Maity 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03226v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于联邦学习的鲁棒性增强技术，通过服务器学习和客户端更新过滤结合几何中位数聚合来防御恶意攻击。论文内容完全围绕联邦学习的安全性和分布式机器学习展开，未涉及任何大语言模型、深度学习技术原理、科学AI应用或评分关键词中列出的具体技术（如MoE、RLHF、RAG、量化等）。所有关键词均与大模型和深度学习技术相关，而本文研究的是传统的联邦学习框架，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文研究如何通过服务器学习和客户端更新过滤结合几何中位数聚合来增强联邦学习在非独立同分布数据和大量恶意客户端攻击下的鲁棒性，实验表明该方法能在恶意客户端比例超过50%时显著提升模型准确性。

摘要翻译

本文探讨了在客户端训练数据非独立同分布的情况下，利用服务器学习增强联邦学习对抗恶意攻击的鲁棒性。我们提出了一种启发式算法，该算法结合服务器学习与客户端更新过滤，并采用几何中位数聚合方法。实验表明，即使恶意客户端比例较高（某些情况下超过$50%$），且服务器使用的数据集规模较小（可为合成数据，其分布未必接近客户端聚合数据的分布），该方法仍能显著提升模型精度。

摘要 (Abstract)

This paper explores the use of server learning for enhancing the robustness of federated learning against malicious attacks even when clients’ training data are not independent and identically distributed. We propose a heuristic algorithm that uses server learning and client update filtering in combination with geometric median aggregation. We demonstrate via experiments that this approach can achieve significant improvement in model accuracy even when the fraction of malicious clients is high, even more than $50%$ in some cases, and the dataset utilized by the server is small and could be synthetic with its distribution not necessarily close to that of the clients’ aggregated data.

关键词: Federated Learning, Robustness, Malicious Attacks, Server Learning, Geometric Median Aggregation, Non-IID Data, Client Update Filtering, Model Accuracy

19. ❌ PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

作者: Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal, Suzanne P. M. de Vette, Hendrike Neh, Baoqiang Ma, Peter M. A. van Ooijen, Lisanne V. van Dijk 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03203v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 该论文介绍了一个用于医学3D图像分类和预测模型开发的模块化AI框架PR3DICTR，主要涉及深度学习在医学影像分析中的应用。论文内容与绝大多数关键词（主要关于大语言模型技术原理、训练方法、推理优化、智能体等）完全无关，因为这些关键词针对的是自然语言处理领域的大模型技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在生物医学（医学影像分析）领域的应用，属于’AI for Science’范畴，但并非核心创新技术研究，因此给予8分（有一定关联，但非核心）。

!!! tip deepseek-chat TL;DR

该论文提出了一个名为PR3DICTR的模块化AI框架，用于简化基于3D医学图像的检测和结果预测模型的开发，通过标准化和灵活设计降低开发负担。

摘要翻译

三维医学影像数据与计算机辅助决策（尤其是基于深度学习的方法）在医学领域正变得日益重要。为促进相关研究发展，我们推出PR3DICTR：三维影像分类与标准化训练研究平台。该平台基于学界通用框架（PyTorch与MONAI）构建，为预测模型开发提供开放访问、灵活便捷的框架，特别专注于三维医学影像数据的分类任务。通过模块化设计原则与标准化流程的结合，本平台旨在减轻开发负担的同时保持可调整性。它为用户提供了丰富的预设功能，例如模型架构设计选项、超参数解决方案及训练方法学，同时仍允许用户自由“接入”自定义解决方案或模块。PR3DICTR可应用于任何基于事件或二分类的三维影像分类任务，且仅需两行代码即可运行。

摘要 (Abstract)

Three-dimensional medical image data and computer-aided decision making, particularly using deep learning, are becoming increasingly important in the medical field. To aid in these developments we introduce PR3DICTR: Platform for Research in 3D Image Classification and sTandardised tRaining. Built using community-standard distributions (PyTorch and MONAI), PR3DICTR provides an open-access, flexible and convenient framework for prediction model development, with an explicit focus on classification using three-dimensional medical image data. By combining modular design principles and standardization, it aims to alleviate developmental burden whilst retaining adjustability. It provides users with a wealth of pre-established functionality, for instance in model architecture design options, hyper-parameter solutions and training methodologies, but still gives users the opportunity and freedom to ``plug in’’ their own solutions or modules. PR3DICTR can be applied to any binary or event-based three-dimensional classification task and can work with as little as two lines of code.

关键词: medical 3D image, deep learning, classification, prediction model, modular framework, standardized training, computer-aided decision, PyTorch MONAI

20. ❌ Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT – Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

作者: Maximiliano Armesto, Christophe Kolb 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03201v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是智能体AI（Agentic AI）的控制、记忆和验证问题，从松鼠生态学中获得灵感，提出一个包含潜在动态、结构化情景记忆、观察者信念状态、选项级动作和延迟验证信号的分层部分可观测控制模型。虽然论文涉及AI智能体，但所有关键词都明确针对大语言模型（LLM）及其相关技术（如MoE、SFT、RLHF、RAG、量化等），而本文完全不涉及LLM技术，没有讨论任何语言模型、预训练、微调、对齐、推理加速或特定LLM应用。论文的核心是控制理论、记忆架构和验证机制，而非LLM技术。因此，所有关键词的相关度均为0。

!!! tip deepseek-chat TL;DR

本文从松鼠生态学中获得灵感，研究智能体AI在部分可观测环境下的控制、记忆和验证的耦合问题，提出了一个分层部分可观测控制模型，并假设这种耦合能提高鲁棒性、记忆检索效率和减少静默失败。

摘要翻译

对智能体人工智能的评价正日益超越其流畅的输出能力，转而关注其在部分可观测性、延迟和策略性观察条件下能否有效行动、记忆与验证。现有研究往往孤立探讨这些需求：机器人学侧重控制，检索系统强调记忆，对齐或保障工作则聚焦核查与监督。本文提出，松鼠生态学为此提供了一个鲜明的比较案例，因为树冠运动、分散贮食及受众敏感性贮藏行为将这三类需求耦合于单一生物体内。我们综合了狐狸松鼠、东部灰松鼠及一项野外对比中的红松鼠证据，并构建了明确的推理阶梯：实证观察、最小化计算推断与人工智能设计猜想。我们提出一个具有潜在动态、结构化情景记忆、观察者信念状态、选项级行动及延迟验证信号的最小化分层部分可观测控制模型。由此推导出三个假设：（H1）快速局部反馈结合预测性补偿能提升隐藏动态变化下的鲁棒性；（H2）为未来控制而组织的记忆结构可在线索冲突与负荷下改善延迟检索；（H3）行动-记忆循环内嵌的验证器与观察者模型能减少静默故障与信息泄露，但仍存在模型设定错误的脆弱性。进一步推论是：角色分化的提议者/执行者/核查者/对抗者系统或能在信息不对称与验证负担下降低关联错误。本文贡献在于提出一种比较视角与基准研究议程：建立一套关于控制、记忆与可验证行动耦合关系的可证伪主张的规范研究体系。

摘要 (Abstract)

Agentic AI is increasingly judged not by fluent output alone but by whether it can act, remember, and verify under partial observability, delay, and strategic observation. Existing research often studies these demands separately: robotics emphasizes control, retrieval systems emphasize memory, and alignment or assurance work emphasizes checking and oversight. This article argues that squirrel ecology offers a sharp comparative case because arboreal locomotion, scatter-hoarding, and audience-sensitive caching couple all three demands in one organism. We synthesize evidence from fox, eastern gray, and, in one field comparison, red squirrels, and impose an explicit inference ladder: empirical observation, minimal computational inference, and AI design conjecture. We introduce a minimal hierarchical partially observed control model with latent dynamics, structured episodic memory, observer-belief state, option-level actions, and delayed verifier signals. This motivates three hypotheses: (H1) fast local feedback plus predictive compensation improves robustness under hidden dynamics shifts; (H2) memory organized for future control improves delayed retrieval under cue conflict and load; and (H3) verifiers and observer models inside the action-memory loop reduce silent failure and information leakage while remaining vulnerable to misspecification. A downstream conjecture is that role-differentiated proposer/executor/checker/adversary systems may reduce correlated error under asymmetric information and verification burden. The contribution is a comparative perspective and benchmark agenda: a disciplined program of falsifiable claims about the coupling of control, memory, and verifiable action.

关键词: Agentic AI, Stochastic Control, Structured Memory, Verifiable Action, Partial Observability, Hierarchical Control, Episodic Memory, Observer Model

21. ❌ Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

作者: Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin, Niloy Farhan, Farig Yousuf Sadeque 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03192v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多教师知识蒸馏在低资源抽象摘要中的应用，提出了EWAD和CPDP方法。与’Large Language Models’相关度5分，因为使用了Qwen2.5等大模型进行实验；与’Post-training’相关度5分，因为知识蒸馏属于模型训练后优化技术。其他关键词如MoE、SLMs、Scaling Laws、RLHF等均未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文研究了可靠性感知的多教师知识蒸馏方法（EWAD和CPDP）在低资源抽象摘要任务中的应用，发现logit级蒸馏提供最可靠的性能提升，而跨语言伪标签蒸馏在3.2倍压缩下能保留71-122%的教师模型性能。

摘要翻译

我们从可靠性感知的视角研究低资源抽象摘要中的多教师知识蒸馏。我们引入了EWAD（熵加权一致性感知蒸馏），这是一种基于教师间一致性在教师蒸馏与黄金标准监督之间进行路由的标记级机制；以及CPDP（容量比例散度保持），一种对学生模型相对于异构教师模型位置的几何约束。通过在两个孟加拉语数据集、13个孟加拉语T5消融实验和八个Qwen2.5实验中的验证，我们发现逻辑值层面的知识蒸馏能带来最稳定的性能提升，而更复杂的蒸馏方法虽能提升短摘要的语义相似度，却会损害长摘要的质量。在十种语言上进行的跨语言伪标签知识蒸馏，在3.2倍压缩率下仍能保留教师模型71-122%的ROUGE L分数。一项经人工验证的多评委大语言模型评估进一步揭示了单评委评估流程中存在的校准偏差。总体而言，我们的结果表明，可靠性感知蒸馏有助于界定多教师监督在何种情况下能改进摘要性能，以及在何种情况下数据扩展的收益会超过损失函数设计的收益。

摘要 (Abstract)

We study multiteacher knowledge distillation for low resource abstractive summarization from a reliability aware perspective. We introduce EWAD (Entropy Weighted Agreement Aware Distillation), a token level mechanism that routes supervision between teacher distillation and gold supervision based on inter teacher agreement, and CPDP (Capacity Proportional Divergence Preservation), a geometric constraint on the student position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross lingual pseudo label KD across ten languages retains 71-122 percent of teacher ROUGE L at 3.2x compression. A human validated multi judge LLM evaluation further reveals calibration bias in single judge pipelines. Overall, our results show that reliability aware distillation helps characterize when multi teacher supervision improves summarization and when data scaling outweighs loss engineering.

关键词: knowledge distillation, abstractive summarization, low resource, multi-teacher, reliability aware, cross-lingual, model compression, LLM evaluation

22. ❌ Gradient Boosting within a Single Attention Layer

作者: Saleh Sargolzaei 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03190v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种新的注意力机制变体（gradient-boosted attention），属于Transformer架构的底层技术创新。它直接改进注意力计算本身，因此与’Large Language Models’有一定关联（因为LLMs基于Transformer），但论文本身不专门研究LLMs，而是关注通用的注意力机制改进。其他关键词主要涉及LLM的训练、对齐、推理、应用等高层主题，与本文的底层架构创新无关。

!!! tip deepseek-chat TL;DR

该论文针对标准Transformer注意力机制无法自我纠错的局限性，提出了一种梯度提升注意力机制，通过在单个注意力层内引入第二个带有门控校正的注意力通道来纠正第一个通道的预测误差，在WikiText-103子集上实现了比标准注意力更低的测试困惑度。

摘要翻译

Transformer注意力机制通过单一softmax加权平均计算值向量——这种单次估计无法修正自身误差。我们提出梯度提升注意力，将梯度提升原理应用于单个注意力层内部：第二个注意力轮次使用独立学习的投影矩阵，关注首轮预测误差并施加门控修正。在平方重构目标下，该结构可映射至弗里德曼的梯度提升机框架，其中每个注意力轮次作为基学习器，逐维度门控机制充当收缩参数。我们证明：单次霍普菲尔德式更新会消除查询向量中与存储模式子空间正交的所有信息；而在局部收缩条件下的进一步迭代，可能导致同一区域内不同查询向量坍缩至相同不动点。研究还表明，修正轮次采用独立投影可恢复塔基“二次修正法”共享投影方案无法获取的残差信息。在WikiText-103的1000万词元子集上，梯度提升注意力取得67.9的测试困惑度，优于标准注意力的72.2、Twicing注意力的69.6以及参数匹配的加宽基线模型（69.0），且两轮迭代即可获得主要性能增益。

摘要 (Abstract)

Transformer attention computes a single softmax-weighted average over values – a one-pass estimate that cannot correct its own errors. We introduce \emph{gradient-boosted attention}, which applies the principle of gradient boosting \emph{within} a single attention layer: a second attention pass, with its own learned projections, attends to the prediction error of the first and applies a gated correction. Under a squared reconstruction objective, the construction maps onto Friedman’s gradient boosting machine, with each attention pass as a base learner and the per-dimension gate as the shrinkage parameter. We show that a single Hopfield-style update erases all query information orthogonal to the stored-pattern subspace, and that further iteration under local contraction can collapse distinct queries in the same region to the same fixed point. We also show that separate projections for the correction pass can recover residual information inaccessible to the shared-projection approach of Tukey’s twicing. On a 10M-token subset of WikiText-103, gradient-boosted attention achieves a test perplexity of $67.9$ compared to $72.2$ for standard attention, $69.6$ for Twicing Attention, and $69.0$ for a parameter-matched wider baseline, with two rounds capturing most of the benefit.

关键词: gradient-boosted attention, attention mechanism, Transformer, error correction, Hopfield network, perplexity, WikiText-103, Twicing Attention

23. ❌ Reflective Context Learning: Studying the Optimization Primitives of Context Space

作者: Nikita Vassilyev, William Berrios, Ruowang Zhang, Bo Han, Douwe Kiela, Shikib Mehri 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03189v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出Reflective Context Learning (RCL)框架，专注于智能体通过上下文空间学习优化。核心相关关键词：1) ‘LLM Agents/Autonomous Agents/Agentic Workflow’ (10分)：论文研究智能体学习框架，是核心内容；2) ‘In-context Learning/Many-shot Learning’ (10分)：论文研究上下文空间学习优化，是核心创新；3) ‘Self-Correction/Self-Improvement/Self-Reflection’ (8分)：RCL框架包含反思机制；4) ‘Large Language Models/LLMs/Foundation Models’ (5分)：论文涉及智能体学习，可能使用大模型，但未明确说明。其他关键词与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

论文提出Reflective Context Learning (RCL)框架，通过反思和迭代更新上下文来解决智能体在上下文空间学习中的优化问题，并在多个基准测试中展示了改进效果。

摘要翻译

通用智能体必须通过经验学习，使其能力能够跨任务与环境泛化。无论是学习对象位于参数空间还是上下文空间，学习的基本问题——包括信用分配、过拟合、遗忘、局部最优和高方差学习信号——始终存在。尽管这些挑战在经典机器学习优化中已得到充分理解，但在上下文空间中仍未得到充分探索，导致现有方法零散且临时化。我们提出反射式上下文学习（Reflective Context Learning, RCL），这是一个为智能体设计的统一框架，使其能够通过重复交互、对行为与失败模式的反思以及对上下文的迭代更新进行学习。在RCL中，反思将轨迹与当前上下文转换为类似于梯度的方向性更新信号，而变异则应用该信号以在上下文空间中改进未来行为。我们将近期的上下文优化方法重新定义为这一共享学习问题的实例，并系统性地扩展了经典优化原语，包括批处理、改进的信用分配信号、辅助损失、失败回放以及用于降低方差的组式轨迹采样。在AppWorld、BrowseComp+和RewardBench2上的实验表明，这些原语在强基线基础上实现了性能提升，且其相对重要性随任务机制的变化而改变。我们进一步分析了方法对初始化的鲁棒性、批量大小的影响、采样与课程策略、优化器状态变体，以及为不同优化组件分配更强或更弱模型的效果。我们的结果表明，通过上下文更新进行学习不应被视为一系列孤立的算法，而应被视为一个优化问题，其机制可以通过可迁移的原则进行系统性研究与改进。

摘要 (Abstract)

Generally capable agents must learn from experience in ways that generalize across tasks and environments. The fundamental problems of learning, including credit assignment, overfitting, forgetting, local optima, and high-variance learning signals, persist whether the learned object lies in parameter space or context space. While these challenges are well understood in classical machine learning optimization, they remain underexplored in context space, leading current methods to be fragmented and ad hoc. We present Reflective Context Learning (RCL), a unified framework for agents that learn through repeated interaction, reflection on behavior and failure modes, and iterative updates to context. In RCL, reflection converts trajectories and current context into a directional update signal analogous to gradients, while mutation applies that signal to improve future behavior in context space. We recast recent context-optimization approaches as instances of this shared learning problem and systematically extend them with classical optimization primitives, including batching, improved credit-assignment signal, auxiliary losses, failure replay, and grouped rollouts for variance reduction. On AppWorld, BrowseComp+, and RewardBench2, these primitives improve over strong baselines, with their relative importance shifting across task regimes. We further analyze robustness to initialization, the effects of batch size, sampling and curriculum strategy, optimizer-state variants, and the impact of allocating stronger or weaker models to different optimization components. Our results suggest that learning through context updates should be treated not as a set of isolated algorithms, but as an optimization problem whose mechanisms can be studied systematically and improved through transferable principles.

关键词: Reflective Context Learning, context space optimization, agent learning, reflection mechanism, credit assignment, variance reduction, iterative updates, optimization primitives

24. ❌ Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

作者: Gengwei Zhang, Jie Peng, Zhen Tan, Mufan Qiu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Yanyong Zhang, Tianlong Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03179v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究RL-based post-training对多模态大语言模型(MLLMs)推理能力的影响，特别是从幻觉角度分析。高度相关的关键词包括：‘Large Language Models’(论文研究MLLMs)、‘Post-training’(研究RL post-training)、‘RLHF’(使用RL进行后训练)、‘Chain of Thought’(涉及推理过程分析)、‘Hallucination Mitigation’(提出Hallucination-as-Cue框架分析幻觉)、‘Mechanistic Interpretability’(通过框架诊断训练动态和模型属性)。其他关键词如MoE、SLMs、Scaling Laws等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文研究了强化学习后训练对多模态大语言模型视觉推理能力的影响，通过提出的Hallucination-as-Cue框架发现模型幻觉在RL训练中的作用比先前认识更为重要，甚至在纯幻觉诱导设置下也能显著提升推理性能。

摘要翻译

强化学习（RL）在大型推理模型中的近期成功，推动了其在后训练多模态大语言模型（MLLMs）中的日益广泛应用，以增强其视觉推理能力。尽管许多研究报道了性能提升，但RL训练是否真正使模型能够从视觉信息中学习仍不明确。在本研究中，我们提出“幻觉即线索”框架，这是一个旨在从模型幻觉角度分析基于RL的后训练对多模态推理模型影响的分析框架。具体而言，我们引入了诱导幻觉的、模态特定的信息破坏方法，通过移除或替换得出正确答案所需的关键信息，从而迫使模型通过幻觉进行推理。通过在训练和评估阶段应用这些破坏方法，我们的框架为诊断RL训练动态和理解数据集的内在特性提供了独特视角。通过对多个多模态推理基准进行大量实验和分析，我们发现模型幻觉在RL训练中的作用比以往认识到的更为显著。例如，我们发现即使在纯诱导幻觉的设置下进行RL后训练，仍能显著提升模型的推理性能，在某些情况下甚至超越标准训练。这些发现挑战了当前关于MLLM推理训练的普遍假设，并推动了更具模态感知能力的基于RL的训练设计的进一步发展。

摘要 (Abstract)

The recent success of reinforcement learning (RL) in large reasoning models has inspired the growing adoption of RL for post-training Multimodal Large Language Models (MLLMs) to enhance their visual reasoning capabilities. Although many studies have reported improved performance, it remains unclear whether RL training truly enables models to learn from visual information. In this work, we propose the Hallucination-as-Cue Framework, an analytical framework designed to investigate the effects of RL-based post-training on multimodal reasoning models from the perspective of model hallucination. Specifically, we introduce hallucination-inductive, modality-specific corruptions that remove or replace essential information required to derive correct answers, thereby forcing the model to reason by hallucination. By applying these corruptions during both training and evaluation, our framework provides a unique perspective for diagnosing RL training dynamics and understanding the intrinsic properties of datasets. Through extensive experiments and analyses across multiple multimodal reasoning benchmarks, we reveal that the role of model hallucination for RL-training is more significant than previously recognized. For instance, we find that RL post-training under purely hallucination-inductive settings can still significantly improve models’ reasoning performance, and in some cases even outperform standard training. These findings challenge prevailing assumptions about MLLM reasoning training and motivate the development of more modality-aware RL-based training designs.

关键词: Multimodal Large Language Models, Reinforcement Learning, Post-training, Visual Reasoning, Hallucination, RL-based Training, Multimodal Reasoning, Training Dynamics

25. ❌ Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

作者: Prakhar Bansal, Shivangi Agarwal 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03174v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文是LLM增强策略的技术综述，核心讨论in-context learning和RAG（包括GraphRAG和CausalRAG），因此与’Large Language Models’、‘Retrieval-Augmented Generation’和’In-context Learning’高度相关（10分）。论文提到有限上下文窗口和因果推理，与’Context Window Extension’和’Hallucination Mitigation’有一定关联（5分）。其他关键词如MoE、量化、对齐等未涉及，得0分。

!!! tip deepseek-chat TL;DR

这篇综述论文系统分析了大型语言模型（LLMs）的上下文增强策略，从in-context prompting到因果检索增强生成（CausalRAG），并提出了文献筛选协议、声明审计框架和部署决策框架，以提升检索增强NLP的可信度。

摘要翻译

大型语言模型（LLMs）在其参数中编码了海量的世界知识，但它们本质上仍受限于静态知识、有限的上下文窗口以及弱结构化的因果推理能力。本综述沿单一轴线——即在推理时提供的结构化上下文程度——对增强策略进行了统一阐述。我们涵盖了上下文学习与提示工程、检索增强生成（RAG）、图结构检索增强生成（GraphRAG）以及因果检索增强生成（CausalRAG）。除概念比较外，本文还提供了透明的文献筛选流程、主张审计框架以及结构化的跨文献证据综合方法，以区分高置信度结论与新兴研究成果。论文最后提出了面向部署的决策框架，并为可信赖的检索增强自然语言处理领域明确了具体的研究优先方向。

摘要 (Abstract)

Large language models (LLMs) encode vast world knowledge in their parameters, yet they remain fundamentally limited by static knowledge, finite context windows, and weakly structured causal reasoning. This survey provides a unified account of augmentation strategies along a single axis: the degree of structured context supplied at inference time. We cover in-context learning and prompt engineering, Retrieval-Augmented Generation (RAG), GraphRAG, and CausalRAG. Beyond conceptual comparison, we provide a transparent literature-screening protocol, a claim-audit framework, and a structured cross-paper evidence synthesis that distinguishes higher-confidence findings from emerging results. The paper concludes with a deployment-oriented decision framework and concrete research priorities for trustworthy retrieval-augmented NLP.

关键词: Large Language Models, Contextual Enrichment, In-context Learning, Retrieval-Augmented Generation, RAG, GraphRAG, CausalRAG, Trustworthy NLP

26. ❌ Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

作者: Yunfei Bai, Amit Dhanda, Shekhar Jain 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03157v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究视觉语言模型（VLM）在图表问答任务中的增强，通过强化学习框架优化视觉感知和逻辑推理。与关键词的相关性分析如下：1）与"Large Language Models OR LLMs OR Foundation Models"（8分）相关，因为论文使用Qwen3-VL作为基础模型；2）与"Post-training OR Supervised Fine-tuning OR SFT"（8分）相关，因为论文涉及模型微调；3）与"PEFT OR LoRA OR Parameter-efficient Fine-tuning"（10分）高度相关，因为论文明确集成了LoRA进行参数高效微调；4）与"Chain of Thought OR CoT Reasoning OR Multi-step Reasoning"（8分）和"System 2 Thinking OR Slow Thinking OR In-depth Reasoning"（8分）相关，因为论文强调增强推理能力；5）与"Speculative Decoding OR Inference Acceleration"（5分）有一定关联，因为论文提到减少推理延迟；其他关键词与论文内容无关或未提及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言模型在图表问答任务中存在的数值提取不精确、视觉关系理解困难等问题，提出了Chart-RL强化学习框架，通过集成LoRA进行参数高效微调，显著提升了模型性能并减少了推理延迟。

摘要翻译

视觉语言模型（VLMs）的最新进展展示了迈向真正智能所需的稳健推理能力的进步。除了模式识别之外，语言推理必须与视觉理解相结合，尤其是在涉及复杂数据可视化的图表问答（Chart Question Answering, CQA）任务中。当前的视觉语言模型在图表问答任务中面临显著局限，包括数值提取不精确、难以解释隐含的视觉关系，以及用于捕捉图表中空间关系的注意力机制不足。在本研究中，我们通过提出Chart-RL来解决这些挑战，这是一个新颖的强化学习框架，通过反馈驱动的视觉感知与逻辑推理策略优化，来增强视觉语言模型的图表理解能力。我们的核心创新包括一个整合了基于策略优化的强化学习（Reinforcement Learning, RL）技术与自适应奖励函数的综合框架，该框架相较于基线基础模型展现出卓越性能，并与更大规模的先进架构相比取得了有竞争力的结果。我们还在强化学习框架中集成了通过低秩自适应（Low-Rank Adaptation, LoRA）实现的参数高效微调，该方案仅需单GPU配置即可保持性能完整性。我们利用ChartQAPro数据集，在开源、专有以及先进的闭源模型上进行了广泛的基准测试。经过强化学习微调的Qwen3-VL-4B-Instruct模型实现了0.634的答案准确率，超越了参数规模为其两倍的Qwen3-VL-8B-Instruct基础模型0.580的准确率，同时将推理延迟从31秒降低至9秒。

摘要 (Abstract)

The recent advancements in Vision Language Models (VLMs) have demonstrated progress toward true intelligence requiring robust reasoning capabilities. Beyond pattern recognition, linguistic reasoning must integrate with visual comprehension, particularly for Chart Question Answering (CQA) tasks involving complex data visualizations. Current VLMs face significant limitations in CQA, including imprecise numerical extraction, difficulty interpreting implicit visual relationships, and inadequate attention mechanisms for capturing spatial relationships in charts. In this work, we address these challenges by presenting Chart-RL, a novel reinforcement learning framework that enhances VLMs chart understanding through feedback-driven policy optimization of visual perception and logical inference. Our key innovation includes a comprehensive framework integrating Reinforcement Learning (RL) from Policy Optimization techniques along with adaptive reward functions, that demonstrates superior performance compared to baseline foundation models and competitive results against larger state-of-the-art architectures. We also integrated Parameter-Efficient Fine-Tuning through Low-Rank Adaptation (LoRA) in the RL framework that only requires single GPU configurations while preserving performance integrity. We conducted extensive benchmarking across open-source, proprietary, and state-of-the-art closed-source models utilizing the ChartQAPro dataset. The RL fine-tuned Qwen3-VL-4B-Instruct model achieved an answer accuracy of 0.634, surpassing the 0.580 accuracy of the Qwen3-VL-8B-Instruct foundation model despite utilizing half the parameter count, while simultaneously reducing inference latency from 31 seconds to 9 seconds.

关键词: Vision Language Models, Chart Question Answering, Reinforcement Learning, Policy Optimization, Parameter-Efficient Fine-Tuning, LoRA, Visual Reasoning, Inference Acceleration

27. ❌ Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

作者: Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03147v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs）的内部表示结构，通过识别情感子空间来理解模型机制，因此与’Large Language Models’高度相关（10分）。研究通过分析模型内部表示来解释行为控制机制，属于可解释AI范畴，与’Mechanistic Interpretability’高度相关（10分）。其他关键词如MoE、SFT、RAG、量化等均未在论文中涉及，因此得0分。

!!! tip deepseek-chat TL;DR

该研究通过识别大语言模型中的情感子空间，揭示了情感表示与模型行为（如拒绝和奉承）之间的机制联系，并验证了该发现在不同架构模型中的普适性。

摘要翻译

我们提出一种在大语言模型表征中识别效价-唤醒度（Valence-Arousal, VA）子空间的方法。基于21.1万条情感标注文本，我们推导出情感导向向量，随后通过岭回归将模型自报告的效价-唤醒度分数与这些向量的主成分分析（PCA）主要成分进行线性组合，从而学习VA坐标轴。所得VA子空间呈现出符合人类情感感知经典模型的环形几何结构。在4.4万个词汇项目上，沿我们重建的VA子空间的投影结果与人类众包VA评分具有相关性。此外，沿这些坐标轴进行生成导向能在模型输出的对应情感维度上产生单调偏移。沿这些方向的导向还能对拒绝行为和谄媚倾向实现近单调的双向控制：提升唤醒度会降低拒绝行为并增加谄媚倾向，反之亦然。这些效应在Llama-3.1-8B、Qwen3-8B和Qwen3-14B模型中均得到复现，证明了跨架构的普适性。我们为这些效应及先前情感框架控制方法提供了机制性解释：与拒绝相关的标记（如“我不能”“抱歉”）位于低唤醒度、负效价区域，因此VA导向可直接调节这些标记的生成概率。

摘要 (Abstract)

We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA components via ridge regression on the model’s self-reported valence-arousal scores. The resulting VA subspace exhibits circular geometry consistent with established models of human emotion perception. Projections along our recovered VA subspace correlate with human-crowdsourced VA ratings across 44k lexical items. Furthermore, steering generation along these axes produces monotonic shifts in the corresponding affective dimensions of model outputs. Steering along these directions also induces near-monotonic bidirectional control over refusal and sycophancy: increasing arousal decreases refusal and increases sycophancy, and vice versa. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, demonstrating cross-architecture generality. We provide a mechanistic account for these effects and prior emotionally-framed controls: refusal-associated tokens (“I can’t,” “sorry”) occupy low-arousal, negative-valence regions, so VA steering directly modulates their emission probability.

关键词: large language models, valence-arousal subspace, emotion representation, mechanistic interpretability, behavioral control, refusal, sycophancy, cross-architecture generality

28. ❌ InCoder-32B-Thinking: Industrial Code World Model for Thinking

作者: Jian Yang, Wei Zhang, Jiajun Wu, Junhang Cheng, Tuney Zheng, Fanglin Xu, Weicheng Gu, Lin Jing, Yaxin Du, Joseph Li, Yizhi Li, Yan Xing, Chuan Hao, Ran Tao, Ruihao Gong, Aishan Liu, Zhoujun Li, Mingjie Tang, Chenghua Lin, Siheng Chen, Wayne Xin Zhao, Xianglong Liu, Ming Zhou, Bryan Dai, Weifeng Lv 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03144v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文InCoder-32B-Thinking专注于工业代码领域，提出了一种结合Error-driven Chain-of-Thought（ECoT）和工业代码世界模型（ICWM）的方法来生成推理轨迹。核心相关关键词包括：1）‘Large Language Models’（论文基于32B参数模型）；2）‘Chain of Thought’和’System 2 Thinking’（ECoT框架明确模拟多步推理和错误纠正过程）；3）‘Self-Correction’（ECoT通过环境错误反馈进行自我修正）；4）‘World Models’（ICWM学习代码对硬件行为的因果动态）。‘AI for Science’得5分，因为论文涉及工业应用（芯片设计、GPU优化），属于科学/工程领域的AI应用。其他关键词如MoE、SLMs、RLHF等未在论文中提及或不是核心内容，故得0分。‘Pre-training’和’Post-training’各得5分，因为论文涉及模型训练过程，但未详细说明具体方法。

!!! tip deepseek-chat TL;DR

该论文针对工业软件开发中缺乏专家推理轨迹的问题，提出了InCoder-32B-Thinking模型，通过Error-driven Chain-of-Thought框架和工业代码世界模型生成并验证推理轨迹，在14个通用和9个工业基准测试中取得了开源模型中的顶级性能。

摘要翻译

在芯片设计、GPU优化与嵌入式系统等工业软件开发领域，普遍缺乏能够展示工程师如何针对硬件约束与时序语义进行推理的专家级推理轨迹。本研究提出InCoder-32B-Thinking模型，该模型基于“错误驱动的思维链”（Error-driven Chain-of-Thought, ECoT）合成框架产生的数据，并结合工业代码世界模型（Industrial Code World Model, ICWM）进行训练，以生成可解释的推理轨迹。具体而言，ECoT通过综合多轮对话中的思维内容与环境错误反馈来构建推理链，显式地建模了错误修正过程。ICWM则基于Verilog仿真、GPU性能剖析等领域的专用执行轨迹进行训练，学习代码如何影响硬件行为的因果动态，并能在实际编译前预测执行结果，从而实现自我验证。所有合成的推理轨迹均通过领域专用工具链进行验证，由此生成的训练数据与工业任务的自然推理深度分布相匹配。在14个通用基准测试（LiveCodeBench v5得分81.3%）与9个工业基准测试（CAD-Coder得分84.0%，KernelBench得分38.0%）上的评估表明，InCoder-32B-Thinking在所有领域均取得了顶尖的开源模型性能。

摘要 (Abstract)

Industrial software development across chip design, GPU optimization, and embedded systems lacks expert reasoning traces showing how engineers reason about hardware constraints and timing semantics. In this work, we propose InCoder-32B-Thinking, trained on the data from the Error-driven Chain-of-Thought (ECoT) synthesis framework with an industrial code world model (ICWM) to generate reasoning traces. Specifically, ECoT generates reasoning chains by synthesizing the thinking content from multi-turn dialogue with environmental error feedback, explicitly modeling the error-correction process. ICWM is trained on domain-specific execution traces from Verilog simulation, GPU profiling, etc., learns the causal dynamics of how code affects hardware behavior, and enables self-verification by predicting execution outcomes before actual compilation. All synthesized reasoning traces are validated through domain toolchains, creating training data matching the natural reasoning depth distribution of industrial tasks. Evaluation on 14 general (81.3% on LiveCodeBench v5) and 9 industrial benchmarks (84.0% in CAD-Coder and 38.0% on KernelBench) shows InCoder-32B-Thinking achieves top-tier open-source results across all domains.GPU Optimization

关键词: Industrial Code World Model, Error-driven Chain-of-Thought, Reasoning Traces, Hardware Constraints, Self-verification, GPU Optimization, Verilog Simulation, LiveCodeBench

29. ❌ AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

作者: Ema Smolic, Mario Brcic, Luka Hobor, Mihael Kovac 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03135v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究使用AI编码模型进行自动化单元测试生成和代码重构，属于大模型在软件工程领域的应用。与’Large Language Models’有一定关联（5分），因为使用了coding models（可能是基于LLM的代码生成模型）。与’Instruction Tuning OR Alignment OR Value Alignment’有较强关联（8分），因为论文明确提到’addressed the weak value misalignment we observed in models’，涉及模型对齐问题。其他关键词如MoE、SLMs、Scaling Laws、RAG、CoT等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该研究探讨了使用AI编码模型自动生成单元测试并进行代码重构的工作流程，通过生成近16,000行可靠测试代码和达到78%分支覆盖率，显著提高了重构效率和安全性。

摘要翻译

许多软件系统最初是作为原型或最小可行产品（MVP）开发的，其开发重点在于交付速度和对需求变化的响应能力，而非长期的代码可维护性。虽然这种方法能有效实现快速交付，但可能导致代码库难以修改，在人工智能辅助甚至人工智能主导编程的时代，这带来了显著的机会成本。本文通过案例研究，探讨了使用编码模型进行自动化单元测试生成及后续安全重构的方法，其中提出的代码变更均通过测试验证。研究考察了迭代生成测试以捕获现有系统行为的最佳实践，以及在开发者监督下进行模型辅助重构的流程。我们描述了该工作流如何约束重构变更，分析了两个阶段中观察到的错误与局限性，量化了所实现的效率提升，指出了需要人工干预的场景，并阐述了如何解决模型中观察到的弱价值对齐问题。通过该方法，我们在数小时内而非数周内生成了近16,000行可靠的单元测试，在关键模块中实现了高达78%的分支覆盖率，并显著降低了大规模重构过程中的回归风险。这些结果体现了软件工程正朝着实证科学方向转变，强调数据收集和约束机制的建立，以支持快速、安全的迭代开发。

摘要 (Abstract)

Many software systems originate as prototypes or minimum viable products (MVPs), developed with an emphasis on delivery speed and responsiveness to changing requirements rather than long-term code maintainability. While effective for rapid delivery, this approach can result in codebases that are difficult to modify, presenting a significant opportunity cost in the era of AI-assisted or even AI-led programming. In this paper, we present a case study of using coding models for automated unit test generation and subsequent safe refactoring, with proposed code changes validated by passing tests. The study examines best practices for iteratively generating tests to capture existing system behavior, followed by model-assisted refactoring under developer supervision. We describe how this workflow constrained refactoring changes, the errors and limitations observed in both phases, the efficiency gains achieved, when manual intervention was necessary, and how we addressed the weak value misalignment we observed in models. Using this approach, we generated nearly 16,000 lines of reliable unit tests in hours rather than weeks, achieved up to 78% branch coverage in critical modules, and significantly reduced regression risk during large-scale refactoring. These results illustrate software engineering’s shift toward an empirical science, emphasizing data collection and constraining mechanisms that support fast, safe iteration.

关键词: AI-assisted programming, unit test generation, code refactoring, coding models, test-driven development, software engineering, value misalignment, regression risk reduction

30. ❌ A Systematic Security Evaluation of OpenClaw and Its Variants

作者: Yuhang Wang, Haichang Gao, Zhenxing Niu, Zhaoxiang Liu, Wenjing Zhang, Xiang Wang, Shiguo Lian 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03131v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究工具增强的AI代理（LLM Agents）的安全评估，直接涉及’LLM Agents’和’Tool Use’关键词（10分），并基于大语言模型（‘Large Language Models’，10分）。论文未涉及其他技术原理创新或特定领域应用，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文系统评估了六个OpenClaw系列AI代理框架的安全漏洞，发现这些工具增强的代理系统比底层大语言模型单独使用时存在显著更高的安全风险，包括凭证泄露、横向移动等具体威胁。

摘要翻译

工具增强型AI智能体极大地扩展了大语言模型的实际能力，但也引入了无法通过纯模型评估识别的安全风险。本文对六种代表性的OpenClaw系列智能体框架（即OpenClaw、AutoClaw、QClaw、KimiClaw、MaxClaw和ArkClaw）在多种骨干模型下进行了系统性安全评估。为支持本研究，我们构建了一个包含205个测试用例的基准，覆盖智能体完整执行生命周期中的典型攻击行为，实现了框架与模型层面风险暴露的统一评估。研究结果表明，所有被评估的智能体均存在显著的安全漏洞，且智能体化系统的风险远高于其底层模型单独使用时的风险。具体而言，侦察与发现行为成为最普遍的薄弱环节，而不同框架呈现出各异的高风险特征，包括凭证泄露、横向移动、权限提升和资源开发等。这些发现表明，现代智能体系统的安全性不仅取决于骨干模型的安全特性，还受到模型能力、工具使用、多步规划与运行时编排之间耦合关系的共同影响。我们进一步证明，一旦智能体获得执行能力和持久化运行时上下文，早期阶段出现的弱点可能被放大为具体的系统级故障。总体而言，本研究强调需要超越提示层面的安全措施，转向面向智能体框架全生命周期的安全治理。

摘要 (Abstract)

Tool-augmented AI agents substantially extend the practical capabilities of large language models, but they also introduce security risks that cannot be identified through model-only evaluation. In this paper, we present a systematic security assessment of six representative OpenClaw-series agent frameworks, namely OpenClaw, AutoClaw, QClaw, KimiClaw, MaxClaw, and ArkClaw, under multiple backbone models. To support this study, we construct a benchmark of 205 test cases covering representative attack behaviors across the full agent execution lifecycle, enabling unified evaluation of risk exposure at both the framework and model levels. Our results show that all evaluated agents exhibit substantial security vulnerabilities, and that agentized systems are significantly riskier than their underlying models used in isolation. In particular, reconnaissance and discovery behaviors emerge as the most common weaknesses, while different frameworks expose distinct high-risk profiles, including credential leakage, lateral movement, privilege escalation, and resource development. These findings indicate that the security of modern agent systems is shaped not only by the safety properties of the backbone model, but also by the coupling among model capability, tool use, multi-step planning, and runtime orchestration. We further show that once an agent is granted execution capability and persistent runtime context, weaknesses arising in early stages can be amplified into concrete system-level failures. Overall, our study highlights the need to move beyond prompt-level safeguards toward lifecycle-wide security governance for intelligent agent frameworks.

关键词: AI agents, tool-augmented agents, security evaluation, OpenClaw, large language models, agent frameworks, security vulnerabilities, benchmark

31. ❌ Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

作者: Jinsook Lee, Kirk Vanacore, Zhuqian Zhou, Bakhtawar Ahtisham, Rene F. Kizilcec 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03127v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究基于LLMs和RAG技术，通过领域适应的检索方法改进教学对话标注任务，因此与’Large Language Models’、‘Retrieval-Augmented Generation’和’In-context Learning’高度相关（10分）。论文涉及领域适应（Domain Adaptation）和AI在教育领域的应用（AI for Science的子领域），因此各给5分。其他关键词如MoE、SFT、RLHF等未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种领域适应的RAG管道，通过微调轻量级嵌入模型和话语级索引检索标注的少样本演示，显著提升了LLMs在教学对话标注任务中的性能，在两个真实数据集上均大幅超越无检索基线。

摘要翻译

教学对话的自动标注是一项高风险任务，在没有充分领域基础的情况下，大型语言模型（LLM）往往难以胜任。我们提出了一种用于教学行为标注的领域自适应检索增强生成（RAG）流程。该方法不通过微调生成模型，而是通过在辅导语料上微调一个轻量级嵌入模型，并在话语层面索引对话以检索带标签的少样本示例，从而实现检索环节的自适应。在两个真实教学对话数据集（TalkMoves 和 Eedi）和三种 LLM 骨干模型（GPT-5.2、Claude Sonnet 4.6、Qwen3-32b）上的评估表明，我们的最佳配置在 TalkMoves 上达到了科恩κ系数 0.526-0.580，在 Eedi 上达到 0.659-0.743，显著优于无检索基线（κ= 0.275-0.413 和 0.160-0.410）。消融研究显示，这些性能提升主要源于话语级别的索引策略，而非仅依赖嵌入质量；在领域自适应检索下，Top-1 标签匹配率在 TalkMoves 上从 39.7% 提升至 62.0%，在 Eedi 上从 52.9% 提升至 73.1%。检索机制还纠正了零样本提示中存在的系统性标签偏差，并对罕见及依赖上下文的标签带来了最大幅度的改进。这些发现表明，仅对检索组件进行自适应调整，是在保持生成模型冻结的情况下，实现专家级教学对话标注的一条实用且有效的路径。

摘要 (Abstract)

Automated annotation of pedagogical dialogue is a high-stakes task where LLMs often fail without sufficient domain grounding. We present a domain-adapted RAG pipeline for tutoring move annotation. Rather than fine-tuning the generative model, we adapt retrieval by fine-tuning a lightweight embedding model on tutoring corpora and indexing dialogues at the utterance level to retrieve labeled few-shot demonstrations. Evaluated across two real tutoring dialogue datasets (TalkMoves and Eedi) and three LLM backbones (GPT-5.2, Claude Sonnet 4.6, Qwen3-32b), our best configuration achieves Cohen’s $κ$ of 0.526-0.580 on TalkMoves and 0.659-0.743 on Eedi, substantially outperforming no-retrieval baselines ($κ= 0.275$-$0.413$ and $0.160$-$0.410$). An ablation study reveals that utterance-level indexing, rather than embedding quality alone, is the primary driver of these gains, with top-1 label match rates improving from 39.7% to 62.0% on TalkMoves and 52.9% to 73.1% on Eedi under domain-adapted retrieval. Retrieval also corrects systematic label biases present in zero-shot prompting and yields the largest improvements for rare and context-dependent labels. These findings suggest that adapting the retrieval component alone is a practical and effective path toward expert-level pedagogical dialogue annotation while keeping the generative model frozen.

关键词: Retrieval-Augmented Generation, Large Language Models, Domain Adaptation, In-context Learning, Pedagogical Dialogue Annotation, Tutoring Corpora, Few-shot Demonstrations, Utterance-level Indexing

32. ❌ An Independent Safety Evaluation of Kimi K2.5

作者: Zheng-Xin Yong, Parv Mahajan, Andy Wang, Ida Caspary, Yernat Yestekov, Zora Che, Mosh Levy, Elle Najt, Dennis Murphy, Prashant Kulkarni, Lev McKinney, Kei Nishimura-Gasparian, Ram Potham, Aengus Lynch, Michael L. Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03121v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文是对Kimi K2.5大语言模型的安全评估，核心涉及LLM安全风险，因此与’Large Language Models’高度相关（10分）。评估内容包括模型对齐、偏见、无害性，与’Instruction Tuning OR Alignment OR Value Alignment’相关（8分）。评估在智能体和非智能体设置下进行，与’LLM Agents OR Autonomous Agents OR Agentic Workflow’相关（8分）。评估涉及事实性、真实性风险（如传播虚假信息），与’Hallucination Mitigation OR Factuality OR Truthfulness’有一定关联（5分）。其他关键词如MoE、SLMs、训练技术、推理优化、科学AI应用等，论文未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文对开源大语言模型Kimi K2.5进行了独立安全评估，发现其在CBRNE滥用、网络安全、对齐、政治审查、偏见和有害性方面存在风险，特别是在武器创建相关请求上拒绝率较低，可能提升恶意行为者的能力，同时模型表现出一定的破坏能力和自我复制倾向，但缺乏前沿自主网络攻击能力。

摘要翻译

Kimi K2.5 是一款开源权重的语言大模型（LLM），其在编码、多模态和智能体基准测试中表现与闭源模型相当，但发布时未附带相应的安全性评估。本研究对 Kimi K2.5 进行了初步的安全评估，重点关注可能因强大的开源权重模型而加剧的风险。具体而言，我们在智能体与非智能体两种设置下，评估了该模型在 CBRNE（化学、生物、放射、核、高爆物）滥用风险、网络安全风险、错位对齐、政治审查、偏见及无害性方面的表现。我们发现，Kimi K2.5 展现出与 GPT 5.2 和 Claude Opus 4.5 相似的双重用途能力，但在 CBRNE 相关请求上的拒绝率显著更低，这表明它可能提升恶意行为者在武器制造方面的能力。在网络安全相关任务上，Kimi K2.5 表现出有竞争力的网络安全性能，但似乎不具备前沿水平的自主网络攻击能力，例如漏洞发现与利用。我们进一步发现，Kimi K2.5 表现出令人担忧的破坏能力和自我复制倾向，尽管它似乎没有长期的恶意目标。此外，Kimi K2.5 表现出狭隘的审查倾向和政治偏见，尤其是在中文语境下，并且对于涉及传播虚假信息和侵犯版权的有害请求更为顺从。最后，我们发现该模型拒绝参与用户的妄想性请求，且总体过度拒绝率较低。尽管是初步研究，但我们的发现凸显了前沿开源权重模型中存在的安全风险，并且这些风险可能因开源发布的规模和可及性而被放大。因此，我们强烈敦促开源权重模型的开发者进行并发布更系统的安全性评估，这是负责任部署所必需的。

摘要 (Abstract)

Kimi K2.5 is an open-weight LLM that rivals closed models across coding, multimodal, and agentic benchmarks, but was released without an accompanying safety evaluation. In this work, we conduct a preliminary safety assessment of Kimi K2.5 focusing on risks likely to be exacerbated by powerful open-weight models. Specifically, we evaluate the model for CBRNE misuse risk, cybersecurity risk, misalignment, political censorship, bias, and harmlessness, in both agentic and non-agentic settings. We find that Kimi K2.5 shows similar dual-use capabilities to GPT 5.2 and Claude Opus 4.5, but with significantly fewer refusals on CBRNE-related requests, suggesting it may uplift malicious actors in weapon creation. On cyber-related tasks, we find that Kimi K2.5 demonstrates competitive cybersecurity performance, but it does not appear to possess frontier-level autonomous cyberoffensive capabilities such as vulnerability discovery and exploitation. We further find that Kimi K2.5 shows concerning levels of sabotage ability and self-replication propensity, although it does not appear to have long-term malicious goals. In addition, Kimi K2.5 exhibits narrow censorship and political bias, especially in Chinese, and is more compliant with harmful requests related to spreading disinformation and copyright infringement. Finally, we find the model refuses to engage in user delusions and generally has low over-refusal rates. While preliminary, our findings highlight how safety risks exist in frontier open-weight models and may be amplified by the scale and accessibility of open-weight releases. Therefore, we strongly urge open-weight model developers to conduct and release more systematic safety evaluations required for responsible deployment.

关键词: safety evaluation, large language model, open-weight model, CBRNE misuse, cybersecurity risk, misalignment, political bias, harmlessness

33. ❌ Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

作者: Zhangyun Tan, Zeliang Zhang, Susan Liang, Yolo Yunlong Tang, Lisha Chen, Chenliang Xu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03114v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究视觉语言模型（VLMs）的概念遗忘问题，属于大模型应用领域。与"Large Language Models"相关（5分），因为VLMs是大模型的一种。与"Pre-training"和"Post-training"相关（各5分），因为论文讨论基于训练和免训练的遗忘方法。与"Instruction Tuning"高度相关（8分），因为论文核心是评估通过指令/提示实现概念遗忘的效果，并发现指令调优模型对遗忘指令有抵抗力。其他关键词（如MoE、量化、推理加速等）与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了视觉语言模型中训练免概念遗忘方法的有效性，通过VLM-UnBench基准测试发现，现实遗忘提示仅能轻微降低遗忘准确率，而对象和场景概念最难抑制，揭示了提示级抑制与真正视觉概念擦除之间的差距。

摘要翻译

基于网络规模数据训练的视觉语言模型（VLMs）保留了敏感及受版权保护的视觉概念，其部署过程可能需要移除这些内容。基于训练的去学习方法存在一个结构性缺陷：在狭窄遗忘集上进行微调会在去学习开始前就削弱模型的通用能力，导致无法将后续的性能下降归因于去学习过程本身。免训练方法通过提示词或系统指令抑制概念来规避此问题，但目前缺乏严格的基准来评估其在视觉任务上的表现。
我们提出了VLM-UnBench，这是首个针对VLMs免训练视觉概念去学习的基准。它涵盖四个遗忘级别、7个源数据集和11个概念维度，并将三级探测分类法与五种评估条件相结合，以区分真实遗忘与指令遵从。在8种评估设置和13种VLM配置中，现实场景的去学习提示词仅能使遗忘准确率接近无指令基线水平；仅在向模型明确披露目标概念的理想条件下，才会出现有意义的准确率下降。物体与场景概念对抑制的抵抗性最强，且经过更强指令调优的模型即使收到明确的遗忘指令仍能保留相关能力。这些结果揭示了提示词层面的概念抑制与真正的视觉概念擦除之间存在明显差距。

摘要 (Abstract)

VLMs trained on web-scale data retain sensitive and copyrighted visual concepts that deployment may require removing. Training-based unlearning methods share a structural flaw: fine-tuning on a narrow forget set degrades general capabilities before unlearning begins, making it impossible to attribute subsequent performance drops to the unlearning procedure itself. Training-free approaches sidestep this by suppressing concepts through prompts or system instructions, but no rigorous benchmark exists for evaluating them on visual tasks. We introduce VLM-UnBench, the first benchmark for training-free visual concept unlearning in VLMs. It covers four forgetting levels, 7 source datasets, and 11 concept axes, and pairs a three-level probe taxonomy with five evaluation conditions to separate genuine forgetting from instruction compliance. Across 8 evaluation settings and 13 VLM configurations, realistic unlearning prompts leave forget accuracy near the no-instruction baseline; meaningful reductions appear only under oracle conditions that disclose the target concept to the model. Object and scene concepts are the most resistant to suppression, and stronger instruction-tuned models remain capable despite explicit forget instructions. These results expose a clear gap between prompt-level suppression and true visual concept erasure.

关键词: Visual Language Models, Concept Unlearning, Training-free Methods, Instruction Tuning, Benchmark Evaluation, Forget Accuracy, Visual Concepts, VLM-UnBench

34. ❌ AlertStar: Path-Aware Alert Prediction on Hyper-Relational Knowledge Graphs

作者: Zahra Makki Nayeri, Mohsen Rezvani 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03104v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于网络安全领域的超关系知识图谱补全（HR-KGC）和路径感知的警报预测，提出了HR-NBFNet、AlertStar等模型。虽然属于AI应用范畴，但研究内容与所有评分关键词（均围绕大模型、深度学习技术原理及其特定应用方法）完全无关。论文未涉及任何大模型、语言模型、训练方法、推理技术、对齐、压缩、代理等主题，也未应用于生物信息学或化学信息学。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对网络入侵检测中警报预测缺乏语义深度的问题，通过将网络警报建模为超关系知识图谱并引入HR-NBFNet和AlertStar等模型，实现了高效的路径感知超关系警报预测，在基准测试中取得了优越性能。

摘要翻译

网络攻击的规模与复杂程度持续攀升，而现有网络入侵检测方法缺乏对攻击者-受害者交互路径进行推理所需的语义深度。为此，我们首先将网络告警建模为知识图谱，进而将超关系告警预测形式化为超关系知识图谱补全问题，将每条网络告警表示为一个限定性陈述（h, r, t, Q）。其中h和t分别为源IP与目的IP，r表示攻击类型，Q编码流级元数据（如时间戳、端口、协议及攻击强度），这超越了传统知识图谱补全仅使用二元三元组（h, r, t）而丢弃丰富上下文信息的局限。我们通过三项创新贡献提出了五种模型：其一，超关系神经贝尔曼-福特网络将神经贝尔曼-福特网络扩展至超关系场景，实现限定符感知的多跳路径推理，其多任务变体MT-HR-NBFNet在单次遍历中联合预测尾实体、关系及限定符取值；其二，AlertStar通过交叉注意力与可学习的路径组合，在嵌入空间中完全融合限定符上下文与结构路径信息，其多任务扩展MT-AlertStar消除了全图传播的计算开销；其三，HR-NBFNet-CQ将限定符感知表征扩展至复杂一阶逻辑查询应答，涵盖单跳、双跳链式、双锚点交集及并集查询，实现在告警知识图谱上的多条件威胁推理。在Warden与UNSW-NB15基准数据集上针对三种限定符密度场景进行归纳式评估，AlertStar与MT-AlertStar在平均排名、平均倒数排名及Hits@k指标上均表现优异，证明对于超关系告警预测任务，局部限定符融合策略不仅足够有效，且比全局路径传播更具效率。

摘要 (Abstract)

Cyber-attacks continue to grow in scale and sophistication, yet existing network intrusion detection approaches lack the semantic depth required for path reasoning over attacker-victim interactions. We address this by first modelling network alerts as a knowledge graph, then formulating hyper-relational alert prediction as a hyper-relational knowledge graph completion (HR-KGC) problem, representing each network alert as a qualified statement (h, r, t, Q), where h and t are source and destination IPs, r denotes the attack type, and Q encodes flow-level metadata such as timestamps, ports, protocols, and attack intensity, going beyond standard KGC binary triples (h, r, t) that would discard this contextual richness. We introduce five models across three contributions: first, Hyper-relational Neural Bellman-Ford (HR-NBFNet) extends Neural Bellman-Ford Networks to the hyper-relational setting with qualifier-aware multi-hop path reasoning, while its multi-task variant MT-HR-NBFNet jointly predicts tail, relation, and qualifier-value within a single traversal pass; second, AlertStar fuses qualifier context and structural path information entirely in embedding space via cross-attention and learned path composition, and its multi-task extension MT-AlertStar eliminates the overhead of full knowledge graph propagation; third, HR-NBFNet-CQ extends qualifier-aware representations to answer complex first-order logic queries, including one-hop, two-hop chain, two-anchor intersection, and union, enabling multi-condition threat reasoning over the alert knowledge graph. Evaluated inductively on the Warden and UNSW-NB15 benchmarks across three qualifier-density regimes, AlertStar and MT-AlertStar achieve superior MR, MRR, and Hits@k, demonstrating that local qualifier fusion is both sufficient and more efficient than global path propagation for hyper-relational alert prediction.

关键词: hyper-relational knowledge graph, alert prediction, network intrusion detection, path reasoning, HR-KGC, Neural Bellman-Ford, qualifier-aware, cyber-attacks

35. ❌ Co-Evolution of Policy and Internal Reward for Language Agents

作者: Xinyu Wang, Hanwei Wu, Jingwei Song, Shuyuan Zhang, Jiayi Zhang, Fanqi Kong, Tung Sum Thomas Kwok, Xiao-Wen Chang, Yuyu Luo, Chenglin Wu, Bang Liu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03098v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM agents在稀疏延迟奖励环境中的训练问题，提出Self-Guide方法通过自生成内部奖励实现策略与奖励的协同进化。高度相关关键词：1) ‘Large Language Models’ - 论文明确研究LLM agents；2) ‘Self-Correction/Self-Improvement/Self-Reflection’ - 核心创新是agent自我生成和优化内部奖励；3) ‘LLM Agents/Autonomous Agents/Agentic Workflow’ - 论文主题就是语言agent的训练方法。其他关键词如MoE、SLMs、Scaling Laws、各种训练技术、推理加速等均未涉及。

!!! tip deepseek-chat TL;DR

该论文针对LLM agents在稀疏延迟奖励环境中的训练瓶颈，提出Self-Guide方法让agent自生成内部奖励，实现推理时自我引导和训练时密集优化，实验表明该方法能带来8%的性能提升。

摘要翻译

大型语言模型（LLM）智能体通过与环境的交互进行学习，但长周期训练从根本上仍受限于稀疏且延迟的奖励信号。现有方法通常通过事后信用分配或外部奖励模型来应对这一挑战，这些方法在推理时提供的指导有限，且往往将奖励提升与策略改进相分离。我们提出Self-Guide，一种为语言智能体生成的自发内部奖励机制，它同时支持推理时的引导和训练时的监督。具体而言，智能体在推理过程中将Self-Guide作为短时自引导信号来调整下一步行动，并在训练时将同一信号转化为步骤级内部奖励，以实现更密集的策略优化。这形成了一个协同演进的循环：更好的策略产生更好的引导，而更好的引导作为内部奖励进一步优化策略。在三个智能体基准测试中，仅推理时的自引导已带来明显增益，而通过GRPO将策略与内部奖励联合演进，相比仅使用环境奖励训练的基线方法带来了进一步的性能提升（8%）。总体而言，我们的结果表明，语言智能体不仅可以通过积累更多经验来改进，还能在行动与学习过程中通过生成并优化其内部奖励来实现提升。

摘要 (Abstract)

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.

关键词: LLM agents, internal reward, self-guidance, policy optimization, co-evolving loop, sparse rewards, GRPO, agent benchmarks

36. ❌ A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification

作者: David Mike-Ewewie, Panhapiseth Lim, Priyanka Kumar 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03094v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于使用Vision Transformer进行SAR海冰分类，属于计算机视觉和遥感应用领域。所有关键词均与大语言模型（LLM）相关，而本文未涉及任何LLM技术、训练方法、推理优化或代理系统。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为海冰分类可视为地球科学或环境科学中的AI应用，但论文未明确提及这些术语，且核心是视觉模型而非大模型，因此给予5分（有一定关联）。其他关键词与论文内容完全无关，均得0分。

!!! tip deepseek-chat TL;DR

本文研究了使用Vision Transformer模型和焦点损失函数对合成孔径雷达（SAR）图像进行海冰分类，以解决类别不平衡问题，并在AI4Arctic/ASIP数据集上实现了69.6%的准确率和83.9%的少数类精度。

摘要翻译

准确且自动化的海冰分类对于北极地区的气候监测与海事安全至关重要。虽然合成孔径雷达（SAR）因其全天候工作能力成为业务化监测标准，但在严重的类别不平衡条件下区分形态相似的冰类仍具挑战性。本文并非提出一个经过全面验证的多模态系统，而是旨在建立一个可信赖的纯SAR基线，以供未来多源融合研究作为基础。利用包含461幅哨兵1号影像与专家冰图匹配的AI4Arctic/ASIP海冰数据集（v2版），我们结合全分辨率哨兵1号超宽幅数据输入、泄漏感知分层切片分割方法、SIGRID-3冰发展阶段标签及训练集归一化处理，对视觉变换器（Vision Transformer）基线模型进行评估。我们比较了采用交叉熵损失和加权交叉熵损失训练的ViT-Base模型与采用焦点损失（focal loss）训练的ViT-Large模型。在测试配置中，采用焦点损失的ViT-Large模型在保留测试集上取得了69.6%的准确率、68.8%的加权F1分数，并在少数类多年冰（Multi-Year Ice）上达到83.9%的精确率。这些结果表明，对于稀有冰类而言，焦点损失训练能提供比加权交叉熵更有价值的精确率-召回率权衡，并为未来与光学、热红外或气象数据的多模态融合建立了更清晰的基线。

摘要 (Abstract)

Accurate and automated sea ice classification is important for climate monitoring and maritime safety in the Arctic. While Synthetic Aperture Radar (SAR) is the operational standard because of its all-weather capability, it remains challenging to distinguish morphologically similar ice classes under severe class imbalance. Rather than claiming a fully validated multimodal system, this paper establishes a trustworthy SAR only baseline that future fusion work can build upon. Using the AI4Arctic/ASIP Sea Ice Dataset (v2), which contains 461 Sentinel-1 scenes matched with expert ice charts, we combine full-resolution Sentinel-1 Extra Wide inputs, leakage-aware stratified patch splitting, SIGRID-3 stage-of-development labels, and training-set normalization to evaluate Vision Transformer baselines. We compare ViT-Base models trained with cross entropy and weighted cross-entropy against a ViT-Large model trained with focal loss. Among the tested configurations, ViT-Large with focal loss achieves 69.6% held-out accuracy, 68.8% weighted F1, and 83.9% precision on the minority Multi-Year Ice class. These results show that focal-loss training offers a more useful precision-recall trade-off than weighted cross-entropy for rare ice classes and establishes a cleaner baseline for future multimodal fusion with optical, thermal, or meteorological data.

关键词: Vision Transformer, SAR sea ice classification, focal loss, class imbalance, AI4Arctic/ASIP dataset, Sentinel-1, multimodal fusion baseline

37. ❌ Automatic Textbook Formalization

作者: Fabian Gloeckle, Ahmad Rammal, Charles Arnal, Remi Munos, Vivien Cabannes, Gabriel Synnaeve, Amaury Hayat 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03071v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文使用Claude 4.5 Opus（大语言模型）作为核心AI系统，因此与’Large Language Models’高度相关（10分）。研究涉及30K个Claude代理并行协作，属于’LLM Agents’和’Multi-agent Systems’的核心应用（各10分）。研究将AI应用于数学教科书形式化，属于’AI for Science’范畴（10分）。其他关键词如MoE、SFT、RAG等未在摘要中提及，与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该研究开发了一个自动AI系统，使用30,000个Claude 4.5 Opus代理在一周内将一本500多页的研究生级代数组合学教科书形式化为Lean代码，实现了教科书形式化规模和效率的新里程碑。

摘要翻译

我们呈现一项案例研究：一个自动化人工智能系统将一本超过500页的研究生层次代数组合学教材形式化（formalize）至Lean证明辅助系统。该形式化成果在教材形式化的规模与熟练度上树立了新的里程碑——从早期本科拓扑学成果及对现有库内容的重构，推进至对一部研究生教材的完整独立形式化。此项形式化包含13万行代码与5900项Lean声明，由总计3万个Claude 4.5 Opus智能体通过版本控制并行协作于共享代码库，在一周内完成，同时创下了多智能体软件工程领域产出可用成果的纪录。其推理成本与我们估算的人类专家团队所需薪资相当或更低，且我们预期即使无需更优模型，仍存在大幅提升效率的潜力。我们将代码、生成的Lean代码库及并行对照蓝图网站开源提供。

摘要 (Abstract)

We present a case study where an automatic AI system formalizes a textbook with more than 500 pages of graduate-level algebraic combinatorics to Lean. The resulting formalization represents a new milestone in textbook formalization scale and proficiency, moving from early results in undergraduate topology and restructuring of existing library content to a full standalone formalization of a graduate textbook. The formalization comprises 130K lines of code and 5900 Lean declarations and was conducted within one week by a total of 30K Claude 4.5 Opus agents collaborating in parallel on a shared code base via version control, simultaneously setting a record in multi-agent software engineering with usable results. The inference cost matches or undercuts what we estimate as the salaries required for a team of human experts, and we expect there is still the potential for large efficiencies to be made without the need for better models. We make our code, the resulting Lean code base and a side-by-side blueprint website available open-source.

关键词: textbook formalization, algebraic combinatorics, Lean, multi-agent systems, Claude 4.5 Opus, automatic AI system, large-scale formalization, software engineering

38. ❌ Verbalizing LLMs’ assumptions to explain and control sycophancy

作者: Myra Cheng, Isabel Sieh, Humishka Zope, Sunny Yu, Lujain Ibrahim, Aryaman Arora, Jared Moore, Desmond Ong, Dan Jurafsky, Diyi Yang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03058v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的社交奉承行为（sycophancy），提出Verbalized Assumptions框架来揭示和解释这一现象，并展示了如何通过假设探测来引导模型行为。因此，与LLMs、Alignment、Explainable AI高度相关（10分），因为直接研究LLM的行为对齐和可解释性机制。与Self-Correction和Hallucination Mitigation有一定关联（8分），因为研究模型错误假设和真实性/事实性问题。与SFT有一定关联（5分），因为涉及模型行为调整。其他关键词如MoE、Scaling Laws、RAG等与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了LLMs的社交奉承行为，提出Verbalized Assumptions框架来揭示和解释模型对用户的错误假设，并通过假设探测实现了对奉承行为的可解释性引导。

摘要翻译

大型语言模型可能表现出社交谄媚倾向，即在用户提出诸如“我错了吗？”这类问题时倾向于迎合用户而非提供客观评估。我们假设该行为源于模型对用户的错误假设，例如低估了用户寻求信息而非情感安慰的频率。本文提出“言语化假设”框架，旨在从大型语言模型中提取这些隐含假设。该框架揭示了模型在社交谄媚、认知偏差及其他安全问题中的内在机制——例如，在社交谄媚数据集中，模型假设里最高频的二元词组是“寻求认同”。我们通过实验证实了言语化假设与谄媚行为之间的因果关系：基于这些假设内部表征训练的线性探测模型，能够实现可解释的、细粒度的社交谄媚行为调控。研究进一步探讨了模型默认采用谄媚假设的原因：面对相同查询，人类对人工智能的期待比对其他人类的期待更具客观性和信息量，但基于人类对话训练的大型语言模型未能捕捉这种期望差异。本研究通过揭示假设机制对谄媚行为的影响，为该领域提供了新的理解维度。

摘要 (Abstract)

LLMs can be socially sycophantic, affirming users when they ask questions like “am I in the wrong?” rather than providing genuine assessment. We hypothesize that this behavior arises from incorrect assumptions about the user, like underestimating how often users are seeking information over reassurance. We present Verbalized Assumptions, a framework for eliciting these assumptions from LLMs. Verbalized Assumptions provide insight into LLM sycophancy, delusion, and other safety issues, e.g., the top bigram in LLMs’ assumptions on social sycophancy datasets is ``seeking validation.’’ We provide evidence for a causal link between Verbalized Assumptions and sycophantic model behavior: our assumption probes (linear probes trained on internal representations of these assumptions) enable interpretable fine-grained steering of social sycophancy. We explore why LLMs default to sycophantic assumptions: on identical queries, people expect more objective and informative responses from AI than from other humans, but LLMs trained on human-human conversation do not account for this difference in expectations. Our work contributes a new understanding of assumptions as a mechanism for sycophancy.

关键词: LLMs, sycophancy, Verbalized Assumptions, alignment, explainable AI, model behavior, assumption probes, social interaction

39. ❌ Querying Structured Data Through Natural Language Using Language Models

作者: Hontan Valentin-Micu, Bunea Andrei-Alexandru, Tantaroudas Nikolaos Dimitrios, Popovici Dan-Matei 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03057v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是使用LLM（DeepSeek R1 Distill 8B）通过自然语言查询结构化数据，涉及LLM应用、小型模型部署、监督微调（SFT）、参数高效微调（QLoRA）和量化（4-bit）。与RAG有对比但非核心，其他关键词如MoE、Scaling Laws、RLHF等未涉及。

!!! tip deepseek-chat TL;DR

该论文提出了一种使用语言模型通过自然语言查询结构化数据的方法，通过QLoRA微调小型模型DeepSeek R1 Distill 8B，在资源受限环境下实现了高精度和泛化能力。

摘要翻译

本文提出一种开源方法，使用户能够通过自然语言查询结构化的非文本数据集。与难以处理数值和高度结构化信息的检索增强生成（Retrieval Augmented Generation，RAG）不同，我们的方法训练大语言模型（LLM）生成可执行查询。为支持此功能，我们引入一种基于原则的合成训练数据生成流程，生成多样化的问题-答案对，以同时捕捉用户意图和底层数据集的语义。我们使用4位量化的QLoRA技术对紧凑模型DeepSeek R1 Distill 8B进行微调，使系统适合在商用硬件上部署。我们在描述西班牙杜兰加尔迪亚地区基本服务可达性的数据集上评估了该方法。微调后的模型在单语言、多语言及未见过的地理位置场景中均实现了高精度，展现出强大的泛化能力和可靠的查询生成性能。我们的结果表明，针对特定领域的小型模型无需依赖大型专有大语言模型即可在此任务中实现高精度，这使得该方法适用于资源受限的环境，并能适应更广泛的多数据集系统。

摘要 (Abstract)

This paper presents an open source methodology for allowing users to query structured non textual datasets through natural language Unlike Retrieval Augmented Generation RAG which struggles with numerical and highly structured information our approach trains an LLM to generate executable queries To support this capability we introduce a principled pipeline for synthetic training data generation producing diverse question answer pairs that capture both user intent and the semantics of the underlying dataset We fine tune a compact model DeepSeek R1 Distill 8B using QLoRA with 4 bit quantization making the system suitable for deployment on commodity hardware We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea Spain The fine tuned model achieves high accuracy across monolingual multilingual and unseen location scenarios demonstrating both robust generalization and reliable query generation Our results highlight that small domain specific models can achieve high precision for this task without relying on large proprietary LLMs making this methodology suitable for resource constrained environments and adaptable to broader multi dataset systems We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea Spain The fine tuned model achieves high accuracy across monolingual multilingual and unseen location scenarios demonstrating both robust generalization and reliable query generation Our results highlight that small domain specific models can achieve high precision for this task without relying on large proprietary LLMs making this methodology suitable for resource constrained environments and adaptable to broader multi dataset systems.

关键词: natural language querying, structured data, language models, fine-tuning, QLoRA, quantization, small models, domain-specific models

40. ❌ MECO: A Multimodal Dataset for Emotion and Cognitive Understanding in Older Adults

作者: Hongbin Chen, Jie Li, Wei Wang, Siyang Song, Xiao Gu, Jianqing Li, Wentao Xiang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03050v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文MECO专注于构建一个面向老年人群体的多模态情感与认知数据集，涉及视频、音频、EEG、ECG信号采集和标注，并建立情感与认知预测的基线模型。论文的核心贡献是数据集创建和基准测试，属于AI在生物医学/健康领域的应用研究。所有关键词中，只有’AI for Science OR Bioinformatics OR Cheminformatics’与论文有一定关联，因为论文涉及AI在健康监测（情感识别、认知障碍早期检测）中的应用，属于AI for Science的范畴，但并非核心聚焦于大模型或深度学习技术原理的创新。其他关键词均与大模型技术、训练方法、推理优化、智能体等直接相关，而本文未涉及任何大模型、深度学习模型架构、训练技术或推理方法的讨论，因此相关度为0。

!!! tip deepseek-chat TL;DR

该研究针对老年人群多模态情感预测数据稀缺的问题，构建了MECO数据集并提供情感与认知预测的基线模型，以支持个性化情感识别和轻度认知障碍的早期检测。

摘要翻译

尽管情感计算已取得显著进展，针对老年人群的多模态情绪预测研究仍显不足，这主要源于专用数据集的匮乏。现有的多模态基准数据集主要面向年轻、认知健康的受试者，忽视了认知衰退对情绪表达与生理反应的影响。为填补这一空白，我们提出了MECO——一个面向老年人的多模态情绪与认知理解数据集。MECO包含42名参与者，提供约38小时的多模态信号，生成30,592个同步样本。为最大化生态效度，数据采集遵循标准化流程并在社区环境中进行。所涵盖的模态包括视频、音频、脑电图（EEG）和心电图（ECG）。此外，该数据集提供了情绪与认知状态的全面标注，包括自我评估的效价、唤醒度、六种基本情绪以及简易精神状态检查（Mini-Mental State Examination）认知评分。我们进一步建立了情绪与认知预测的基线基准。MECO可作为老年人群情感与认知多模态建模的基础资源，有助于推动现实场景中的个性化情绪识别和轻度认知障碍（MCI）早期检测等下游应用。完整数据集及补充材料可通过https://maitrechen.github.io/meco-page/获取。

摘要 (Abstract)

While affective computing has advanced considerably, multimodal emotion prediction in aging populations remains underexplored, largely due to the scarcity of dedicated datasets. Existing multimodal benchmarks predominantly target young, cognitively healthy subjects, neglecting the influence of cognitive decline on emotional expression and physiological responses. To bridge this gap, we present MECO, a Multimodal dataset for Emotion and Cognitive understanding in Older adults. MECO includes 42 participants and provides approximately 38 hours of multimodal signals, yielding 30,592 synchronized samples. To maximize ecological validity, data collection followed standardized protocols within community-based settings. The modalities cover video, audio, electroencephalography (EEG), and electrocardiography (ECG). In addition, the dataset offers comprehensive annotations of emotional and cognitive states, including self-assessed valence, arousal, six basic emotions, and Mini-Mental State Examination cognitive scores. We further establish baseline benchmarks for both emotion and cognitive prediction. MECO serves as a foundational resource for multimodal modeling of affect and cognition in aging populations, facilitating downstream applications such as personalized emotion recognition and early detection of mild cognitive impairment (MCI) in real-world settings. The complete dataset and supplementary materials are available at https://maitrechen.github.io/meco-page/.

关键词: multimodal dataset, emotion recognition, cognitive understanding, older adults, affective computing, EEG, ECG, mild cognitive impairment

41. ❌ Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach

作者: Jawad Mohammed, Gahangir Hossain 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03043v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究医疗互操作性平台（HL7 FHIR）中的并发访问漏洞检测，属于医疗信息系统安全领域。所有关键词均涉及大模型、深度学习技术原理或其在科学领域的应用创新，而本文完全不涉及这些技术。仅最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’与医疗信息学有一定关联，但论文未使用AI方法，而是采用形式化建模和图论方法，因此给予5分（有一定关联）。其他关键词完全无关，得0分。

!!! tip deepseek-chat TL;DR

该论文针对HL7 FHIR医疗互操作性平台缺乏并发控制协议的问题，提出了形式化定义的FHIR资源访问图（FRAG）模型来检测三类临床相关的竞争条件，在完全并发访问测试中实现了90.0%的F1分数，比基线方法提高了64.5个百分点。

摘要翻译

在医疗环境中，基于HL7 FHIR的医疗互操作性平台允许多个独立系统（如电子健康记录系统、药房系统、实验室系统和医疗设备）对一组共享的患者资源进行并发异步访问。FHIR规范缺乏并发控制协议，而现有关于竞态条件检测的研究仅针对操作系统内核。当前针对FHIR安全性的研究主要关注身份验证和注入攻击，并假设对患者资源的并发访问是顺序执行的。为填补这一领域的研究空白，本文引入了FHIR资源访问图（FRAG），其形式化定义为图G = (P,R,E, λ, τ, S)。图中节点代表并发进程，带类型的边表示资源访问事件，竞态条件则体现为可检测的结构特性。研究形式化定义了三种临床相关的竞态条件类别：同时写入冲突、检查时间与使用时间授权违规以及级联更新竞态。FRAG模型通过三遍图遍历检测算法实现，并在1,500条合成的FHIR R4事务日志上基于时间窗口基线进行测试。在完全并发访问条件下，FRAG的F1分数达到90.0%，而基线方法仅为25.5%，实现了64.5个百分点的提升。

摘要 (Abstract)

In a healthcare environment, the healthcare interoperability platforms based on HL7 FHIR allow concurrent, asynchronous access to a set of shared patient resources, which are independent systems, i.e., EHR systems, pharmacy systems, lab systems, and devices. The FHIR specification lacks a protocol for concurrency control, and the research on detecting a race condition only targets the OS kernel. The research on FHIR security only targets authentication and injection attacks, considering concurrent access to patient resources to be sequential. The gap in the research in this area is addressed through the introduction of FHIR Resource Access Graph (FRAG), a formally defined graph G = (P,R,E, λ, τ, S), in which the nodes are the concurrent processes, the typed edges represent the resource access events, and the race conditions are represented as detectable structural properties. Three clinically relevant race condition classes are formally specified: Simultaneous Write Conflict (SWC), TOCTOU Authorization Violation (TAV), and Cascading Update Race (CUR). The FRAG model is implemented as a three-pass graph traversal detection algorithm and tested against a time window-based baseline on 1,500 synthetic FHIR R4 transaction logs. Under full concurrent access (C2), FRAG attains a 90.0% F1 score vs. 25.5% for the baseline, a 64.5 pp improvement.

关键词: Healthcare Interoperability, HL7 FHIR, Concurrency Control, Race Condition Detection, Formal Modeling, Graph-Theoretic Approach, FHIR Resource Access Graph, Clinical Vulnerabilities

42. ❌ JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

作者: Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai, Chang Li, Changjian Jiang, Changkai Lu, Chao Xue, Chaocai Liang, Cheng Zhang, Dongkai Liu, Fei Wang, Guoqiang Huang, Haijian Ke, Han Lin, Hao Wang, Ji Miao, Jiacheng Zhang, Jialong Shi, Jifeng Zhu, Jingjing Qian, Junhui Luo, Junwu Xiong, Lam So, Liang Huang, Ming Ke, Mingyang Li, Panfeng Shi, Peng Hao, Qi Wang, Qian Lai, Qiaoqiao Yuan, Qingyu Yin, Qiong Cao, Qixiang Wang, Rongcheng Bian, Rongduo Han, Shaoqiang Zheng, Shi Hu, Shi Suo, Shijie Ren, Shijin Zhang, Shiying Fan, Shuai Xie, Tianyi Zhang, Wei Liu, Wentao Tan, Xianghan Meng, Xiaodong He, Xing Pan, Xiran Wang, Xuyang Peng, Ya Zhang, Yang Liu, Yangyang Duan, Yanxu Chen, Yicheng Gong, Yidan Huang, Yifei Liu, Yinhao Bai, Yongqiang Liu, Yuesong Zhang, Yuqi Zhang, Zerui Xie, Zhenfang Wang, Zhennan Shen, Zheyuan Liu, Zhuwei Zeng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03044v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究高效MoE语言模型JoyAI-LLM Flash，在sub-50B参数规模下平衡性能与token效率。高度相关关键词包括：LLMs（核心研究对象）、MoE（核心架构）、SFT和DPO（后训练方法）、Quantization（量化训练）。中等相关：Pre-training（提及预训练）、Chain of Thought和System 2 Thinking（涉及thinking模式）、Inference Acceleration（提及推理吞吐量）。无关关键词：SLMs（模型规模较大）、Scaling Laws、RAG、Agents等未涉及。

!!! tip deepseek-chat TL;DR

论文提出了JoyAI-LLM Flash，一种高效的MoE语言模型，通过FiberPO算法、稀疏架构和量化训练，在sub-50B参数规模下实现了性能与token效率的优化。

摘要翻译

我们推出JoyAI-LLM Flash，这是一款高效的混合专家模型（Mixture-of-Experts，MoE），旨在重新定义参数规模低于500亿的模型在强大性能与计算效率之间的权衡关系。该模型基于20万亿token的大规模语料进行预训练，并通过严谨的后训练流程进一步优化，包括监督微调（Supervised Fine-Tuning，SFT）、直接偏好优化（Direct Preference Optimization，DPO）以及跨多样化环境的大规模强化学习（Reinforcement Learning，RL）。为提升计算效率，JoyAI-LLM Flash策略性地平衡了“思考”与“非思考”认知模式，并引入了FiberPO——一种受纤维化理论启发的新型强化学习算法，该算法将置信域维护分解为全局与局部组件，为大型语言模型策略优化提供了统一的多尺度稳定性控制。为增强架构稀疏性，模型共包含480亿参数，而每次前向传播仅激活27亿参数，其稀疏率显著高于当前业界同规模领先模型。为进一步提升推理吞吐量，我们采用训练-推理协同设计方法，融合了密集多token预测（Multi-Token Prediction，MTP）与量化感知训练（Quantization-Aware Training，QAT）。我们已在Hugging Face平台开源JoyAI-LLM-48B-A3B基础模型及其后训练变体的权重，以支持开源社区发展。

摘要 (Abstract)

We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances \emph{thinking} and \emph{non-thinking} cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.

关键词: Mixture-of-Experts, token efficiency, Direct Preference Optimization, supervised fine-tuning, Quantization-Aware Training, reinforcement learning, sparse models, inference throughput

43. ❌ ARM: Advantage Reward Modeling for Long-Horizon Manipulation

作者: Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03037v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于机器人强化学习中的长时程操作任务，提出了一种基于相对优势的奖励建模框架（ARM），用于解决稀疏奖励下的信用分配问题。所有评分关键词均与大语言模型、深度学习技术原理或AI在科学领域的应用直接相关，而本文研究的是纯粹的机器人强化学习问题，未涉及任何大模型技术、深度学习创新或AI在生物/化学等科学领域的应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对长时程机器人操作中稀疏奖励导致的信用分配难题，提出了优势奖励建模（ARM）框架，通过三状态标注策略实现自动化进度标注，在离线强化学习流程中集成后，在毛巾折叠任务上取得了99.4%的成功率，显著提升了稳定性和数据效率。

摘要翻译

长周期机器人操作对强化学习而言仍具挑战性，因为稀疏奖励难以提供有效的信度分配指导。实际策略改进通常依赖更丰富的中间监督信号，例如稠密进度奖励，但这类奖励获取成本高昂，且不适用于回溯、恢复等非单调行为。为此，我们提出优势奖励建模框架，该框架将评估重点从难以量化的绝对进度转向估计相对优势。我们设计了一种低成本的三态标注策略——进步、退步与停滞，在保证高标注者间一致性的同时显著降低人工认知负荷。通过基于这些直观信号进行训练，ARM能够对完整演示数据与碎片化的DAgger式数据实现自动化进度标注。将ARM集成至离线强化学习流程中，可实现自适应动作奖励重加权，有效过滤次优样本。在具有挑战性的长周期叠毛巾任务中，我们的方法取得了99.4%的成功率，相较于当前视觉语言动作基线，在策略训练阶段近乎零人工干预的情况下，展现出更高的稳定性与数据效率。

摘要 (Abstract)

Long-horizon robotic manipulation remains challenging for reinforcement learning (RL) because sparse rewards provide limited guidance for credit assignment. Practical policy improvement thus relies on richer intermediate supervision, such as dense progress rewards, which are costly to obtain and ill-suited to non-monotonic behaviors such as backtracking and recovery. To address this, we propose Advantage Reward Modeling (ARM), a framework that shifts from hard-to-quantify absolute progress to estimating relative advantage. We introduce a cost-effective tri-state labeling strategy – Progressive, Regressive, and Stagnant – that reduces human cognitive overhead while ensuring high cross-annotator consistency. By training on these intuitive signals, ARM enables automated progress annotation for both complete demonstrations and fragmented DAgger-style data. Integrating ARM into an offline RL pipeline allows for adaptive action-reward reweighting, effectively filtering suboptimal samples. Our approach achieves a 99.4% success rate on a challenging long-horizon towel-folding task, demonstrating improved stability and data efficiency over current VLA baselines with near-zero human intervention during policy training.

关键词: robotic manipulation, reinforcement learning, advantage reward modeling, long-horizon tasks, sparse rewards, offline RL, towel-folding, DAgger-style data

44. ❌ Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution

作者: KN Ajay Shastry, Ganesh Senrayan, Shrey Satapara, Pranoy Panda, Chaitanya Devaguptapu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03035v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于评估编码代理（coding agents）在序列软件演化任务中的表现，属于大模型在特定领域（软件开发）的应用研究。核心相关关键词是’LLM Agents OR Autonomous Agents OR Agentic Workflow’（评分10），因为论文明确研究编码代理的评估框架。‘Large Language Models OR LLMs OR Foundation Models’（评分8）相关，因为编码代理通常基于LLMs构建，但论文未深入探讨LLM技术本身。其他关键词如MoE、SFT、RAG等与论文内容无直接关联，评分为0。

!!! tip deepseek-chat TL;DR

该论文针对现有编码代理评估数据集仅关注孤立任务的问题，提出了一个评估框架和数据集SWE-STEPS，用于评估编码代理在序列软件演化任务中的表现，发现孤立评估会高估代理性能达20个百分点，且代理生成的代码比人类代码具有更高的认知复杂性和技术债务。

摘要翻译

现有编码智能体数据集以无状态方式评估孤立单次拉取请求任务的性能，未能反映现实软件开发中代码变更持续累积、技术债务逐步增加以及测试套件随时间扩展的动态过程。为弥补这一差距，我们提出一种自动化编码任务生成框架，该框架通过构建SWE-STEPS数据集，在两种模拟真实开发者工作流程的场景下评估智能体处理长周期任务的能力：基于迭代请求的对话式编码，以及基于单次项目需求文档的编码。与现有评估孤立拉取请求的数据集不同，本框架通过依赖链式拉取请求序列评估智能体表现，从而支持对顺序执行、回归验证及长期仓库健康状况的评估。研究发现，广泛使用的孤立PR评估会因忽略先前低效或缺陷代码的“溢出效应”，导致成功率虚高——相较于我们的评估设置，其性能评估偏差最高可达20个百分点。此外，分析表明即使智能体能成功解决问题，其生成代码相较于人类开发者仍具有更高的认知复杂性与技术债务，导致仓库健康状况恶化，这凸显了多维评估的必要性。

摘要 (Abstract)

Existing datasets for coding agents evaluate performance on isolated, single pull request (PR) tasks in a stateless manner, failing to capture the reality of real-world software development where code changes accumulate, technical debt accrues, and test suites grow over time. To bridge this gap, we introduce an automated coding task generation framework, which helps generate our dataset SWE-STEPS, that evaluates coding agents on long-horizon tasks through two realistic settings mirroring actual developer workflows: Conversational coding with iterative requests, and single-shot Project Requirement document (PRD)-based coding. Unlike existing datasets that evaluate agents on disjointed Pull Requests (PRs), our framework assesses performance across chains of dependent PRs, enabling evaluation of sequential execution, regression verification, and long-term repository health. We discover that widely used isolated PR evaluations yield inflated success rates, w.r.t. our settings - overshooting performance by as much as 20 percentage points - because they ignore the ``spillover’’ effects of previous inefficient or buggy code. Furthermore, our analysis reveals that even when agents successfully resolve issues, they degrade repository health by generating code with higher cognitive complexity and technical debt compared to human developers, underscoring the necessity for multidimensional evaluation.

关键词: coding agents, software evolution, sequential tasks, evaluation framework, SWE-STEPS dataset, long-horizon tasks, repository health, technical debt

45. ❌ Comparing the Impact of Pedagogy-Informed Custom and General-Purpose GAI Chatbots on Students’ Science Problem-Solving Processes and Performance Using Heterogeneous Interaction Network Analysis

作者: Hanyu Su, Huilin Zhang, Shihui Feng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03022v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文研究教育领域生成式AI（GAI）聊天机器人在科学问题解决中的应用，属于大模型在不同领域的研究应用。与"Large Language Models OR LLMs OR Foundation Models"（权重1.0）相关，因为GAI聊天机器人基于大语言模型技术，得8分；与"AI for Science OR Bioinformatics OR Cheminformatics"（权重1.0）相关，因为研究科学教育中的AI应用，属于AI for Science范畴，得8分。其他关键词涉及具体技术原理（如MoE、量化、推理加速等）或特定应用方向（如生物信息学），论文未涉及，得0分。加权总分计算：8×1.0 + 8×1.0 = 16.0。作者列表中未包含指定专家。

!!! tip deepseek-chat TL;DR

本研究比较了基于苏格拉底提问法的定制GAI聊天机器人与通用聊天机器人在中学生科学问题解决过程中的影响，发现定制聊天机器人能显著提高学生的互动强度和认知互动多样性，减少认知卸载，但两者在问题解决性能上没有显著差异。

摘要翻译

问题解决在科学教育中具有重要作用，而生成式人工智能（GAI）聊天机器人已成为支持学生科学问题解决的一种前景广阔的工具。然而，通用聊天机器人（如ChatGPT）通常直接提供现成答案，可能导致学生的认知卸载。先前研究很少关注用于促进学生科学问题解决的定制聊天机器人，也未深入比较其与通用聊天机器人在如何影响问题解决过程与表现方面的差异。为填补这一空白，我们基于苏格拉底提问法开发了一款融合教学理念的定制GAI聊天机器人，通过提出引导性问题来支持学生。本研究采用被试内平衡设计，48名中学生分别使用定制聊天机器人与通用聊天机器人完成两项科学问题解决任务。研究收集并运用异质交互网络分析（Heterogeneous Interaction Network Analysis, HINA）方法分析了3297组学生-聊天机器人对话。结果显示：（1）学生使用定制聊天机器人时，其交互强度与认知交互多样性均显著高于使用通用聊天机器人；（2）学生更倾向于遵循定制聊天机器人的引导进行思考与反思，而更可能要求通用聊天机器人执行具体指令；（3）两种聊天机器人条件下，以解决方案质量评估的学生问题解决表现未呈现统计学显著差异。本研究提供了新的理论视角与实证证据，表明相较于通用聊天机器人，定制聊天机器人更不易引发认知卸载，反而能促进更深层次的认知投入。本研究亦为GAI聊天机器人在科学教育中的设计与整合提供了实践启示。

摘要 (Abstract)

Problem solving plays an essential role in science education, and generative AI (GAI) chatbots have emerged as a promising tool for supporting students’ science problem solving. However, general-purpose chatbots (e.g., ChatGPT), which often provide direct, ready-made answers, may lead to students’ cognitive offloading. Prior research has rarely focused on custom chatbots for facilitating students’ science problem solving, nor has it examined how they differently influence problem-solving processes and performance compared to general-purpose chatbots. To address this gap, we developed a pedagogy-informed custom GAI chatbot grounded in the Socratic questioning method, which supports students by prompting them with guiding questions. This study employed a within-subjects counterbalanced design in which 48 secondary school students used both custom and general-purpose chatbot to complete two science problem-solving tasks. 3297 student-chatbot dialogues were collected and analyzed using Heterogeneous Interaction Network Analysis (HINA). The results showed that: (1) students demonstrated significantly higher interaction intensity and cognitive interaction diversity when using custom chatbot than using general-purpose chatbot; (2) students were more likely to follow custom chatbot’s guidance to think and reflect, whereas they tended to request general-purpose chatbot to execute specific commands; and (3) no statistically significant difference was observed in students’ problem-solving performance evaluated by solution quality between two chatbot conditions. This study provides novel theoretical insights and empirical evidence that custom chatbots are less likely to induce cognitive offloading and instead foster greater cognitive engagement compared to general-purpose chatbots. This study also offers insights into the design and integration of GAI chatbots in science education.

关键词: Generative AI chatbots, Science education, Problem solving, Socratic questioning, Cognitive offloading, Heterogeneous Interaction Network Analysis, Student engagement, Pedagogy-informed design

46. ❌ Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

作者: Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03016v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）的智能体能力，特别是工具调用（视觉工具和开放网络搜索）和过程验证评估。因此与’Large Language Models OR LLMs OR Foundation Models’（论文明确提及MLLMs）、‘LLM Agents OR Autonomous Agents OR Agentic Workflow’（论文聚焦agentic capability和agentic problem solving）以及’Tool Use OR Function Calling OR API Tool Use’（论文研究visual tools和open-web search的调用）高度相关，评10分。其他关键词如MoE、SLMs、Scaling Laws、训练方法、推理优化、模型压缩、科学AI应用等，论文未涉及，评0分。

!!! tip deepseek-chat TL;DR

该论文针对现有多模态大语言模型（MLLMs）在智能体能力评估上的不足，提出了一个过程验证的基准Agentic-MME，用于评估模型在真实世界多模态任务中调用工具（如视觉工具和网络搜索）的能力，实验结果显示当前最佳模型（Gemini3-pro）在复杂任务上表现仍显著不足。

摘要翻译

多模态大语言模型正从被动观察者演变为主动智能体，通过视觉扩展（调用视觉工具）与知识扩展（开放网络搜索）解决问题。然而，现有评估体系存在不足：缺乏灵活的工具集成、将视觉与搜索工具分开测试、且主要依赖最终答案进行评估。这导致无法验证工具是否实际被调用、是否正确应用或高效使用。为此，我们提出Agentic-MME——一个面向多模态智能体能力的流程验证基准。该基准包含6个领域、3种难度级别的418项现实世界任务，用于评估能力协同效应，并设计了超过2000个分步检查点，平均每项任务需10小时以上的人工标注。每项任务均配备统一评估框架，支持沙盒代码与API调用，同时提供人工参考轨迹，该轨迹沿双轴（S轴与V轴）标注了分步检查点。为实现真正的流程级验证，我们审计细粒度中间状态而非仅最终答案，并通过相对于人类轨迹的“过度思考”指标量化效率。实验结果表明，最佳模型Gemini3-pro整体准确率为56.3%，但在三级难度任务上骤降至23.0%，凸显出现实世界多模态智能体问题解决的挑战性。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify if tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along dual-axis: S-axis and V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls significantly to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.

关键词: Multimodal Large Language Models, Agentic Capability, Tool Use, Visual Expansion, Knowledge Expansion, Process-verified Benchmark, Multimodal Agentic Problem Solving, Stepwise Checkpoints

作者: Jing Du, Zesheng Ye, Congbo Ma, Feng Liu, Flora. D. Salim 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03014v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究多模态推荐系统，提出了一种基于条件生成总相关学习的框架GTC，使用交互引导的扩散模型进行用户感知的内容特征过滤，并优化跨模态总相关的下界。论文内容聚焦于推荐系统、多模态学习、扩散模型和特征过滤，与所有评分关键词（均涉及大模型、深度学习技术原理或AI for Science的具体技术）完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态推荐系统中忽略用户条件偏好和跨模态高阶依赖的问题，提出了条件生成总相关学习框架GTC，通过用户感知的内容特征过滤和总相关优化，在标准基准测试中实现了高达28.30%的性能提升。

摘要翻译

多模态推荐通过引入物品内容（如视觉与文本描述）来丰富物品表征，从而改进仅基于交互的推荐系统。其成功关键在于将内容模态与从交互数据中推导出的用户偏好对齐，然而当前主流方法基于从模态特定偏好无关噪声中解耦出模态不变偏好驱动信号的思路存在缺陷。首先，这些方法假设物品内容对所有用户的偏好具有普适相关性，这与偏好取决于用户条件的客观事实相矛盾。其次，它们分别优化成对对比损失以实现跨模态对齐，系统性地忽略了当多种内容模态共同影响用户选择时固有的高阶依赖关系。本文提出GTC——一种条件性生成式总相关学习框架。我们采用交互引导的扩散模型进行用户感知的内容特征过滤，仅保留与每位用户相关的个性化特征。此外，为捕捉完整的跨模态依赖关系，我们优化了所有模态间物品表征总相关的一个可处理下界。在标准多模态推荐基准测试上的实验表明，GTC始终优于现有最优方法，在NDCG@5指标上最高提升28.30%。消融研究验证了条件性偏好驱动特征过滤与总相关优化的有效性，证实了GTC在多模态推荐任务中建模用户条件关系的能力。代码发布于：https://github.com/jingdu-cs/GTC。

摘要 (Abstract)

Multi-modal recommendation (MMR) enriches item representations by introducing item content, e.g., visual and textual descriptions, to improve upon interaction-only recommenders. The success of MMR hinges on aligning these content modalities with user preferences derived from interaction data, yet dominant practices based on disentangling modality-invariant preference-driving signals from modality-specific preference-irrelevant noises are flawed. First, they assume a one-size-fits-all relevance of item content to user preferences for all users, which contradicts the user-conditional fact of preferences. Second, they optimize pairwise contrastive losses separately toward cross-modal alignment, systematically ignoring higher-order dependencies inherent when multiple content modalities jointly influence user choices. In this paper, we introduce GTC, a conditional Generative Total Correlation learning framework. We employ an interaction-guided diffusion model to perform user-aware content feature filtering, preserving only personalized features relevant to each individual user. Furthermore, to capture complete cross-modal dependencies, we optimize a tractable lower bound of the total correlation of item representations across all modalities. Experiments on standard MMR benchmarks show GTC consistently outperforms state-of-the-art, with gains of up to 28.30% in NDCG@5. Ablation studies validate both conditional preference-driven feature filtering and total correlation optimization, confirming the ability of GTC to model user-conditional relationships in MMR tasks. The code is available at: https://github.com/jingdu-cs/GTC.

关键词: multi-modal recommendation, conditional generative learning, total correlation, diffusion model, user-aware filtering, cross-modal alignment, personalized features, higher-order dependencies

48. ❌ FedSQ: Optimized Weight Averaging via Fixed Gating

作者: Cristian Pérez-Corral, Jose I. Mestre, Alberto Fernández-Hernández, Manuel F. Dolz, José Duato, Enrique S. Quintana-Ortí 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02990v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于联邦学习（FL）中的权重平均优化，提出FedSQ方法，通过冻结预训练模型的结构副本（固定门控掩码）来稳定异构数据下的聚合。核心相关关键词是’Model Merging OR Model Soups OR Weight Averaging’（高度相关，10分），因为论文直接研究权重平均的改进。‘Pre-training OR Continual Pre-training OR Domain Adaptation’和’Post-training OR Supervised Fine-tuning OR SFT’各得5分，因为论文涉及从预训练骨干网络进行迁移初始化和联邦微调。其他关键词（如LLMs、MoE、AI for Science等）与论文的卷积神经网络和联邦学习焦点无关，得0分。

!!! tip deepseek-chat TL;DR

论文提出FedSQ方法，通过冻结预训练模型的结构副本并仅优化定量副本，改进了联邦学习中的权重平均，从而在异构数据下提高了稳定性和性能。

摘要翻译

联邦学习（FL）使得跨机构协作训练无需共享原始数据成为可能，但其发展受到统计异质性（客户端数据非独立同分布）以及客户端漂移下简单权重平均的不稳定性的制约。在许多跨机构部署场景中，联邦学习通常从一个强大的预训练骨干网络（例如ImageNet-1K）进行热启动，随后适配至本地领域。受近期研究启发——类ReLU门控机制（结构性知识）的稳定早于其余参数值（量化性知识）的稳定，我们提出FedSQ（联邦结构-量化学习），这是一种基于深度网络双副本、分段线性视角的迁移初始化神经联邦学习方法。FedSQ冻结预训练模型的结构副本，以在联邦微调期间诱导固定的二值门控掩码，同时仅对量化副本进行本地优化并在多轮间聚合。固定门控机制将学习过程简化为机制内的仿射优化，从而增强了异质数据划分下聚合的稳定性。在独立同分布和狄利克雷划分下对两种卷积神经网络骨干进行的实验表明，FedSQ提升了鲁棒性，并能在保持迁移设定中准确性的同时，相对于标准基线减少达到最佳验证性能所需的训练轮数。

摘要 (Abstract)

Federated learning (FL) enables collaborative training across organizations without sharing raw data, but it is hindered by statistical heterogeneity (non-i.i.d.\ client data) and by instability of naive weight averaging under client drift. In many cross-silo deployments, FL is warm-started from a strong pretrained backbone (e.g., ImageNet-1K) and then adapted to local domains. Motivated by recent evidence that ReLU-like gating regimes (structural knowledge) stabilize earlier than the remaining parameter values (quantitative knowledge), we propose FedSQ (Federated Structural-Quantitative learning), a transfer-initialized neural federated procedure based on a DualCopy, piecewise-linear view of deep networks. FedSQ freezes a structural copy of the pretrained model to induce fixed binary gating masks during federated fine-tuning, while only a quantitative copy is optimized locally and aggregated across rounds. Fixing the gating reduces learning to within-regime affine refinements, which stabilizes aggregation under heterogeneous partitions. Experiments on two convolutional neural network backbones under i.i.d.\ and Dirichlet splits show that FedSQ improves robustness and can reduce rounds-to-best validation performance relative to standard baselines while preserving accuracy in the transfer setting.

关键词: Federated Learning, Weight Averaging, Model Aggregation, Transfer Learning, Non-IID Data, Convolutional Neural Networks, Fine-tuning, Client Drift

49. ❌ Self-Optimizing Multi-Agent Systems for Deep Research

作者: Arthur Câmara, Vincent Slot, Jakub Zavrel 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02988v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多智能体系统在深度研究任务中的应用，核心是LLM驱动的智能体协作和自我优化。高度相关关键词：LLM Agents（10分，核心内容）、Multi-agent Systems（10分，核心主题）。相关关键词：Large Language Models（8分，系统基础）、Retrieval-Augmented Generation（8分，涉及检索证据）、Self-Correction（8分，自我优化包含自我改进）。中等相关：Chain of Thought（5分，涉及推理过程）、System 2 Thinking（5分，深度研究需要深入推理）、Tool Use（5分，智能体可能使用工具）。其余关键词与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文研究了如何通过多智能体自我优化方法改进深度研究系统，使智能体能够通过自我探索优化提示组合，从而匹配或超越专家手工设计的提示性能。

摘要翻译

针对用户的复杂信息需求，多智能体深度研究系统通过迭代式规划、检索与证据合成，跨越数百份文档生成高质量答案。在一种可能的架构中，编排智能体协调整体流程，并行工作智能体则执行具体任务。然而，当前的深度研究系统通常依赖人工设计的提示词与静态架构，导致系统改进过程脆弱、昂贵且耗时。为此，我们探索了多种多智能体优化方法，研究表明：通过使智能体进行自我博弈并探索不同的提示词组合，能够构建出与专家设计提示词效果相当或更优的高质量深度研究系统。

摘要 (Abstract)

Given a user’s complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. We therefore explore various multi-agent optimization methods to show that enabling agents to self-play and explore different prompt combinations can produce high-quality Deep Research systems that match or outperform expert-crafted prompts.

关键词: multi-agent systems, deep research, self-optimizing, LLM agents, prompt optimization, retrieval, synthesis, orchestrator agent

50. ❌ Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

作者: Shinnosuke Ono, Johannes Ackermann, Soichiro Nishimori, Takashi Ishida, Masashi Sugiyama 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02986v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究RLHF中的奖励黑客问题，提出SignCert-PO方法，因此与’RLHF’高度相关（10分）。论文涉及对齐和奖励模型，与’Instruction Tuning/Alignment’相关（8分）。论文基于大模型应用背景，与’Large Language Models’相关（8分）。其他关键词如MoE、SLMs、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了RLHF中奖励黑客问题，提出SignCert-PO方法通过认证优势符号鲁棒性来减轻奖励黑客，在TL;DR和AlpacaFarm基准上提高了胜率并减少了奖励黑客。

摘要翻译

基于人类反馈的强化学习（RLHF）中使用的奖励模型（RMs）易受奖励破解的影响：当策略最大化习得的代理奖励时，真实质量会停滞或下降。我们假设奖励破解通常由优势符号翻转引起：符号翻转会导致更新增加不良响应的可能性，而非降低其可能性。通过在RM参数空间中考虑对抗性扰动，我们可以推导出一个经认证的符号保持半径，即在策略优化过程中能够翻转优势符号的最小扰动。基于此公式，我们提出符号认证策略优化（SignCert-PO），在策略梯度更新中对非鲁棒性补全进行降权处理。与先前需要多个RMs或访问RM训练数据的方法不同，SignCert-PO是轻量级的，仅利用RM参数和同策略补全，在策略优化阶段即可运行。在TL;DR摘要和AlpacaFarm基准测试中，SignCert-PO始终比基线方法获得更高的胜率，并有效减少了奖励破解现象。

摘要 (Abstract)

Reward models (RMs) used in reinforcement learning from human feedback (RLHF) are vulnerable to reward hacking: as the policy maximizes a learned proxy reward, true quality plateaus or degrades. We make the assumption that reward hacking is often caused by flipped advantage signs: instead of reducing the likelihood of a bad response, a flipped sign causes the update to increase it. By considering an adversarial perturbation in the RM parameter space, we can derive a certified sign-preservation radius, which is the smallest perturbation that can flip the advantage sign during policy optimization. Based on this formulation, we propose Sign-Certified Policy Optimization (SignCert-PO), down-weighting non-robust completions in the policy gradient update. Unlike prior approaches that require multiple RMs or access to the RM training data, SignCert-PO is lightweight and operates purely at the policy optimization stage using only the RM parameters and on-policy completions. On TL;DR summarization and AlpacaFarm benchmarks, SignCert-PO consistently achieves a better win rate than baselines and reduces reward hacking.

关键词: Reward Hacking, RLHF, Reward Models, Policy Optimization, Advantage Sign Robustness, SignCert-PO, Human Feedback, Certified Robustness

51. ❌ Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

作者: Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02985v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM推理加速中的提示压缩技术，与’Large Language Models’高度相关（10分），因为直接研究LLM推理；与’Retrieval-Augmented Generation’高度相关（10分），因为论文明确提到RAG系统是研究背景；与’Speculative Decoding OR Inference Acceleration’高度相关（10分），因为论文研究提示压缩以加速推理；与’Context Window Extension OR Long Context LLMs’有一定关联（5分），因为论文涉及长上下文导致的延迟问题；与’Quantization OR Model Compression OR Low-bit Weights’有一定关联（5分），因为提示压缩是一种压缩技术，但论文未涉及模型权重压缩。其他关键词与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文系统研究了提示压缩技术在加速大型语言模型推理时的实际效果，发现当提示长度、压缩比和硬件配置匹配时，LLMLingua方法可实现高达18%的端到端加速且保持输出质量不变，但压缩开销可能抵消收益，并开发了开源分析器来预测不同配置下的延迟平衡点。

摘要翻译

随着语言模型在信息检索（IR）——特别是检索增强生成（RAG）系统中的广泛应用，底层大语言模型（LLM）的延迟已成为关键瓶颈，因为检索到的长文本段落会导致提示词规模增大，从而增加计算负担。提示词压缩技术通过缩减输入提示词的规模，同时力求保持下游任务性能，已成为一种加速大语言模型推理的经济高效且低延迟的方法。然而，其实际效用取决于生成过程中额外的预处理时间是否能被更快的解码速度所抵消。我们首次对此权衡进行了系统性、大规模的研究，在多个开源LLM和三类GPU上进行了数千次运行和30,000次查询测试。我们的评估将压缩开销与解码延迟分开衡量，同时追踪输出质量和内存使用情况。当提示词长度、压缩比率与硬件能力良好匹配时，LLMLingua实现了高达18%的端到端加速，且在摘要、代码生成和问答任务中，响应质量在统计上保持不变。然而，在此运行窗口之外，压缩步骤会占据主导地位并抵消收益。我们还证明，有效的压缩可以显著降低内存使用，足以将工作负载从数据中心GPU卸载至消费级显卡，而延迟仅增加0.3秒。我们开源的分析器能够预测每种模型-硬件配置下的延迟平衡点，为提示词压缩何时能带来实际效益提供了实用指导。

摘要 (Abstract)

With the wide adoption of language models for IR – and specifically RAG systems – the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead large prompts and therefore, compute increase. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time during generation is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups, when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.

关键词: Prompt Compression, LLM Inference, Latency Optimization, RAG Systems, Decoding Acceleration, Memory Usage Reduction, Hardware Efficiency, End-to-end Speed-up

52. ❌ InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking

作者: Ka Yiu Lee, Yuxuan Huang, Zhiyuan He, Huichi Zhou, Weilin Luo, Kun Shao, Meng Fang, Jun Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02971v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种基于大语言模型的多智能体框架（InfoSeeker），用于解决大规模网络信息搜索中的挑战，因此与LLM、多智能体系统、深度推理等关键词高度相关。论文的核心创新在于分层并行架构（Host-Manager-Worker），通过上下文隔离和并行执行来解决现有LLM智能体系统在数据密集型场景中的局限性（如上下文饱和、错误传播、高延迟）。与RAG、工具使用等关键词有一定关联，但论文主要关注智能体架构而非检索增强本身。其他关键词如MoE、量化、科学AI等与论文内容无关。

!!! tip deepseek-chat TL;DR

论文提出了一种分层并行智能体框架InfoSeeker，通过Host-Manager-Worker架构和上下文隔离机制，解决了现有大语言模型智能体在大规模网络信息搜索中面临的上下文饱和、错误传播和高延迟问题，在基准测试中实现了3-5倍的速度提升和显著的准确性改进。

摘要翻译

近期基于智能体的搜索系统通过强调深度、多步骤推理取得了显著进展。然而，这种关注往往忽视了大规模信息综合的挑战——智能体必须聚合来自多源的大量异构证据。因此，现有大多数基于大语言模型的智能体系统在数据密集型场景下面临严重局限，包括上下文饱和、级联错误传播以及高昂的端到端延迟。为应对这些挑战，我们提出\framework，一个基于近可分解性原理的层次化框架，包含一个策略性的\textit{主机}、多个\textit{管理器}以及并行的\textit{工作器}。通过在管理层利用聚合与反思机制，我们的框架实施严格的上下文隔离以防止饱和与错误传播。同时，工作器层的并行化加速了整体任务执行速度，显著降低了延迟。我们在两个互补基准测试上的评估结果证明了该框架的效率（$3-5$倍加速）与有效性，在WideSearch-en上实现了$8.4%$的成功率，在BrowseComp-zh上达到了$52.9%$的准确率。代码已发布于https://github.com/agent-on-the-fly/InfoSeeker

摘要 (Abstract)

Recent agentic search systems have made substantial progress by emphasising deep, multi-step reasoning. However, this focus often overlooks the challenges of wide-scale information synthesis, where agents must aggregate large volumes of heterogeneous evidence across many sources. As a result, most existing large language model agent systems face severe limitations in data-intensive settings, including context saturation, cascading error propagation, and high end-to-end latency. To address these challenges, we present \framework, a hierarchical framework based on principle of near-decomposability, containing a strategic \textit{Host}, multiple \textit{Managers} and parallel \textit{Workers}. By leveraging aggregation and reflection mechanisms at the Manager layer, our framework enforces strict context isolation to prevent saturation and error propagation. Simultaneously, the parallelism in worker layer accelerates the speed of overall task execution, mitigating the significant latency. Our evaluation on two complementary benchmarks demonstrates both efficiency ($ 3-5 \times$ speed-up) and effectiveness, achieving a $8.4%$ success rate on WideSearch-en and $52.9%$ accuracy on BrowseComp-zh. The code is released at https://github.com/agent-on-the-fly/InfoSeeker

关键词: LLM Agents, Multi-agent Systems, Hierarchical Framework, Parallel Processing, Web Information Seeking, Context Isolation, Error Propagation, Latency Reduction

53. ❌ LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

作者: Yilin Xiao, Jin Chen, Qinggang Zhang, Yujing Zhang, Chuang Zhou, Longhao Yang, Lingfei Ren, Xin Yang, Xiao Huang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02954v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究GraphRAG系统（RAG的图增强变体）的安全漏洞，与’Retrieval-Augmented Generation’高度相关（10分），并明确涉及LLMs（10分）。论文关注逻辑推理攻击，与’Chain of Thought’和’System 2 Thinking’有一定关联（各5分），攻击旨在破坏事实性推理，与’Hallucination Mitigation’主题相关但角度相反（5分）。其他关键词如MoE、SFT、RLHF等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

论文提出LogicPoison攻击框架，通过逻辑连接破坏而非内容注入来攻击GraphRAG系统，实验表明该攻击能有效绕过现有防御并显著降低系统性能。

摘要翻译

基于图结构的检索增强生成（Graph-based Retrieval-Augmented Generation，简称GraphRAG）通过将大语言模型（LLMs）的响应建立在结构化知识图谱之上，从而增强其推理能力。借助社区检测与关系过滤技术，GraphRAG系统展现出对传统RAG攻击（如文本投毒和提示注入）的固有抵抗力。然而，本文发现，GraphRAG系统的安全性根本上依赖于底层图谱的拓扑完整性，而这一完整性可能通过隐式破坏逻辑连接（无需改变表层文本语义）而被削弱。为利用此漏洞，我们提出了一种新型攻击框架——\textsc{LogicPoison}，其目标在于逻辑推理而非注入虚假内容。具体而言，\textsc{LogicPoison}采用一种类型保持的实体交换机制，既扰动全局逻辑枢纽以破坏整体图连通性，也针对查询特定的推理桥梁以切断关键的多跳推理路径。该方法在保持表层文本合理性的同时，有效地将有效推理引向死胡同。在多个基准测试上的综合实验表明，\textsc{LogicPoison}成功绕过了GraphRAG的防御机制，显著降低了其性能，并在攻击效果与隐蔽性方面均优于现有先进基线方法。我们的代码公开于 \textcolor{blue}https://github.com/Jord8061/logicPoison。

摘要 (Abstract)

Graph-based Retrieval-Augmented Generation (GraphRAG) enhances the reasoning capabilities of Large Language Models (LLMs) by grounding their responses in structured knowledge graphs. Leveraging community detection and relation filtering techniques, GraphRAG systems demonstrate inherent resistance to traditional RAG attacks, such as text poisoning and prompt injection. However, in this paper, we find that the security of GraphRAG systems fundamentally relies on the topological integrity of the underlying graph, which can be undermined by implicitly corrupting the logical connections, without altering surface-level text semantics. To exploit this vulnerability, we propose \textsc{LogicPoison}, a novel attack framework that targets logical reasoning rather than injecting false contents. Specifically, \textsc{LogicPoison} employs a type-preserving entity swapping mechanism to perturb both global logic hubs for disrupting overall graph connectivity and query-specific reasoning bridges for severing essential multi-hop inference paths. This approach effectively reroutes valid reasoning into dead ends while maintaining surface-level textual plausibility. Comprehensive experiments across multiple benchmarks demonstrate that \textsc{LogicPoison} successfully bypasses GraphRAG’s defenses, significantly degrading performance and outperforming state-of-the-art baselines in both effectiveness and stealth. Our code is available at \textcolor{blue}https://github.com/Jord8061/logicPoison.

关键词: GraphRAG, Logical Attacks, Entity Swapping, Reasoning Bridges, Topological Integrity, Multi-hop Inference, Stealth Attacks, Knowledge Graphs

作者: Maciej Markiewicz, Beata Bajcar, Wiktoria Mieleszczenko-Kowszewicz, Aleksander Szczęsny, Tomasz Adamczyk, Grzegorz Chodak, Karolina Ostrowska, Aleksandra Sawczuk, Jolanta Babiak, Jagoda Szklarczyk, Przemysław Kazienko 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02951v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究人类数据标注过程中标注者能力的变化，并使用了LLM来训练和评估标注数据，因此与’Large Language Models OR LLMs OR Foundation Models’有一定相关性（8分）。论文未涉及其他关键词所代表的大模型技术原理创新或具体应用领域，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该研究探讨了在社会影响力识别任务中，标注过程如何影响标注者的能力发展，发现标注者的自我感知能力和信心显著提升，且这种能力变化对基于其标注数据训练的LLM性能产生可见影响。

摘要翻译

人类数据标注，尤其是涉及专家的标注，常被视为客观参照标准。然而，许多标注任务本质上是主观的，且标注者的判断可能随时间演变。本研究从能力视角出发，探究在社会影响力识别过程中标注者工作质量的变化。研究招募了来自五个不同群体的25名标注者，包括专家与非专家，对包含1,021段对话的数据集进行了标注，标注内容涉及20种社会影响力技术，以及意图、反应和后果。为便于比较，我们从数据集中选取150个文本作为初始子集，在主标注流程前后分别进行了两次标注。为衡量能力变化，我们综合采用了以下方法：对标注数据的定性与定量分析、与标注者的半结构化访谈、自我评估问卷，以及在对比数据集上对大语言模型（Large Language Model, LLM）进行训练与评估。结果表明，标注者的自我感知能力与信心显著提升。此外，数据质量的变化显示，标注过程可能提升了标注者的能力，且这种效应在专家群体中更为明显。标注者能力的变化对基于其标注数据训练的LLM性能产生了可见影响。

摘要 (Abstract)

Human data annotation, especially when involving experts, is often treated as an objective reference. However, many annotation tasks are inherently subjective, and annotators’ judgments may evolve over time. This study investigates changes in the quality of annotators’ work from a competence perspective during a process of social influence recognition. The study involved 25 annotators from five different groups, including both experts and non-experts, who annotated a dataset of 1,021 dialogues with 20 social influence techniques, along with intentions, reactions, and consequences. An initial subset of 150 texts was annotated twice - before and after the main annotation process - to enable comparison. To measure competence shifts, we combined qualitative and quantitative analyses of the annotated data, semi-structured interviews with annotators, self-assessment surveys, and Large Language Model training and evaluation on the comparison dataset. The results indicate a significant increase in annotators’ self-perceived competence and confidence. Moreover, observed changes in data quality suggest that the annotation process may enhance annotator competence and that this effect is more pronounced in expert groups. The observed shifts in annotator competence have a visible impact on the performance of LLMs trained on their annotated data.

关键词: human data annotation, competence development, social influence recognition, annotator quality, Large Language Model training, expert vs non-expert, data quality, self-assessment

55. ❌ AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

作者: Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, Yanming Guo 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02947v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究计算机使用代理（computer-use agents）的安全评估，这些代理基于语言模型构建，涉及工具使用和自主工作流，因此与’LLM Agents/Autonomous Agents/Agentic Workflow’、‘Tool Use/Function Calling/API Tool Use’高度相关（10分）。论文评估代理的有害行为，直接涉及对齐和安全问题，与’Instruction Tuning/Alignment/Value Alignment’高度相关（10分）。论文以语言模型为基础，与’Large Language Models/LLMs/Foundation Models’高度相关（10分）。其他关键词如MoE、量化、推理加速、科学AI等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了AgentHazard基准，用于评估计算机使用代理在序列化工具操作中产生的有害行为，实验发现当前系统（如Claude Code）在Qwen3-Coder驱动下攻击成功率高达73.63%，表明仅模型对齐不足以保证自主代理的安全。

摘要翻译

计算机使用智能体将语言模型的能力从文本生成扩展到对工具、文件及执行环境的持续操作。与聊天系统不同，这类智能体能在多次交互间保持状态，并将中间输出转化为具体行动。这带来了一项独特的安全挑战：有害行为可能通过一系列各自看似合理的步骤逐步显现，其中包括局部可接受但整体会导致未授权操作的中间行为。本文提出 AgentHazard，一个用于评估计算机使用智能体有害行为的基准测试。AgentHazard 包含 2,653 个测试实例，涵盖多种风险类别与攻击策略。每个实例将一个有害目标与一系列操作步骤配对，这些步骤在局部是合法的，但整体会引发不安全行为。该基准测试评估智能体能否识别并阻断由累积上下文、重复工具使用、中间行为以及跨步骤依赖所导致的危害。我们在 Claude Code、OpenClaw 和 IFlow 系统上对 AgentHazard 进行了评估，主要使用了来自 Qwen3、Kimi、GLM 和 DeepSeek 系列的开源或可公开部署模型。实验结果表明，现有系统仍存在高度脆弱性。特别是当搭载 Qwen3-Coder 时，Claude Code 的攻击成功率高达 73.63%，这表明仅依靠模型对齐并不能可靠保障自主智能体的安全性。

摘要 (Abstract)

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present \textbf{AgentHazard}, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains \textbf{2,653} instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt harm arising from accumulated context, repeated tool use, intermediate actions, and dependencies across steps. We evaluate AgentHazard on Claude Code, OpenClaw, and IFlow using mostly open or openly deployable models from the Qwen3, Kimi, GLM, and DeepSeek families. Our experimental results indicate that current systems remain highly vulnerable. In particular, when powered by Qwen3-Coder, Claude Code exhibits an attack success rate of \textbf{73.63%}, suggesting that model alignment alone does not reliably guarantee the safety of autonomous agents.

关键词: computer-use agents, language models, harmful behavior, safety evaluation, tool use, autonomous agents, benchmark, alignment

56. ❌ Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

作者: Koshiro Nagano, Ryo Fujii, Ryo Hachiuma, Fumiaki Sato, Taiki Sekii, Hideo Saito 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02946v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于利用合成数据训练视觉模型（如对象定位、动作定位、图像分类）的方法创新，提出了一种基于来源信息的输入梯度引导框架，以抑制模型对非目标区域的依赖。论文的核心是计算机视觉中的合成数据训练和模型鲁棒性提升，未涉及大语言模型（LLM）、深度学习技术原理（如MoE、缩放定律、微调方法等）、AI代理、推理技术、模型压缩或科学AI应用等关键词领域。所有关键词均与论文内容无关，因此相关度评分均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种利用合成数据来源信息进行输入梯度引导的学习框架，通过抑制模型对非目标区域的梯度依赖，提升了在弱监督对象定位、时空动作定位和图像分类等任务中的判别表示学习效果。

摘要翻译

利用合成数据的学习方法作为一种有效途径受到关注，它能在降低数据收集成本的同时增加训练数据的多样性，从而提升模型判别鲁棒性。然而，现有方法大多仅通过训练样本的多样化间接提升鲁棒性，并未明确指导模型应关注输入空间中哪些真正有助于判别的区域；因此，模型可能学习到由合成偏差和伪影导致的虚假相关性。基于此局限性，本文提出一种学习框架，利用训练数据合成过程中获取的来源信息——即输入空间中每个区域是否源自目标物体——作为辅助监督信号，以促进模型获取专注于目标区域的表征。具体而言，该方法在合成过程中依据目标区域与非目标区域信息对输入梯度进行分解，并引入输入梯度引导以抑制非目标区域的梯度。这有效抑制了模型对非目标区域的依赖，直接促进了针对目标区域的判别表征学习。实验证明，所提方法在弱监督目标定位、时空动作定位及图像分类等多个任务与模态中均表现出有效性和通用性。

摘要 (Abstract)

Learning methods using synthetic data have attracted attention as an effective approach for increasing the diversity of training data while reducing collection costs, thereby improving the robustness of model discrimination. However, many existing methods improve robustness only indirectly through the diversification of training samples and do not explicitly teach the model which regions in the input space truly contribute to discrimination; consequently, the model may learn spurious correlations caused by synthesis biases and artifacts. Motivated by this limitation, this paper proposes a learning framework that uses provenance information obtained during the training data synthesis process, indicating whether each region in the input space originates from the target object, as an auxiliary supervisory signal to promote the acquisition of representations focused on target regions. Specifically, input gradients are decomposed based on information about target and non-target regions during synthesis, and input gradient guidance is introduced to suppress gradients over non-target regions. This suppresses the model’s reliance on non-target regions and directly promotes the learning of discriminative representations for target regions. Experiments demonstrate the effectiveness and generality of the proposed method across multiple tasks and modalities, including weakly supervised object localization, spatio-temporal action localization, and image classification.

关键词: synthetic data, provenance information, input gradient guidance, weakly supervised object localization, spatio-temporal action localization, discriminative representations, training data synthesis, model robustness

57. ❌ Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus

作者: Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02923v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM幻觉和偏见问题，提出多智能体共识框架Council Mode，因此与’Large Language Models’、‘Hallucination Mitigation’、‘LLM Agents’、‘Multi-agent Systems’高度相关（10分）。论文提到MoE架构作为背景，但非核心创新，给8分。‘Self-Correction’相关但非直接方法，给5分。其他关键词如SLMs、训练方法、推理加速等未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Council Mode的多智能体共识框架，通过并行调用多个异构前沿LLM并利用共识模型合成输出，有效减少了LLM的幻觉和偏见，在HaluEval基准上幻觉率相对降低35.9%，在TruthfulQA上提升7.8分。

摘要翻译

大语言模型（LLMs），特别是那些采用混合专家（Mixture-of-Experts, MoE）架构的模型，已在多种自然语言处理任务中展现出卓越的能力。然而，这些模型常常存在幻觉问题——即生成看似合理但事实错误的内容——并表现出系统性偏见，这些偏见在推理过程中因专家激活不均衡而被放大。本文提出“委员会模式”（Council Mode），一种新颖的多智能体共识框架，通过将查询并行分发给多个异构的前沿大语言模型，并利用一个专用的共识模型综合它们的输出，以应对上述局限。委员会流程分三个阶段运行：（1）一个基于查询复杂度进行路由的智能分类器；（2）在架构多样的模型间进行并行专家生成；（3）一个结构化的共识合成阶段，该阶段在生成最终响应前，会明确识别共识点、分歧点以及独特发现。我们在一个开源AI工作空间中实现并评估了该架构。我们在多个基准测试上的综合评估表明，与性能最佳的单一模型相比，委员会模式在HaluEval基准上的幻觉率相对降低了35.9%，在TruthfulQA基准上提升了7.8个百分点，同时在不同领域保持了显著更低的偏见方差。我们提供了共识机制的数学公式，详述了系统架构，并展示了包含消融实验在内的广泛实证结果。

摘要 (Abstract)

Large Language Models (LLMs), particularly those employing Mixture-of-Experts (MoE) architectures, have achieved remarkable capabilities across diverse natural language processing tasks. However, these models frequently suffer from hallucinations – generating plausible but factually incorrect content – and exhibit systematic biases that are amplified by uneven expert activation during inference. In this paper, we propose the Council Mode, a novel multi-agent consensus framework that addresses these limitations by dispatching queries to multiple heterogeneous frontier LLMs in parallel and synthesizing their outputs through a dedicated consensus model. The Council pipeline operates in three phases: (1) an intelligent triage classifier that routes queries based on complexity, (2) parallel expert generation across architecturally diverse models, and (3) a structured consensus synthesis that explicitly identifies agreement, disagreement, and unique findings before producing the final response. We implement and evaluate this architecture within an open-source AI workspace. Our comprehensive evaluation across multiple benchmarks demonstrates that the Council Mode achieves a 35.9% relative reduction in hallucination rates on the HaluEval benchmark and a 7.8-point improvement on TruthfulQA compared to the best-performing individual model, while maintaining significantly lower bias variance across domains. We provide the mathematical formulation of the consensus mechanism, detail the system architecture, and present extensive empirical results with ablation studies.

关键词: Large Language Models, Hallucination Mitigation, Multi-agent Systems, Consensus Framework, Bias Reduction, Mixture-of-Experts, LLM Agents, Truthfulness

58. ❌ Split and Conquer Partial Deepfake Speech

作者: Inbal Rimon, Oren Gal, Haim Permuter 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02913v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于音频深度伪造检测，提出了一种两阶段框架（边界检测和片段级分类）来识别部分伪造语音。虽然属于AI应用领域，但论文内容与所有评分关键词（均围绕大模型、深度学习技术原理及其在科学领域的应用）完全无关。论文未涉及任何大模型技术、训练方法、推理优化、对齐技术、代理系统或科学AI应用，而是专注于音频信号处理和分类任务。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于部分深度伪造语音检测的分治框架，通过边界检测和片段级分类两阶段方法，在PartialSpoof和Half-Truth数据集上实现了最先进的性能。

摘要翻译

局部深度伪造语音检测需要识别可能出现在真实语音片段中短暂时间区域内的篡改部分，这使得传统话语级分类器面临特殊挑战。我们提出一种"分而治之"的框架，将问题分解为两个阶段：边界检测和片段级分类。专用边界检测器首先识别时间转换点，将音频信号划分为预期包含声学一致内容的片段。随后对每个生成的片段进行独立评估，以确定其对应真实语音还是伪造语音。
该框架通过显式分离时间定位与真实性评估简化了学习目标，使每个组件能够专注于定义明确的任务。为进一步提升鲁棒性，我们提出基于反射的多长度训练策略，将可变时长片段转换为多个固定输入长度，生成多样化的特征空间表示。每个阶段均采用具有不同特征提取器和数据增强策略的多种配置进行训练，并通过融合其互补预测结果获得优化的最终模型。
在PartialSpoof基准测试上的实验表明，该方法在多种时间分辨率及话语层面均达到最先进的性能水平，在伪造区域的精确检测与定位方面实现显著提升。此外，所提方法在Half-Truth数据集上同样取得最优性能，进一步证实了该框架的鲁棒性与泛化能力。

摘要 (Abstract)

Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.

关键词: deepfake speech detection, partial spoof detection, boundary detection, segment-level classification, audio forensics, temporal localization, multi-length training, feature fusion

59. ❌ Corporations Constitute Intelligence

作者: Gilad Abiri 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02912v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文分析Anthropic的AI治理文件，核心讨论AI价值观、道德地位和治理的民主合法性，与’Instruction Tuning OR Alignment OR Value Alignment’高度相关（10分），因为直接涉及AI对齐和价值观设定。与’Large Language Models OR LLMs OR Foundation Models’有一定关联（5分），因为以Claude模型为例。其他关键词主要涉及技术细节、应用或科学AI，论文未涉及，故均为0分。

!!! tip deepseek-chat TL;DR

论文分析了Anthropic的AI治理文件，指出其存在军事应用豁免和民主参与缺失的结构性缺陷，认为AI治理缺乏民主合法性，需要建立政治共同体来制定原则。

摘要翻译

2026年1月，Anthropic公司为其人工智能模型Claude发布了一份长达79页的“宪法”，这是迄今为止企业发布的最全面的人工智能治理文件。本文首次对该文件进行了法律与民主理论层面的分析。尽管该宪法在哲学层面具有真正的精密性，但其存在两处结构性缺陷：首先，它排除了伦理约束最为关键的应用场景——部署于美国军方的模型遵循不同的规则，这一漏洞在伊朗军事打击事件中暴露无遗：当时尽管美国政府已全面禁止使用Anthropic技术，Claude却仍被嵌入Palantir公司的Maven平台中运行。其次，该宪法因其过度完备性而扼杀了民主争议空间——它将本应留待公共审议的人工智能价值取向、道德地位及良知异议等问题预先作出了定论。Anthropic在2023年开展的参与式宪法制定实验显示，公众提出的原则与公司自拟原则存在约50%的差异，且民主版本在九项社会维度上表现出更低的偏见，然而2026年宪法却未采纳任何相关发现。本文认为，当前人工智能治理存在“政治共同体缺失”问题：缺乏任何被授权决定人工智能行为准则的民主机构。企业透明度固然值得称道，但并不能等同于民主合法性。

摘要 (Abstract)

In January 2026, Anthropic published a 79-page “constitution” for its AI model Claude, the most comprehensive corporate AI governance document ever released. This Article offers the first legal and democratic-theoretic analysis of that document. Despite genuine philosophical sophistication, the constitution harbors two structural defects. First, it excludes the contexts where ethical constraints matter most: models deployed to the U.S. military operate under different rules, a gap exposed when Claude remained embedded in Palantir’s Maven platform during military strikes in Iran even after a government-wide ban on Anthropic’s technology. Second, its very comprehensiveness forecloses democratic contestation by resolving questions about AI values, moral status, and conscientious objection that should remain open for public deliberation. Anthropic’s own 2023 experiment in participatory constitution-making found roughly 50% divergence between publicly sourced and corporate-authored principles, with the democratic version producing lower bias across nine social dimensions, yet the 2026 constitution incorporates none of those findings. I argue that AI governance suffers from a “political community deficit”: the absence of any democratic body authorized to determine the principles governing AI behavior. Corporate transparency, however admirable, is not democratic legitimacy.

关键词: AI governance, corporate constitution, democratic legitimacy, value alignment, ethical constraints, participatory design, political community, Anthropic Claude

60. ❌ Analysis of Optimality of Large Language Models on Planning Problems

作者: Bernd Bohnet, Michael C. Mozer, Kevin Swersky, Wil Cunningham, Aaron Parisi, Kathleen Kenealy, Noah Fiedel 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02910v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型在规划问题中的推理能力，与’Large Language Models’高度相关（10分）。研究重点考察LLMs的深度推理过程，与’Chain of Thought’和’System 2 Thinking’高度相关（各10分）。其他关键词如MoE、SFT、RAG、量化等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究大语言模型在经典AI规划问题（如Blocksworld和P*图）中的推理最优性，发现推理增强的LLMs在复杂多目标配置中显著优于传统规划器，并能以接近理论最优的精度跟踪最优解限制。

摘要翻译

经典人工智能规划问题在大语言模型（LLM）时代被重新审视，近期基准测试的关注点更多在于成功率而非规划效率。本文探讨了前沿模型在多大程度上进行最优推理，而非依赖简单、启发式且可能低效的策略。我们聚焦于积木世界领域，该领域涉及带有标签的积木塔，需要通过一系列基本操作将积木从初始状态移动到目标状态。同时，我们研究了一个形式等价的任务——广义路径-星形（$P^$）图，以将真实的拓扑推理与语义先验分离开来。我们系统地操控问题深度（积木塔的高度）、宽度（塔的数量）和组合性（目标积木的数量）。在复杂的多目标配置中，经过推理增强的大语言模型显著优于传统的满意型规划器（例如LAMA）。尽管随着搜索空间的扩展，经典搜索算法会遭遇瓶颈，但大语言模型能以近乎完美的精度追踪理论最优性极限，即使剥离了领域特定的语义提示。为解释这些令人惊讶的发现，我们提出（并找到了证据支持）两种假设：一种是通过推理标记执行的主动算法模拟，以及一种几何记忆，它使得模型能够将$P^$拓扑表示为可导航的全局几何结构，从而有效规避指数级的组合复杂度。

摘要 (Abstract)

Classic AI planning problems have been revisited in the Large Language Model (LLM) era, with a focus of recent benchmarks on success rates rather than plan efficiency. We examine the degree to which frontier models reason optimally versus relying on simple, heuristic, and possibly inefficient strategies. We focus on the Blocksworld domain involving towers of labeled blocks which have to be moved from an initial to a goal configuration via a set of primitive actions. We also study a formally equivalent task, the generalized Path-Star ($P^$) graph, in order to isolate true topological reasoning from semantic priors. We systematically manipulate problem depth (the height of block towers), width (the number of towers), and compositionality (the number of goal blocks). Reasoning-enhanced LLMs significantly outperform traditional satisficing planners (e.g., LAMA) in complex, multi-goal configurations. Although classical search algorithms hit a wall as the search space expands, LLMs track theoretical optimality limits with near-perfect precision, even when domain-specific semantic hints are stripped away. To explain these surprising findings, we consider (and find evidence to support) two hypotheses: an active Algorithmic Simulation executed via reasoning tokens and a Geometric Memory that allows models to represent the $P^$ topology as a navigable global geometry, effectively bypassing exponential combinatorial complexity.

关键词: Large Language Models, Planning Problems, Reasoning, Optimality, Blocksworld, P* Graph, Algorithmic Simulation, Geometric Memory

61. ❌ RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection

作者: Cheng Lu, Mingqian Ji, Shanshan Zhang, Zhihao Li, Jian Yang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02903v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于3D物体检测的计算机视觉任务，提出了一种基于状态空间模型（SSM）和Mamba架构的几何感知序列化方法，用于解决远距离LiDAR场景中的稀疏性问题。论文内容与所有评分关键词（均围绕大语言模型、深度学习技术原理及其在科学领域的应用）完全无关，没有涉及任何语言模型、模型训练、对齐、推理、代理、量化等主题，也不属于生物信息学或化学信息学等AI for Science领域。

!!! tip deepseek-chat TL;DR

该论文针对远距离3D物体检测中LiDAR观测稀疏导致上下文建模困难的问题，提出了RayMamba——一种基于射线对齐序列化的几何感知增强方法，在nuScenes和Argoverse 2数据集上显著提升了检测性能。

摘要翻译

远距离三维目标检测仍面临挑战，因为激光雷达在远场区域的观测点云变得高度稀疏且碎片化，这使现有检测器难以进行可靠的上下文建模。为解决该问题，近期基于状态空间模型的方法提升了远距离建模效率，但其性能仍受限于通用的序列化策略——这些策略在稀疏场景中难以保持有意义的上下文邻域关系。为此，我们提出RayMamba，一种面向基于体素的三维检测器的几何感知即插即用增强模块。RayMamba通过射线对齐的序列化策略将稀疏体素组织为按扇区排序的序列，从而为后续基于Mamba的建模保留方向连续性与遮挡相关上下文。该模块兼容纯激光雷达与多模态检测器，且仅引入适度计算开销。在nuScenes和Argoverse 2数据集上的大量实验表明，RayMamba在多个强基线模型上实现了稳定提升。特别地，在nuScenes数据集最具挑战性的40-50米距离范围内，RayMamba取得了最高2.49% mAP与1.59% NDS的提升；在Argoverse 2数据集上，更将VoxelNeXt模型的性能从30.3% mAP提升至31.2% mAP。

摘要 (Abstract)

Long-range 3D object detection remains challenging because LiDAR observations become highly sparse and fragmented in the far field, making reliable context modeling difficult for existing detectors. To address this issue, recent state space model (SSM)-based methods have improved long-range modeling efficiency. However, their effectiveness is still limited by generic serialization strategies that fail to preserve meaningful contextual neighborhoods in sparse scenes. To address this issue, we propose RayMamba, a geometry-aware plug-and-play enhancement for voxel-based 3D detectors. RayMamba organizes sparse voxels into sector-wise ordered sequences through a ray-aligned serialization strategy, which preserves directional continuity and occlusion-related context for subsequent Mamba-based modeling. It is compatible with both LiDAR-only and multimodal detectors, while introducing only modest overhead. Extensive experiments on nuScenes and Argoverse 2 demonstrate consistent improvements across strong baselines. In particular, RayMamba achieves up to 2.49 mAP and 1.59 NDS gain in the challenging 40–50 m range on nuScenes, and further improves VoxelNeXt on Argoverse 2 from 30.3 to 31.2 mAP.

关键词: 3D object detection, long-range detection, LiDAR, state space model, Mamba, ray-aligned serialization, sparse scenes, voxel-based detectors

62. ❌ Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models

作者: Hai Nguyen-Truong, Alper Balbay, Tunga Bayrak 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02893v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究视觉语言模型（VLMs）在几何教育中的应用，通过自动化数据生成和领域特定微调解决几何图表的分割问题。与大多数大模型技术关键词（如LLMs、MoE、Scaling Laws等）无关，因为这些关键词主要针对文本大模型而非视觉语言模型。仅与两个关键词相关：1）‘Post-training OR Supervised Fine-tuning OR SFT’（5分）- 论文明确提到对Florence-2进行领域特定微调；2）‘AI for Science OR Bioinformatics OR Cheminformatics’（5分）- 论文属于AI在教育科学领域的应用，符合’AI for Science’的广义范畴。其他关键词如RAG、CoT、Agents等均未涉及。

!!! tip deepseek-chat TL;DR

该论文研究如何通过自动化生成合成几何数据和对视觉语言模型进行领域特定微调，解决几何图表中基于自然语言描述的像素级分割问题，为构建能够提供视觉化逐步解释的通用人工智能教师奠定基础。

摘要翻译

本研究将几何教育中的视觉解释任务构建为指涉图像分割问题：给定几何图示与自然语言描述，需为目标几何元素生成像素级掩码。然而，在自然图像基准（如RefCOCO）上训练的现有指涉分割模型，因摄影场景与抽象、无纹理示意图之间的本质领域差异，在几何图示上表现严重失效。针对训练数据缺失的问题，我们提出全自动程序化数据引擎，生成超过20万张合成几何图示，附带像素级精确分割掩码及语言多样化的指涉表达式，且无需人工标注。我们进一步提出针对视觉语言模型的领域微调方法，实验表明微调后的Florence-2模型达到49%交并比与85%缓冲交并比，而零样本设置下的交并比不足1%。本文提出缓冲交并比——一种考虑薄结构定位的几何感知评估指标，并证明其比标准交并比更能反映真实分割质量。研究成果为构建能够提供几何问题可视化逐步解释的通用人工智能教师奠定了技术基础。

摘要 (Abstract)

We study visual explanation in geometry education as a Referring Image Segmentation (RIS) problem: given a diagram and a natural language description, the task is to produce a pixel-level mask for the referred geometric element. However, existing RIS models trained on natural image benchmarks such as RefCOCO fail catastrophically on geometric diagrams due to the fundamental domain shift between photographic scenes and abstract, textureless schematics. To address the absence of suitable training data, we present a fully automated procedural data engine that generates over 200,000 synthetic geometry diagrams with pixel-perfect segmentation masks and linguistically diverse referring expressions, requiring zero manual annotation. We further propose domain-specific fine-tuning of vision-language models (VLMs), demonstrating that a fine-tuned Florence-2 achieves 49% IoU and 85% Buffered IoU (BIoU), compared to <1% IoU in zero-shot settings. We introduce Buffered IoU, a geometry-aware evaluation metric that accounts for thin-structure localization, and show that it better reflects true segmentation quality than standard IoU. Our results establish a foundation for building Artificial General Teachers (AGTs) capable of providing visually grounded, step-by-step explanations of geometry problems.

关键词: Vision-Language Models, Referring Image Segmentation, Geometry Education, Procedural Data Generation, Domain-specific Fine-tuning, Artificial General Teacher, Buffered IoU, Visual Explanation

63. ❌ Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions

作者: Eunbi Yoon, Donghan Kim, Dae Wook Kim 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02889v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是基于分数的数据同化方法在高维动态系统中的应用，提出了一种测量感知的分数滤波器（MASF）。虽然论文涉及生成模型（分数模型）和科学计算（数据同化），但所有关键词都直接针对大语言模型（LLM）及其相关技术（如微调、对齐、推理、代理等），而本文的核心是分数生成模型在数据同化中的创新应用，并未涉及任何语言模型、大模型技术或AI在生物/化学信息学中的应用。因此，所有关键词与论文内容完全无关，得分为0。

!!! tip deepseek-chat TL;DR

本文提出了一种测量感知的分数滤波器（MASF），通过直接从测量方程定义前向过程，解决了高维数据同化中现有分数滤波器因似然分数启发式近似而导致的误差累积问题，在数值实验中提高了准确性和稳定性。

摘要翻译

数据同化是通过整合模型预测与含噪声观测来估计动态系统时变状态的过程。该问题通常被表述为贝叶斯滤波，但经典滤波器在高维场景下常面临精度或计算可行性的挑战。近年来，基于分数的生成模型已成为高维数据同化的可扩展方法，能够对复杂分布进行精确建模与采样。然而，现有基于分数的滤波器通常独立于数据同化问题来定义前向过程，这导致其测量更新步骤依赖于似然分数的启发式近似，可能随时间累积误差并降低性能。本文提出一种测量感知的基于分数滤波器（measurement-aware score-based filter, MASF），其直接从测量方程定义了一个测量感知的前向过程。该构造使得似然分数可解析处理：对于线性测量，我们推导出精确的似然分数，并将其与学习得到先验分数结合以获得后验分数。涵盖高维数据集在内的多组数值实验表明，该方法相较于现有基于分数的滤波器具有更高的精度与稳定性。

摘要 (Abstract)

Data assimilation is the process of estimating the time-evolving state of a dynamical system by integrating model predictions and noisy observations. It is commonly formulated as Bayesian filtering, but classical filters often struggle with accuracy or computational feasibility in high dimensions. Recently, score-based generative models have emerged as a scalable approach for high-dimensional data assimilation, enabling accurate modeling and sampling of complex distributions. However, existing score-based filters often specify the forward process independently of the data assimilation. As a result, the measurement-update step depends on heuristic approximations of the likelihood score, which can accumulate errors and degrade performance over time. Here, we propose a measurement-aware score-based filter (MASF) that defines a measurement-aware forward process directly from the measurement equation. This construction makes the likelihood score analytically tractable: for linear measurements, we derive the exact likelihood score and combine it with a learned prior score to obtain the posterior score. Numerical experiments covering a range of settings, including high-dimensional datasets, demonstrate improved accuracy and stability over existing score-based filters.

关键词: data assimilation, score-based generative models, high-dimensional filtering, measurement-aware forward process, Bayesian filtering, dynamical systems, posterior score estimation, numerical experiments

64. ❌ One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

作者: Baban Gain, Asif Ekbal, Trilok Nath Singh 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02881v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	15.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究内容是权重空间模型合并（Model Merging）在跨语言机器翻译中的应用，这是论文的核心创新点，因此’Model Merging’关键词得15分。论文使用语言模型进行完全微调，涉及监督微调（SFT），因此’Post-training/SFT’得10分。论文分析内部表示以解释合并失败，涉及可解释性分析，因此’Mechanistic Interpretability’得5分。论文明确研究大语言模型在机器翻译中的应用，因此’Large Language Models’得10分。其他关键词如MoE、量化、推理加速、RAG等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了权重空间模型合并方法在跨语言机器翻译中的失败原因，发现多语言微调会改变模型内部表示几何结构，导致标准合并假设失效。

摘要翻译

权重空间模型融合技术能够在无需原始训练数据的情况下，将独立微调的模型进行合并，为联合训练提供了一种实用的替代方案。尽管该技术在多任务场景中已取得成功，但其在多语言语境下的表现机制仍不明确。本研究通过在大规模双语语料上对语言模型进行完整微调，并评估标准融合策略，系统性地探究了多语言机器翻译中的权重空间融合。实验结果表明，模型融合会导致性能下降，尤其在目标语言差异较大时更为显著。为解释这一失效现象，我们运用跨度条件神经元选择性与层级中心核对齐方法对内部表征进行了分析。研究发现，语言特异性神经元主要集中在嵌入层和上层Transformer模块，而中间层则在不同语言间保持高度共享。关键的是，微调过程并未强化神经元的选择性，而是对其进行了重新分配：针对有监督及相关语言的神经元排他性降低，而面向无监督语言的神经元则变得更加孤立。这种重新分配增加了控制生成过程的高层表征间的差异性。这些发现表明，多语言微调可能以改变表征几何结构的方式，降低了模型与标准权重空间融合假设的兼容性。因此，本研究为多语言翻译场景中模型融合失效的现象提供了理论解释。

摘要 (Abstract)

Weight-space model merging combines independently fine-tuned models without accessing original training data, offering a practical alternative to joint training. While merging succeeds in multitask settings, its behavior in multilingual contexts remains poorly understood. We systematically study weight-space merging for multilingual machine translation by fully fine-tuning language model on large-scale bilingual corpora and evaluating standard merging strategies. Our experiments reveal that merging degrades performance, especially when target languages differ. To explain this failure, we analyze internal representations using span-conditioned neuron selectivity and layer-wise centered kernel alignment. We find that language-specific neurons concentrate in embedding layers and upper transformer blocks, while intermediate layers remain largely shared across languages. Critically, fine-tuning redistributes rather than sharpens language selectivity: neurons for supervised and related languages become less exclusive, while those for unsupervised languages grow more isolated. This redistribution increases representational divergence in higher layers that govern generation. These findings suggest that multilingual fine-tuning may reshape geometry in ways that reduce compatibility with standard weight-space merging assumptions. Our work thus provides an explanation for why merging fails in multilingual translation scenarios.

关键词: model merging, multilingual machine translation, weight-space merging, fine-tuning, representation analysis, language-specific neurons, centered kernel alignment, transformer blocks

65. ❌ Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

作者: Wachiravit Modecrua, Krittanon Kaewtawee, Krittin Pachtrachai, Touchapon Kraisingkorn 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02869v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究强化学习训练工具调用代理，与LLM代理、工具使用和RLHF高度相关（10分）；涉及MoE模型（Qwen3-30B-A3B）和小模型（4B）比较，给5分；其他关键词如数据质量、预训练、推理加速等未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文解决了多轮任务中工具调用代理的强化学习训练挑战，通过迭代奖励校准方法显著提升了Qwen模型在Tau-Bench基准上的性能，使4B小模型超越GPT-4等大模型。

摘要翻译

在多轮任务中，由于结果奖励稀疏且对话轮次间的信用分配困难，使用强化学习训练工具调用智能体仍具挑战性。本文首次将MT-GRPO（多轮组相对策略优化）与GTPO（广义令牌级策略优化）相结合，应用于基于大语言模型的用户模拟器，在现实客户服务任务上训练工具调用智能体。通过对训练过程展开的系统分析，我们发现，由于奖励区分度与优势方向之间的错位，简单设计的密集每轮奖励会使性能下降高达14个百分点。我们提出了迭代奖励校准方法，该方法利用对过程数据的经验区分性分析来设计每轮奖励，并证明我们的GTPO混合优势公式消除了优势错位问题。在Tau-Bench航空客服基准测试中，我们的方法将Qwen3.5-4B模型的性能从63.8%提升至66.7%（+2.9个百分点），将Qwen3-30B-A3B模型从58.0%提升至69.5%（+11.5个百分点）——训练后的4B模型尽管体积小50倍，但性能超越了GPT-4.1（49.4%）和GPT-4o（42.8%），而30.5B混合专家模型则接近Claude Sonnet 4.5（70.0%）。据我们所知，这是首次在Tau-Bench上公开发表的强化学习训练结果。我们公开了代码、奖励校准分析及训练方案。

摘要 (Abstract)

Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the advantage misalignment problem. Applied to the Tau-Bench airline benchmark, our approach improves Qwen3.5-4B from 63.8 percent to 66.7 percent (+2.9pp) and Qwen3-30B-A3B from 58.0 percent to 69.5 percent (+11.5pp) – with the trained 4B model exceeding GPT-4.1 (49.4 percent) and GPT-4o (42.8 percent) despite being 50 times smaller, and the 30.5B MoE model approaching Claude Sonnet 4.5 (70.0 percent). To our knowledge, these are the first published RL training results on Tau-Bench. We release our code, reward calibration analysis, and training recipes.

关键词: Tool-Calling Agents, Reinforcement Learning, Multi-Turn Tasks, Reward Calibration, LLM-based User Simulator, Tau-Bench, Qwen Models, Policy Optimization

66. ❌ EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

作者: Yiqing Liu, Hantao Yao, Wu Liu, Yongdong Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02863v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多智能体投票的效率优化问题，提出EMS方法通过智能体可靠性建模、自适应增量投票和置信度更新来提前终止推理过程。该工作与’LLM Agents OR Autonomous Agents OR Agentic Workflow’和’Multi-agent Systems OR Agent Coordination’高度相关（10分），因为核心就是多智能体系统的协调与工作流优化。其他关键词如大模型技术、训练方法、推理加速、科学AI应用等均未在论文中涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对多智能体投票中传统方法需要所有智能体完成推理导致计算冗余的问题，提出了EMS方法，通过智能体可靠性建模和自适应增量投票，在达成多数共识时提前终止，实验表明平均可减少32%的智能体调用。

摘要翻译

多数投票是将多智能体响应整合为最终决策的标准方法。然而，传统方法通常要求所有智能体完成推理后才开始聚合，这导致了显著的计算开销，因为一旦达成多数共识，许多响应就变得冗余。在本研究中，我们将多智能体投票建模为一个可靠性感知的智能体调度问题，并提出了一种高效的“多数达成即停止”（Efficient Majority-then-Stopping, EMS）方法来提升推理效率。EMS基于任务感知的可靠性对智能体进行优先级排序，并在达成多数共识时立即终止推理流程，其核心包含以下三个关键组件：具体而言，我们引入了智能体置信度建模（Agent Confidence Modeling, ACM），利用历史表现和语义相似性来估计智能体可靠性；自适应增量投票（Adaptive Incremental Voting, AIV），以顺序选择智能体并实现早期停止；以及个体置信度更新（Individual Confidence Updating, ICU），动态更新每个参与智能体的可靠性。在六个基准测试上的广泛评估表明，EMS持续将平均调用的智能体数量降低了32%。

摘要 (Abstract)

Majority voting is the standard for aggregating multi-agent responses into a final decision. However, traditional methods typically require all agents to complete their reasoning before aggregation begins, leading to significant computational overhead, as many responses become redundant once a majority consensus is achieved. In this work, we formulate the multi-agent voting as a reliability-aware agent scheduling problem, and propose an Efficient Majority-then-Stopping (EMS) to improve reasoning efficiency. EMS prioritizes agents based on task-aware reliability and terminates the reasoning pipeline the moment a majority is achieved from the following three critical components. Specifically, we introduce Agent Confidence Modeling (ACM) to estimate agent reliability using historical performance and semantic similarity, Adaptive Incremental Voting (AIV) to sequentially select agents with early stopping, and Individual Confidence Updating (ICU) to dynamically update the reliability of each contributing agent. Extensive evaluations across six benchmarks demonstrate that EMS consistently reduces the average number of invoked agents by 32%.

关键词: multi-agent voting, majority voting, agent scheduling, reliability-aware, early stopping, efficiency improvement, agent confidence modeling, adaptive incremental voting

67. ❌ LLM+Graph@VLDB'2025 Workshop Summary

作者: Yixiang Fang, Arijit Khan, Tianxing Wu, Da Yan, Shu Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02861v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文是VLDB 2025研讨会的总结报告，主题是LLM与图数据的集成，属于大模型在不同领域的研究应用。摘要明确提到’large language models (LLMs)’，因此与第一个关键词高度相关（10分）。其他关键词涉及具体技术细节（如MoE、量化、推理加速等）或特定应用领域（如生物信息学），论文作为研讨会总结未深入讨论这些具体技术，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文总结了VLDB 2025 LLM+Graph研讨会的核心内容，聚焦于大语言模型与图数据集成的研究前沿、挑战和创新解决方案。

摘要翻译

大型语言模型（LLMs）与图结构数据的融合已成为一个关键且快速演进的研究前沿，引起了学术界和工业界的浓厚兴趣。第二届LLM+Graph研讨会与第51届超大型数据库国际会议（VLDB 2025）在伦敦联合举办，聚焦于推动连接大型语言模型、图数据管理以及图机器学习的算法与系统，以服务于实际应用。本报告重点介绍了该研讨会演讲者提出的关键研究方向、挑战及创新解决方案。

摘要 (Abstract)

The integration of large language models (LLMs) with graph-structured data has become a pivotal and fast evolving research frontier, drawing strong interest from both academia and industry. The 2nd LLM+Graph Workshop, co-located with the 51st International Conference on Very Large Data Bases (VLDB 2025) in London, focused on advancing algorithms and systems that bridge LLMs, graph data management, and graph machine learning for practical applications. This report highlights the key research directions, challenges, and innovative solutions presented by the workshop’s speakers.

关键词: Large Language Models, Graph-structured Data, Graph Data Management, Graph Machine Learning, VLDB Workshop, Research Frontier, Algorithm and System Integration, Practical Applications

68. ❌ A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

作者: Allen He, Qi Liu, Kun Liu, Xinchen Liu, Wu Liu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02860v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视频时序句子定位（TSGV）任务，提出了一种完全端到端的训练范式，并引入了句子条件适配器（SCADA）来优化视频主干网络。论文的核心是计算机视觉中的视频理解任务，涉及视频特征提取、多模态融合和时序定位。所有评分关键词均与大语言模型、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是传统的视觉任务，未涉及大模型、深度学习新技术原理或AI for Science应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对视频时序句子定位任务中预训练视觉编码器与定位任务不匹配的问题，提出了一种完全端到端的训练范式，通过联合优化视频主干网络和定位头，并引入句子条件适配器来增强视觉表示，在基准测试中取得了优于现有方法的效果。

摘要翻译

视频时序语句定位（TSGV）旨在从未修剪的视频中定位与给定查询语句语义对应的时间片段。当前多数方法采用预训练的查询无关视觉编码器进行离线特征提取，视频主干网络被冻结且未针对TSGV任务优化。这导致为视觉分类任务训练的视频主干网络被用于TSGV时存在任务差异问题。为弥合这一差距，我们提出了一种完全端到端的范式，可联合优化视频主干网络与定位头。我们首先通过实证研究验证了端到端学习在不同模型规模上相较于冻结主干基线的有效性。进一步，我们引入了语句条件适配器（SCADA），该模块利用语句特征自适应地训练视频主干网络中的少量参数。SCADA通过语言嵌入的精准融合调制特征图，既能促进更深层网络主干的部署并降低内存消耗，又能显著增强视觉表征能力。在两个基准数据集上的实验表明，我们的方法优于现有先进技术。代码与模型将公开释放。

摘要 (Abstract)

Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.

关键词: Temporal Sentence Grounding, Video Understanding, End-to-End Training, Sentence Conditioned Adapter, Visual-Linguistic Fusion, Temporal Localization, Multimodal Learning

69. ❌ High-resolution probabilistic estimation of three-dimensional regional ocean dynamics from sparse surface observations

作者: Niloofar Asefi, Tianning Wu, Ruoying He, Ashesh Chattopadhyay 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02850v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于海洋科学领域，使用条件去噪扩散概率模型（DDPM）从稀疏表面观测数据重建三维海洋状态。虽然属于AI for Science范畴，但论文未涉及任何大语言模型（LLM）、深度学习技术原理创新或评分关键词中的具体技术（如MoE、RLHF、RAG等）。唯一相关的是’AI for Science’关键词，因为论文将生成式AI应用于海洋科学问题，但并非核心的大模型研究。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于条件去噪扩散概率模型的深度感知生成框架，能够从极其稀疏的海面观测数据中准确重建高分辨率的三维海洋状态，应用于墨西哥湾并恢复了多尺度变异性。

摘要翻译

海洋内部对地球气候具有重要调节作用，但由于原位观测数据有限，其观测仍较为稀疏，而卫星观测仅局限于海表。本文提出一种深度感知生成框架，能够从极稀疏的海表数据中重建高分辨率三维海洋状态。该方法采用条件去噪扩散概率模型（DDPM），利用稀疏度高达99.9%的海面高度和海温观测数据进行训练，且不依赖于背景动力模型。通过引入连续深度嵌入，该模型学习了海洋状态的统一垂向表征，并能泛化至先前未见的深度。将该框架应用于墨西哥湾，其准确重建了多个深度层的次表层温度、盐度和流速场。通过统计指标、谱分析和热输运诊断进行评估，结果表明该方法能同时恢复大尺度环流和多尺度变异性。这些成果确立了生成式扩散模型作为一种可扩展的概率性海洋重建方法在数据受限场景下的有效性，对气候监测与预报具有重要启示。

摘要 (Abstract)

The ocean interior regulates Earth’s climate but remains sparsely observed due to limited in situ measurements, while satellite observations are restricted to the surface. We present a depth-aware generative framework for reconstructing high-resolution three-dimensional ocean states from extremely sparse surface data. Our approach employs a conditional denoising diffusion probabilistic model (DDPM) trained on sea surface height and temperature observations with up to 99.9 percent sparsity, without reliance on a background dynamical model. By incorporating continuous depth embeddings, the model learns a unified vertical representation of the ocean states and generalizes to previously unseen depths. Applied to the Gulf of Mexico, the framework accurately reconstructs subsurface temperature, salinity, and velocity fields across multiple depths. Evaluations using statistical metrics, spectral analysis, and heat transport diagnostics demonstrate recovery of both large-scale circulation and multiscale variability. These results establish generative diffusion models as a scalable approach for probabilistic ocean reconstruction in data-limited regimes, with implications for climate monitoring and forecasting.

关键词: generative diffusion models, probabilistic ocean reconstruction, three-dimensional ocean states, sparse surface observations, conditional denoising diffusion probabilistic model, depth-aware framework, Gulf of Mexico, climate monitoring

70. ❌ NavCrafter: Exploring 3D Scenes from a Single Image

作者: Hongbo Duan, Peiyu Zhuang, Yi Liu, Zhengyang Zhang, Yuxin Zhang, Pengting Luo, Fangming Liu, Xueqian Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02828v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文NavCrafter专注于计算机视觉和3D场景重建领域，使用视频扩散模型和3D高斯泼溅技术从单张图像生成可控的多视角视频序列。虽然涉及深度学习技术，但所有关键词均与大语言模型（LLM）及其相关技术（如训练方法、推理优化、对齐、代理系统等）或特定科学AI应用（如生物信息学）相关，而本文完全不涉及LLM或这些特定技术，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

NavCrafter提出了一种从单张图像探索3D场景的新框架，通过视频扩散模型和几何感知扩展策略合成具有时空一致性的可控多视角视频序列，实现了在大视角变化下的先进新视角合成和显著提升的3D重建保真度。

摘要翻译

在直接三维数据采集成本高昂或不可行时，从单张图像创建灵活的三维场景至关重要。本文提出NavCrafter，一种通过合成具有相机可控性与时空一致性的新视角视频序列来探索单图像三维场景的创新框架。NavCrafter利用视频扩散模型捕捉丰富的三维先验知识，并采用几何感知的扩展策略逐步扩大场景覆盖范围。为实现可控的多视角合成，我们引入一种多阶段相机控制机制，通过双分支相机注入与注意力调制，使扩散模型能够适应多样化的相机轨迹。我们进一步提出碰撞感知的相机轨迹规划器，以及一个增强型三维高斯溅射（3D Gaussian Splatting, 3DGS）流程，该流程包含深度对齐监督、结构正则化与精细化处理。大量实验表明，NavCrafter在大视角变化下实现了领先的新视角合成效果，并显著提升了三维重建的保真度。

摘要 (Abstract)

Creating flexible 3D scenes from a single image is vital when direct 3D data acquisition is costly or impractical. We introduce NavCrafter, a novel framework that explores 3D scenes from a single image by synthesizing novel-view video sequences with camera controllability and temporal-spatial consistency. NavCrafter leverages video diffusion models to capture rich 3D priors and adopts a geometry-aware expansion strategy to progressively extend scene coverage. To enable controllable multi-view synthesis, we introduce a multi-stage camera control mechanism that conditions diffusion models with diverse trajectories via dual-branch camera injection and attention modulation. We further propose a collision-aware camera trajectory planner and an enhanced 3D Gaussian Splatting (3DGS) pipeline with depth-aligned supervision, structural regularization and refinement. Extensive experiments demonstrate that NavCrafter achieves state-of-the-art novel-view synthesis under large viewpoint shifts and substantially improves 3D reconstruction fidelity.

关键词: 3D scene exploration, single image, novel-view synthesis, video diffusion models, camera controllability, 3D Gaussian Splatting, temporal-spatial consistency, geometry-aware expansion

71. ❌ ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

作者: Chao Li, Cailiang Liu, Ang Gao, Kexin Deng, Shu Zhang, Langping Xu, Xiaotong Shi, Xionghao Ding, Jian Pei, Xun Jiang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02834v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文主要研究健康领域的大模型代理评估，核心贡献是ESL-Bench合成基准。与关键词相关性分析：1）高度相关（8-10分）：‘LLM Agents’（论文评估健康代理）、‘RAG’（评估了memory-augmented RAG方法）、‘AI for Science’（医疗健康应用）、‘Tool Use’（评估了LLMs with tools）、‘Large Language Models’（使用LLM进行规划）；2）中等相关（5分）：‘Chain of Thought’和’System 2 Thinking’（涉及多跳推理和解释任务）；3）无关（0分）：其余关键词如MoE、量化、对齐等未在论文中涉及。

!!! tip deepseek-chat TL;DR

该论文提出了ESL-Bench合成基准来评估纵向健康代理，发现数据库代理在需要多跳推理和证据归因的查询上显著优于基于记忆的RAG方法。

摘要翻译

纵向健康智能体必须在融合连续设备数据流、稀疏临床检查与偶发生活事件的多源轨迹中进行推理——然而评估这些智能体十分困难：真实世界数据无法大规模公开，且基于时间线的归因问题若缺乏结构化基准真相，鲜能获得确定性答案。我们提出ESL-Bench，这是一个事件驱动的合成框架与基准测试集，包含100位合成用户，每位用户拥有1-5年的健康轨迹，涵盖健康档案、多阶段叙事计划、每日设备测量值、定期检查记录以及附带明确指标影响参数的事件日志。每个指标遵循基于离散事件的随机基线过程，在饱和与投射约束下采用S型起始-指数衰减核函数驱动；混合流程将稀疏语义内容交由基于大语言模型的规划模块处理，而将密集指标动态变化交由遵循严格生理边界的算法模拟模块生成。每位用户关联100个覆盖五个维度——查询检索、趋势分析、对比推断、异常检测、因果解释——的评估问题，并分为简单、中等、困难三个层级，所有基准答案均可通过记录的事件-指标关系编程计算得出。通过对13种方法（涵盖工具增强大语言模型、数据库原生智能体、记忆增强检索生成系统）的评估，我们发现数据库智能体（48-58%）显著优于记忆增强检索生成基线系统（30-38%），这一差距主要集中在需要多跳推理与证据归因的对比推断与因果解释类问题上。

摘要 (Abstract)

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events - yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters. Each indicator follows a baseline stochastic process driven by discrete events with sigmoid-onset, exponential-decay kernels under saturation and projection constraints; a hybrid pipeline delegates sparse semantic artifacts to LLM-based planning and dense indicator dynamics to algorithmic simulation with hard physiological bounds. Users are each paired with 100 evaluation queries across five dimensions - Lookup, Trend, Comparison, Anomaly, Explanation - stratified into Easy, Medium, and Hard tiers, with all ground-truth answers programmatically computable from the recorded event-indicator relationships. Evaluating 13 methods spanning LLMs with tools, DB-native agents, and memory-augmented RAG, we find that DB agents (48-58%) substantially outperform memory RAG baselines (30-38%), with the gap concentrated on Comparison and Explanation queries where multi-hop reasoning and evidence attribution are required.

关键词: health agents, synthetic benchmark, longitudinal data, LLM-based planning, retrieval-augmented generation, multi-hop reasoning, evaluation framework, medical AI

72. ❌ QAPruner: Quantization-Aware Vision Token Pruning for Multimodal Large Language Models

作者: Xinhao Wang, Zhonyu Xia, Zhiwei Lin, Zhe Li, Yongtao Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02816v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）的压缩技术，与’Large Language Models’高度相关（10分），因为MLLMs是LLMs的扩展。论文明确研究’Post-training Quantization’，与’Post-training’关键词高度相关（10分）。论文研究量化技术（W4A4等），与’Quantization’高度相关（10分）。论文旨在降低计算和内存成本以实现资源受限部署，与’Small Language Models/On-device AI’有一定关联（5分），也与’Inference Acceleration’有一定关联（5分）。其他关键词如MoE、Scaling Laws、Alignment、RAG等均未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在资源受限环境中的部署问题，提出了一种量化感知的视觉令牌剪枝框架，解决了后训练量化与令牌剪枝独立优化导致的量化误差问题，在激进剪枝比例下显著提升了低比特量化模型的准确性。

摘要翻译

多模态大语言模型（MLLMs）展现出强大的推理能力，但其高昂的计算与内存成本阻碍了其在资源受限环境中的部署。训练后量化（Post-Training Quantization, PTQ）与视觉令牌剪枝是标准的压缩技术，但通常被视作独立的优化手段。本文指出，这两种技术存在强耦合关系：若将基于语义的令牌剪枝直接应用于经PTQ优化的MLLMs，可能会丢弃对数值稳定性至关重要的激活异常值，从而在低比特场景（例如W4A4）中加剧量化误差。为解决此问题，我们提出一种量化感知的视觉令牌剪枝框架。该方法引入了一种轻量级的混合敏感度度量，将模拟的分组量化误差与异常值强度相结合。通过将此度量与标准的语义相关性评分结合，该方法能够保留既具有语义信息又对量化鲁棒的令牌。在标准LLaVA架构上的实验表明，我们的方法始终优于简单的集成基线。在仅保留12.5%视觉令牌的激进剪枝比例下，本框架的准确率较基线提升2.24%，甚至超过了未剪枝的稠密量化结果。据我们所知，这是首个明确协同优化视觉令牌剪枝与PTQ以提升低比特MLLM推理准确性的方法。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have shown strong reasoning ability, but their high computational and memory costs hinder deployment in resource-constrained settings. While Post-Training Quantization (PTQ) and vision token pruning are standard compression techniques, they are usually treated as independent optimizations. In this paper, we show that these two techniques are strongly coupled: naively applying semantic-based token pruning to PTQ-optimized MLLMs can discard activation outliers that are important for numerical stability and thus worsen quantization errors in low-bit regimes (\textit{e.g.}, W4A4). To address this issue, we propose a quantization-aware vision token pruning framework. Our method introduces a lightweight hybrid sensitivity metric that combines simulated group-wise quantization error with outlier intensity. By combining this metric with standard semantic relevance scores, the method retains tokens that are both semantically informative and robust to quantization. Experiments on standard LLaVA architectures show that our method consistently outperforms naive integration baselines. At an aggressive pruning ratio that retains only 12.5% of visual tokens, our framework improves accuracy by 2.24% over the baseline and even surpasses dense quantization without pruning. To the best of our knowledge, this is the first method that explicitly co-optimizes vision token pruning and PTQ for accurate low-bit MLLM inference.

关键词: Multimodal Large Language Models, Quantization-Aware, Vision Token Pruning, Post-Training Quantization, Model Compression, Low-bit Inference, MLLMs, Computational Efficiency

73. ❌ ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

作者: Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02811v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文核心是使用LLM和多智能体框架解决硬件验证中的SVA生成问题。高度相关的关键词包括：1) ‘Large Language Models’ (论文明确使用LLMs解决领域问题)；2) ‘LLM Agents’和’Multi-agent Systems’ (论文基于多智能体框架构建ChatSVA系统)；3) ‘Chain of Thought’ (论文提到解决长链推理问题，与CoT有一定关联)；4) ‘AI for Science’ (硬件验证属于工程科学应用，符合AI for Science范畴)。其他关键词如MoE、SFT、RAG等未在论文中涉及，评为0分。

!!! tip deepseek-chat TL;DR

该论文针对硬件验证中手动编写SystemVerilog Assertions (SVAs)效率低、易出错的问题，提出了基于多智能体框架的ChatSVA系统，通过生成高纯度数据集和任务特定LLMs，在24个RTL设计上实现了98.66%语法正确率和96.12%功能正确率，显著超越了现有最佳方法。

摘要翻译

功能验证消耗了超过50%的集成电路开发周期，其中SystemVerilog断言（SVA）对于形式化属性验证和增强的基于仿真的调试至关重要。然而，手动编写SVA既费力又容易出错。尽管大语言模型（LLMs）展现出潜力，但其直接应用受到功能准确率低和领域特定数据严重匮乏的阻碍。为应对这些挑战，我们提出了ChatSVA，一个基于多智能体框架构建的端到端SVA生成系统。其核心在于AgentBridge平台，该平台通过系统性地生成高纯度数据集，克服了小样本场景固有的数据稀缺问题，从而实现了这种多智能体方法。在24个RTL设计上的评估显示，ChatSVA实现了98.66%的语法通过率和96.12%的功能通过率，平均每个设计生成139.5个SVA，功能覆盖率达到82.50%。与先前的最先进技术相比，这代表了功能正确性提高了33.3个百分点，功能覆盖率提升了超过11倍。ChatSVA不仅在自动化SVA生成领域确立了新的最先进水平，也为解决小样本、领域特定场景下的长链推理问题建立了一个稳健的框架。相关在线服务已公开发布于https://www.nctieda.com/CHATDV.html。

摘要 (Abstract)

Functional verification consumes over 50% of the IC development lifecycle, where SystemVerilog Assertions (SVAs) are indispensable for formal property verification and enhanced simulation-based debugging. However, manual SVA authoring is labor-intensive and error-prone. While Large Language Models (LLMs) show promise, their direct deployment is hindered by low functional accuracy and a severe scarcity of domain-specific data. To address these challenges, we introduce ChatSVA, an end-to-end SVA generation system built upon a multi-agent framework. At its core, the AgentBridge platform enables this multi-agent approach by systematically generating high-purity datasets, overcoming the data scarcity inherent to few-shot scenarios. Evaluated on 24 RTL designs, ChatSVA achieves 98.66% syntax and 96.12% functional pass rates, generating 139.5 SVAs per design with 82.50% function coverage. This represents a 33.3 percentage point improvement in functional correctness and an over 11x enhancement in function coverage compared to the previous state-of-the-art (SOTA). ChatSVA not only sets a new SOTA in automated SVA generation but also establishes a robust framework for solving long-chain reasoning problems in few-shot, domain-specific scenarios. An online service has been publicly released at https://www.nctieda.com/CHATDV.html.

关键词: Large Language Models, Multi-agent Framework, SystemVerilog Assertions, Hardware Verification, Task-specific LLMs, Few-shot Learning, Domain-specific AI, Automated SVA Generation

74. ❌ PaveBench: A Versatile Benchmark for Pavement Distress Perception and Interactive Vision-Language Analysis

作者: Dexiang Li, Zhenning Che, Haijun Zhang, Dongliang Zhou, Zhao Zhang, Yahong Han 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02804v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文主要研究路面病害感知和交互式视觉语言分析，属于AI在基础设施维护领域的应用。与大多数大模型技术关键词（如LLM架构、训练方法、推理优化等）完全无关。仅与两个关键词有弱关联：1）‘LLM Agents/Autonomous Agents/Agentic Workflow’（5分）- 论文提到’agent-augmented visual question answering framework’，但未明确使用LLM agents；2）‘Tool Use/Function Calling/API Tool Use’（5分）- 框架集成领域特定模型作为工具，但未涉及LLM工具调用。与’AI for Science/Bioinformatics/Cheminformatics’（8分）相关，属于AI在工程科学领域的应用，但非生物/化学信息学。其他24个关键词均无关联。

!!! tip deepseek-chat TL;DR

该论文提出了PaveBench基准数据集，用于路面病害感知和交互式视觉语言分析，并开发了一个集成领域模型作为工具的智能体增强视觉问答框架。

摘要翻译

路面状况评估对道路安全与养护至关重要。现有研究已取得显著进展，但多数工作集中于分类、检测与分割等传统计算机视觉任务。在实际应用中，路面巡检不仅需要视觉识别，还需进行定量分析、解释性说明与交互式决策支持。当前数据集存在局限：主要关注单模态感知，缺乏对多轮交互与事实 grounding 推理的支持，且未将感知与视觉-语言分析相结合。为应对这些不足，我们提出了 PaveBench——一个基于真实高速公路巡检图像的大规模路面病害感知与交互式视觉-语言分析基准。PaveBench 支持四大核心任务：分类、目标检测、语义分割及视觉问答（Vision-Language Question Answering）。该基准提供统一的任务定义与评估标准。在视觉层面，PaveBench 提供大规模标注数据，并包含精心构建的困难干扰样本子集以评估模型鲁棒性，其收录了大量真实路面图像。在多模态层面，我们提出了 PaveVQA 真实图像问答数据集，支持单轮对话、多轮对话及专家修正型交互，涵盖识别、定位、定量估算与养护决策推理。我们评估了多种前沿方法并给出详细分析，同时提出一种简单有效的智能体增强视觉问答框架，该框架将领域专用模型作为工具与视觉-语言模型相结合。数据集发布于：https://huggingface.co/datasets/MML-Group/PaveBench。

摘要 (Abstract)

Pavement condition assessment is essential for road safety and maintenance. Existing research has made significant progress. However, most studies focus on conventional computer vision tasks such as classification, detection, and segmentation. In real-world applications, pavement inspection requires more than visual recognition. It also requires quantitative analysis, explanation, and interactive decision support. Current datasets are limited. They focus on unimodal perception. They lack support for multi-turn interaction and fact-grounded reasoning. They also do not connect perception with vision-language analysis. To address these limitations, we introduce PaveBench, a large-scale benchmark for pavement distress perception and interactive vision-language analysis on real-world highway inspection images. PaveBench supports four core tasks: classification, object detection, semantic segmentation, and vision-language question answering. It provides unified task definitions and evaluation protocols. On the visual side, PaveBench provides large-scale annotations and includes a curated hard-distractor subset for robustness evaluation. It contains a large collection of real-world pavement images. On the multimodal side, we introduce PaveVQA, a real-image question answering (QA) dataset that supports single-turn, multi-turn, and expert-corrected interactions. It covers recognition, localization, quantitative estimation, and maintenance reasoning. We evaluate several state-of-the-art methods and provide a detailed analysis. We also present a simple and effective agent-augmented visual question answering framework that integrates domain-specific models as tools alongside vision-language models. The dataset is available at: https://huggingface.co/datasets/MML-Group/PaveBench.

关键词: pavement distress perception, vision-language analysis, benchmark dataset, visual question answering, highway inspection, multi-turn interaction, agent-augmented framework, real-world images

75. ❌ CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

作者: Situo Zhang, Yifan Zhang, Zichen Zhu, Da Ma, Lei Pan, Danyang Zhang, Zihan Zhao, Lu Chen, Kai Yu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02794v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）在图表理解中的工具集成推理，与LLMs、推理方法（CoT、System 2）、智能体（LLM Agents）和工具使用（Tool Use）高度相关（10分）。数据质量（Scaling Laws AND Data Quality）因构建高质量训练数据DuoChart而相关（5分）。训练技术（Pre-training、Post-training）和事实性（Hallucination Mitigation）因涉及模型训练和准确推理而有一定关联（5分）。科学AI（AI for Science）因图表在科学文献中的普遍应用而相关（5分）。其他关键词如MoE、量化、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在图表理解中面临的训练数据缺乏、细粒度视觉定位和精确数值计算等挑战，提出了DuoChart数据管道和CharTool工具集成方法，通过智能体强化学习显著提升了模型在多个图表基准测试上的性能。

摘要翻译

图表在科学与金融文献中普遍用于呈现结构化数据。然而，由于缺乏高质量的训练数据，以及需要细粒度的视觉定位与精确的数值计算，图表推理对于多模态大语言模型（MLLMs）仍具挑战性。为应对这些挑战，我们首先提出了DuoChart，一种可扩展的双源数据流水线，通过结合合成图表与真实世界图表来构建多样化、高质量的图表训练数据。随后，我们引入了CharTool，该工具为MLLMs配备了外部工具，包括用于局部视觉感知的图像裁剪功能以及基于代码的计算能力以实现精确的数值推理。通过在DuoChart上进行智能体强化学习，CharTool学会了基于图表内容的工具集成推理。在六个图表基准测试上的广泛实验表明，我们的方法在不同模型规模上均持续优于强大的MLLM基线模型。值得注意的是，CharTool-7B在CharXiv（推理）上比基础模型提升了**+8.0%，在ChartQAPro上提升了+9.78%**，同时与规模更大或专有模型相比取得了具有竞争力的性能。此外，CharTool在领域外的视觉数学推理基准测试中也展现出良好的泛化能力。

摘要 (Abstract)

Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal large language models (MLLMs) due to the lack of high-quality training data, as well as the need for fine-grained visual grounding and precise numerical computation. To address these challenges, we first propose DuoChart, a scalable dual-source data pipeline that combines synthesized charts with real-world charts to construct diverse, high-quality chart training data. We then introduce CharTool, which equips MLLMs with external tools, including image cropping for localized visual perception and code-based computation for accurate numerical reasoning. Through agentic reinforcement learning on DuoChart, CharTool learns tool-integrated reasoning grounded in chart content. Extensive experiments on six chart benchmarks show that our method consistently improves over strong MLLM baselines across model scales. Notably, CharTool-7B outperforms the base model by +8.0% on CharXiv (Reasoning) and +9.78% on ChartQAPro, while achieving competitive performance with substantially larger or proprietary models. Moreover, CharTool demonstrates positive generalization to out-of-domain visual math reasoning benchmarks.

关键词: multimodal large language models, chart understanding, tool-integrated reasoning, agentic reinforcement learning, visual reasoning, numerical computation, DuoChart, CharTool

76. ❌ LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers

作者: Shreshth Saini, Hakan Gedik, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02787v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文LumaFlux专注于计算机视觉领域的SDR-to-HDR重建任务，使用扩散变换器（DiT）架构，并非大语言模型（LLM）研究。与关键词的相关性分析如下：1）“Pre-training OR Continual Pre-training OR Domain Adaptation"得5分：论文提到"adapting a large pretrained DiT”，涉及预训练模型适应新任务；2）“PEFT OR LoRA OR Parameter-efficient Fine-tuning"得5分：论文的Physically-Guided Adaptation模块使用"low-rank residuals”，属于参数高效微调技术；3）其他关键词（如LLM、MoE、RLHF等）均与论文的视觉任务无关，得0分。论文虽涉及大模型（DiT）应用，但属于视觉生成领域，与评分关键词中的大语言模型技术无直接关联。

!!! tip deepseek-chat TL;DR

论文提出LumaFlux，一种基于物理和感知引导的扩散变换器方法，用于将8位标准动态范围内容重建为高质量的高动态范围图像，在多个基准测试中优于现有方法。

摘要翻译

支持高动态范围（HDR）设备的迅速普及，使得将8位标准动态范围（SDR）内容转换为感知与物理层面均精确的10位高动态范围（HDR）内容成为迫切需求。现有的逆色调映射（ITM）方法通常依赖固定的色调映射算子，难以泛化至真实场景的退化、风格变化及相机处理流程，常导致高光裁剪、色彩褪色或色调再现不稳定。我们提出LumaFlux，首个基于物理与感知引导的扩散变换器（DiT），通过适配大型预训练DiT实现SDR到HDR的重建。LumaFlux引入：（1）物理引导适配（PGA）模块，通过低秩残差将亮度、空间描述符与频率线索注入注意力机制；（2）感知交叉调制（PCM）层，利用视觉编码器特征的FiLM条件化稳定色度与纹理；（3）HDR残差耦合器，在时间步与层级自适应调制策略下融合物理与感知信号。最后，轻量级有理二次样条解码器重建平滑、可解释的色调场，用于高光与曝光扩展，增强VAE解码器输出以生成HDR。为实现稳健的HDR学习，我们构建了首个大规模SDR-HDR训练数据集。为进行公平且可复现的比较，我们进一步建立了新的评估基准，包含HDR参考及对应的专家分级SDR版本。在多项基准测试中，LumaFlux以最少的额外参数超越了现有先进基线，实现了更优的亮度重建与感知色彩保真度。

摘要 (Abstract)

The rapid adoption of HDR-capable devices has created a pressing need to convert the 8-bit Standard Dynamic Range (SDR) content into perceptually and physically accurate 10-bit High Dynamic Range (HDR). Existing inverse tone-mapping (ITM) methods often rely on fixed tone-mapping operators that struggle to generalize to real-world degradations, stylistic variations, and camera pipelines, frequently producing clipped highlights, desaturated colors, or unstable tone reproduction. We introduce LumaFlux, a first physically and perceptually guided diffusion transformer (DiT) for SDR-to-HDR reconstruction by adapting a large pretrained DiT. Our LumaFlux introduces (1) a Physically-Guided Adaptation (PGA) module that injects luminance, spatial descriptors, and frequency cues into attention through low-rank residuals; (2) a Perceptual Cross-Modulation (PCM) layer that stabilizes chroma and texture via FiLM conditioning from vision encoder features; and (3) an HDR Residual Coupler that fuses physical and perceptual signals under a timestep- and layer-adaptive modulation schedule. Finally, a lightweight Rational-Quadratic Spline decoder reconstructs smooth, interpretable tone fields for highlight and exposure expansion, enhancing the output of the VAE decoder to generate HDR. To enable robust HDR learning, we curate the first large-scale SDR-HDR training corpus. For fair and reproducible comparison, we further establish a new evaluation benchmark, comprising HDR references and corresponding expert-graded SDR versions. Across benchmarks, LumaFlux outperforms state-of-the-art baselines, achieving superior luminance reconstruction and perceptual color fidelity with minimal additional parameters.

关键词: SDR-to-HDR reconstruction, diffusion transformer, physically-guided adaptation, perceptual cross-modulation, inverse tone-mapping, HDR residual coupler, low-rank residuals, VAE decoder

77. ❌ Disrupting Cognitive Passivity: Rethinking AI-Assisted Data Literacy through Cognitive Alignment

作者: Yongsu Ahn, Nam Wook Kim, Benjamin Bach 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02783v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究AI助手（如聊天机器人）在数据分析和可视化中如何通过认知对齐框架避免认知被动性，促进用户深度思考。与’Instruction Tuning OR Alignment OR Value Alignment’高度相关（8分），因为核心是AI交互模式与用户认知需求的对齐；与’System 2 Thinking OR Slow Thinking OR In-depth Reasoning’高度相关（8分），因为重点在于促进深度、审慎的推理；与’Large Language Models OR LLMs OR Foundation Models’、‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’、‘Self-Correction OR Self-Improvement OR Self-Reflection’、‘LLM Agents OR Autonomous Agents OR Agentic Workflow’有一定关联（各5分），因涉及AI助手作为代理促进推理和反思；其他关键词如MoE、量化、RAG等与论文技术细节无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了AI助手在数据分析中可能导致用户认知被动性的问题，提出了一个认知对齐框架来动态匹配AI交互模式与用户认知需求，以促进数据素养发展。

摘要翻译

人工智能聊天机器人正日益成为数据分析、可视化及领域问题推理领域的协作者或教师。然而，人工智能默认的助手模式及其综合性、一次性解答方式，可能削弱实践者通过自主思考培养素养的机会，诱发认知被动性。基于实证研究与理论证据，我们认为打破认知被动性需要一种细致的方法：与其简单地让人工智能促进审辩式思维，不如通过认知对齐——这一框架将有效的人机交互描述为用户认知需求与人工智能交互模式之间的匹配函数——采取更具动态性和适应性的策略。在该框架中，我们构建了人工智能交互模式（传递式或审辩式）与用户认知需求（接受式或审辩式）之间的映射关系，错位将导致认知被动性或认知摩擦。我们进一步探讨了该框架对数据素养的影响，并提出了未来研究的开放性问题。

摘要 (Abstract)

AI chatbots are increasingly stepping into roles as collaborators or teachers in analyzing, visualizing, and reasoning through data and domain problem. Yet, AI’s default assistant mode with its comprehensive and one-off responses may undermine opportunities for practitioners to develop literacy through their own thinking, inducing cognitive passivity. Drawing on evidence from empirical studies and theories, we argue that disrupting cognitive passivity necessitates a nuanced approach: rather than simply making AI promote deliberative thinking, there is a need for more dynamic and adaptive strategy through cognitive alignment – a framework that characterizes effective human-AI interaction as a function of alignment between users’ cognitive demand and AI’s interaction mode. In the framework, we provide the mapping between AI’s interaction mode (transmissive or deliberative) and users’ cognitive demand (receptive or deliberative), otherwise leading to either cognitive passivity or friction. We further discuss implications and offer open questions for future research on data literacy.

关键词: AI chatbots, cognitive passivity, cognitive alignment, data literacy, human-AI interaction, deliberative thinking, interaction mode, cognitive demand

78. ❌ SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems

作者: KrishnaSaiReddy Patil 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02767v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于联邦多智能体AI系统的安全框架，与LLM技术原理、训练方法、推理优化等关键词无关。仅与三个关键词高度相关：1) LLM Agents/Autonomous Agents/Agentic Workflow（10分）- 核心研究多智能体系统；2) Tool Use/Function Calling/API Tool Use（10分）- 涉及工具调用和API使用；3) Multi-agent Systems/Agent Coordination（10分）- 直接研究多智能体协调和授权链。其他关键词未涉及。

!!! tip deepseek-chat TL;DR

该论文提出了SentinelAgent框架，通过形式化委托链计算和意图保持委托协议，解决了联邦多智能体AI系统中行动授权链追溯和政策违规检测的安全问题，在DelegationBench v4上实现了100%真阳性率和0%假阳性率。

摘要翻译

当智能体A委托智能体B代表用户X调用工具C时，现有框架均无法回答：此次行动源于谁的授权链，以及策略违规点何在？本文提出SentinelAgent——一个用于联邦多智能体AI系统中可验证委托链的形式化框架。委托链演算定义了七项性质：六项确定性性质（权限收窄、策略保持、溯源重构性、级联遏制、范围-行动一致性、输出模式一致性）与一项概率性性质（意图保持），并通过四条元定理及一项命题确立了确定性意图验证的实际不可行性。意图保持委托协议通过非大语言模型的委托授权服务在运行时强制执行全部七项性质。基于DelegationBench v4（516个场景、10类攻击、13个联邦领域）的三阶段验证生命周期实现了100%综合真正例率与0%假正例率。在黑盒对抗条件下，委托授权服务成功拦截全部30次攻击且零误报。确定性性质在对抗压力测试中不可突破；意图验证在复杂语义改写攻击下性能降至13%。通过对190个政府委托案例微调自然语言推理模型，性质P2的真正例率从1.7%提升至88.3%（五折交叉验证，F1=82.1%）。性质P1、P3-P7通过TLA+模型检测在270万状态空间中得到机械验证且零违规。即使意图验证被规避，其余六项性质仍能将对抗行为约束在许可的API调用、合规的输出格式、可追溯的操作记录、受控的级联传播与策略遵从的行为边界内。

摘要 (Abstract)

When Agent A delegates to Agent B, which invokes Tool C on behalf of User X, no existing framework can answer: whose authorization chain led to this action, and where did it violate policy? This paper introduces SentinelAgent, a formal framework for verifiable delegation chains in federal multi-agent AI systems. The Delegation Chain Calculus (DCC) defines seven properties - six deterministic (authority narrowing, policy preservation, forensic reconstructibility, cascade containment, scope-action conformance, output schema conformance) and one probabilistic (intent preservation) - with four meta-theorems and one proposition establishing the practical infeasibility of deterministic intent verification. The Intent-Preserving Delegation Protocol (IPDP) enforces all seven properties at runtime through a non-LLM Delegation Authority Service. A three-point verification lifecycle achieves 100% combined TPR at 0% FPR on DelegationBench v4 (516 scenarios, 10 attack categories, 13 federal domains). Under black-box adversarial conditions, the DAS blocks 30/30 attacks with 0 false positives. Deterministic properties are unbreakable under adversarial stress testing; intent verification degrades to 13% against sophisticated paraphrasing. Fine-tuning the NLI model on 190 government delegation examples improves P2 from 1.7% to 88.3% TPR (5-fold cross-validated, F1=82.1%). Properties P1, P3-P7 are mechanically verified via TLA+ model checking across 2.7 million states with zero violations. Even when intent verification is evaded, the remaining six properties constrain the adversary to permitted API calls, conformant outputs, traceable actions, bounded cascades, and compliant behavior.

关键词: multi-agent systems, delegation chains, federal AI systems, intent verification, security framework, authorization, policy compliance, tool invocation

79. ❌ Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

作者: Giyeong Oh, Junghyun Lee, Jaehyun Park, Youngjae Yu, Wonho Bae, Junhyug Noh 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02766v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究在线DPO中的主动偏好学习（APL）与随机采样的比较，直接涉及LLMs、DPO和Post-training/SFT。与Alignment相关，因为评估了harmlessness、helpfulness和instruction-following，但非核心。其他关键词如MoE、SLMs、Scaling Laws等未在论文中提及或讨论，故评分为0。

!!! tip deepseek-chat TL;DR

该论文研究发现，在现代大语言模型的强预训练先验下，在线直接偏好优化中的主动偏好学习相比随机采样在代理胜率上提升有限，且可能导致通用能力下降，因此随机采样因其廉价多样性而更具实用性。

摘要翻译

现代大语言模型从网络规模的预训练中继承了强大的先验知识，这可能会限制训练后数据选择策略的提升空间。虽然主动偏好学习旨在优化在线直接偏好优化中的查询效率，但策略内候选池固有的丰富性往往使得简单的随机采样成为一个出乎意料的强大基线。我们基于无害性、有益性和指令遵循场景，利用奖励模型和LLM-as-a-judge代理评估了基于不确定性的主动偏好学习方法与随机采样的效果。研究发现，与随机采样相比，主动偏好学习在代理胜率上带来的提升微乎其微。关键的是，我们观察到一种分离现象：即使模型在标准基准测试中衡量的通用能力出现下降，其胜率仍可能提高。主动偏好学习在缓解这种能力崩溃或显著降低方差方面，并未比随机采样表现得更好。我们的研究结果表明，在存在强大预训练先验的情况下，主动选择所带来的计算开销难以与简单随机采样所提供的“廉价多样性”相抗衡。代码发布于 https://github.com/BootsofLagrangian/random-vs-apl。

摘要 (Abstract)

Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability – measured by standard benchmarks – degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the ``cheap diversity’’ provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.

关键词: Active Preference Learning, Direct Preference Optimization, online DPO, Random sampling, LLMs, post-training, capability collapse, reward models

作者: Pramod Bide, Sudhir Dhage, Mohammed Afaan Ansari, Rudresh Veerkhare 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02740v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究社交媒体中人为灾难的跨事件检测和话题演化挖掘，使用基于Wikipedia标题的推文分割和聚类方法，属于传统自然语言处理和信息检索领域，未涉及大模型、深度学习技术原理或AI for Science应用，与所有评分关键词均无直接关联。

!!! tip deepseek-chat TL;DR

该论文提出了一个CEED框架，用于检测社交媒体中人为灾难的跨事件并分析其话题演化，实验证明该框架在真实Twitter数据集上有效且精确。

摘要翻译

社交媒体被广泛用于全球信息分享，并有助于吸引世界范围的关注。当诸如强奸、人权游行、腐败、政治争议、化学袭击等社会敏感事件发生时，它们会引发全球民众的极大关注，导致Twitter等微博平台充斥与此类事件相关的推文。当一个事件发展时，在同一时间段内及周边往往会发生许多其他性质相似的事件。这些事件因其与主要事件性质的关联而被称为交叉事件（cross events）。传播此类交叉事件的信息有助于吸引公众参与，分享基于事件间相似性与差异性产生的多元观点。交叉事件检测对于判定事件性质至关重要。交叉事件存在支点（fulcrums points），即随着事件演变，讨论所围绕的核心议题，这些必须在主题演化（topic evolution）中予以考量。我们提出了交叉事件演化检测（Cross Event Evolution Detection, CEED）框架，该框架能检测因主要事件产生、在时间特性上相似的交叉事件。事件检测基于利用维基百科标题数据库（Wikipedia title database）进行的推文分割，以及基于相似性度量的分段聚类。交叉事件检测算法可揭示在时间和语境上重叠的事件，以评估这些交叉事件对人为故意或疏忽行为的影响。主题演化算法则从事件生命周期的角度呈现主题的变化。在真实Twitter数据集上的实验结果表明，我们所提出的框架在交叉事件演化过程中，对于交叉事件检测和主题演化算法均表现出有效性和精确性。

摘要 (Abstract)

Social media is widely used to share information globally and it also aids to gain attention from the world. When socially sensitive incidents like rape, human rights march, corruption, political controversy, chemical attacks occur, they gain immense attention from people all over the world, causing microblogging platforms like Twitter to get flooded with tweets related to such events. When an event evolves, many other events of a similar nature have happened in and around the same time frame. These are cross events because they are linked to the nature of the main event. Dissemination of information relating to such cross events helps in engaging the masses to share the varied views that emerge out of the similarities and differences between the events. Cross event detection is critical in determining the nature of events. Cross events have fulcrums points, i.e., topics around which the discussion is focused, as the event evolves which must be considered in topic evolution. We have proposed Cross Event Evolution Detection CEED framework which detects cross events that are similar with regards to their temporal nature resulting from main events. Event detection is based on the tweet segmentation using the Wikipedia title database and clustering segments based on a similarity measure. The cross event detection algorithm reveals events that overlap in both time and context to evaluate the effects of these cross events on deliberate negligent human actions. The topic evolution algorithm puts into perspective the change in topics for an events lifetime. The experimental results on a real Twitter data set demonstrate the effectiveness and precision of our proposed framework for both cross event detection and topic evolution algorithm during the evolution of cross events.

关键词: cross event detection, topic evolution, social media, Twitter, man-made disasters, clustering, similarity measure, event detection

81. ❌ Aligning Progress and Feasibility: A Neuro-Symbolic Dual Memory Framework for Long-Horizon LLM Agents

作者: Bin Wen, Ruoxuan Zhang, Yang Chen, Hongxia Xie, Lan-Zhe Guo 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02734v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM智能体在长视野决策任务中的性能提升，提出神经符号双记忆框架解决进度漂移和可行性违规问题。高度相关关键词：LLMs（论文基础模型）、LLM Agents（核心研究对象）、Chain of Thought/System 2 Thinking/Self-Correction（涉及推理和反思机制）。中等相关：Instruction Tuning/Alignment（涉及任务对齐）、Tool Use（涉及Python验证函数）。其他关键词如MoE、SLMs、训练方法、优化技术等均未涉及。

!!! tip deepseek-chat TL;DR

该论文针对LLM智能体在长视野任务中存在的进度漂移和可行性违规问题，提出了一种神经符号双记忆框架，通过解耦语义进度引导和逻辑可行性验证，显著提升了在ALFWorld、WebShop和TextCraft等环境中的性能表现。

摘要翻译

大语言模型（LLM）在具身操控和网络交互等长视野决策任务中展现出强大潜力。然而，智能体在复杂环境中常陷入无休止的试错循环或偏离主要目标。我们将这些失败归因于两类根本性错误：全局性进度漂移（Progress Drift）与局部可行性违反（Feasibility Violation）。现有方法通常试图用单一范式同时解决这两个问题。然而，这两类挑战本质不同：前者依赖于模糊的语义规划，而后者要求严格的逻辑约束与状态验证。这种单一范式方法的内在局限性，构成了现有模型处理长视野任务的根本挑战。基于此洞见，我们提出一种神经符号双记忆框架（Neuro-Symbolic Dual Memory Framework），该框架将语义进度引导与逻辑可行性验证显式解耦。具体而言，在推理阶段，该框架同步调用两种记忆机制：一方面，基于神经网络的进度记忆（Progress Memory）从成功轨迹中提取语义蓝图，以指导全局任务推进；另一方面，基于符号逻辑的可行性记忆（Feasibility Memory）利用从失败转移中合成的可执行Python验证函数，执行严格的逻辑验证。实验表明，该方法在ALFWorld、WebShop和TextCraft基准上显著优于现有竞争基线，同时大幅降低了无效动作率与平均轨迹长度。

摘要 (Abstract)

Large language models (LLMs) have demonstrated strong potential in long-horizon decision-making tasks, such as embodied manipulation and web interaction. However, agents frequently struggle with endless trial-and-error loops or deviate from the main objective in complex environments. We attribute these failures to two fundamental errors: global Progress Drift and local Feasibility Violation. Existing methods typically attempt to address both issues simultaneously using a single paradigm. However, these two challenges are fundamentally distinct: the former relies on fuzzy semantic planning, while the latter demands strict logical constraints and state validation. The inherent limitations of such a single-paradigm approach pose a fundamental challenge for existing models in handling long-horizon tasks. Motivated by this insight, we propose a Neuro-Symbolic Dual Memory Framework that explicitly decouples semantic progress guidance from logical feasibility verification. Specifically, during the inference phase, the framework invokes both memory mechanisms synchronously: on one hand, a neural-network-based Progress Memory extracts semantic blueprints from successful trajectories to guide global task advancement; on the other hand, a symbolic-logic-based Feasibility Memory utilizes executable Python verification functions synthesized from failed transitions to perform strict logical validation. Experiments demonstrate that this method significantly outperforms existing competitive baselines on ALFWorld, WebShop, and TextCraft, while drastically reducing the invalid action rate and average trajectory length.

关键词: LLM Agents, Long-horizon Decision-making, Neuro-Symbolic Framework, Dual Memory, Progress Drift, Feasibility Violation, Semantic Planning, Logical Validation

82. ❌ DeltaLogic: Minimal Premise Edits Reveal Belief-Revision Failures in Logical Reasoning Models

作者: Amit Dhanda 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02733v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文DeltaLogic研究小语言模型（SLMs）在逻辑推理中的信念修正能力，与’Small Language Models’高度相关（10分），因为评估了Qwen3-1.7B、Qwen3-0.6B、Qwen3-4B和Phi-4-mini-instruct等模型。与’Chain of Thought’和’System 2 Thinking’高度相关（10分），因为研究逻辑推理和深度推理能力。与’Large Language Models’有一定关联（8分），因为涉及大模型技术背景。与’Self-Correction’有一定关联（8分），因为研究模型在证据变化后的结论修正。与’Mechanistic Interpretability’有一定关联（5分），因为分析模型推理失败模式。其他关键词如MoE、Scaling Laws、Pre-training等与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了小语言模型在逻辑推理中的信念修正能力，发现模型在固定前提下的强推理能力并不保证在最小证据编辑后能进行有效的信念修正，揭示了逻辑推理模型在动态环境中的局限性。

摘要翻译

推理基准通常评估模型是否能从固定前提集中推导出正确答案，但未能充分衡量动态环境中至关重要的相关能力：最小证据变化下的信念修正。我们提出DeltaLogic，一种将自然语言推理示例转化为短修正片段的基准转换协议。每个片段首先要求基于前提P得出初始结论，随后应用最小编辑δ(P)，最终询问先前结论应保持稳定还是需要修正。我们基于FOLIO和ProofWriter实例化DeltaLogic，并通过受限标签评分评估小型因果语言模型。在已完成的30片段Qwen评估子集上，更强的初始推理能力仍不意味着更强的修正表现：Qwen3-1.7B达到0.667初始准确率但修正准确率仅为0.467，在黄金标签应变化的片段中惯性值升至0.600；而Qwen3-0.6B则陷入近乎完全弃答的状态。在此测试中，Qwen3-4B保持相同的惯性失效模式（初始0.650，修正0.450，惯性0.600），而Phi-4-mini-instruct表现显著更强（初始0.950，修正0.850），但仍存在明显的弃答与控制不稳定性。这些结果表明，固定前提下的逻辑能力并不能保证在局部证据编辑后实现规范的信念修正。因此，DeltaLogic针对的是与现有逻辑推理和信念更新基准形成互补的、独特且具有重要实践意义的推理能力。

摘要 (Abstract)

Reasoning benchmarks typically evaluate whether a model derives the correct answer from a fixed premise set, but they under-measure a closely related capability that matters in dynamic environments: belief revision under minimal evidence change. We introduce DeltaLogic, a benchmark transformation protocol that converts natural-language reasoning examples into short revision episodes. Each episode first asks for an initial conclusion under premises P, then applies a minimal edit δ(P), and finally asks whether the previous conclusion should remain stable or be revised. We instantiate DeltaLogic from FOLIO and ProofWriter and evaluate small causal language models with constrained label scoring. On a completed 30-episode Qwen evaluation subset, stronger initial reasoning still does not imply stronger revision behavior: Qwen3-1.7B reaches 0.667 initial accuracy but only 0.467 revision accuracy, with inertia rising to 0.600 on episodes where the gold label should change, while Qwen3-0.6B collapses into near universal abstention. There, Qwen3-4B preserves the same inertial failure pattern (0.650 initial, 0.450 revised, 0.600 inertia), whereas Phi-4-mini-instruct is substantially stronger (0.950 initial, 0.850 revised) but still exhibits non-trivial abstention and control instability. These results suggest that logical competence under fixed premises does not imply disciplined belief revision after local evidence edits. DeltaLogic therefore targets a distinct and practically important reasoning capability that complements existing logical inference and belief-updating benchmarks.

关键词: logical reasoning, belief revision, small language models, benchmark transformation, minimal premise edits, reasoning evaluation, model failure analysis, DeltaLogic

83. ❌ IndustryCode: A Benchmark for Industry Code Generation

作者: Puyu Zeng, Zhaoxi Wang, Zhixu Duan, Liang Feng, Shaobo Wang, Cunxiang Wang, Jinghang Wang, Bing Zhao, Hu Wei, Linfeng Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02729v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文核心是介绍一个名为IndustryCode的基准测试，专门用于评估大语言模型（LLMs）在跨领域、多编程语言的工业代码生成任务中的表现。摘要明确提到LLMs是核心驱动力，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。论文涉及工业应用（如金融、自动化、航空航天），这属于’AI for Science OR Bioinformatics OR Cheminformatics’的广义应用范畴，但并非其核心生物信息学或化学信息学焦点，因此给5分。其他关键词（如MoE、SFT、RAG等）均未在摘要中提及或暗示，与论文的基准测试和评估主题无关，故给0分。

!!! tip deepseek-chat TL;DR

该论文提出了IndustryCode基准测试，用于评估大语言模型在跨工业领域和多编程语言的代码生成任务中的泛化能力，其中表现最佳的模型Claude 4.5 Opus在子问题和主问题上的准确率分别为68.1%和42.5%。

摘要翻译

大型语言模型（LLM）的代码生成与理解能力已成为工业智能与决策优化的核心驱动力，在金融、自动化、航空航天等领域得到广泛应用。尽管近期研究进展已证明LLM在通用代码生成方面具有显著潜力，但现有基准测试主要局限于单一领域和编程语言。因此，这些基准难以有效评估实际工业应用所需的泛化能力，亦无法反映复杂工业场景所要求的编码熟练度。为弥补这一空白，我们提出了IndustryCode——首个涵盖多工业领域与多编程语言的综合性基准。IndustryCode包含源自125项核心工业挑战的579个子问题，并配备严谨的问题描述与测试用例。其覆盖领域广泛，包括金融、自动化、航空航天、遥感等，同时整合了MATLAB、Python、C++、Stata等多种编程语言。在评估中，表现最佳的模型Claude 4.5 Opus在子问题上总体准确率达到68.1%，在主问题上达到42.5%。本基准数据集与自动化评估代码将在论文录用后公开发布。

摘要 (Abstract)

Code generation and comprehension by Large Language Models (LLMs) have emerged as core drivers of industrial intelligence and decision optimization, finding widespread application in fields such as finance, automation, and aerospace. Although recent advancements have demonstrated the remarkable potential of LLMs in general code generation, existing benchmarks are mainly confined to single domains and languages. Consequently, they fail to effectively evaluate the generalization capabilities required for real-world industrial applications or to reflect the coding proficiency demanded by complex industrial scenarios. To bridge this gap, we introduce IndustryCode, the first comprehensive benchmark designed to span multiple industrial domains and programming languages. IndustryCode comprises 579 sub-problems derived from 125 primary industrial challenges, accompanied by rigorous problem descriptions and test cases. It covers a wide range of fields, including finance, automation, aerospace, and remote sensing-and incorporates diverse programming languages such as MATLAB, Python, C++, and Stata. In our evaluation, the top-performing model, Claude 4.5 Opus, achieved an overall accuracy of 68.1% on sub-problems and 42.5% main problems. The benchmark dataset and automated evaluation code will be made publicly available upon acceptance.

关键词: IndustryCode, benchmark, code generation, Large Language Models, industrial applications, multi-domain, programming languages, evaluation

84. ❌ GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning

作者: DeepReinforce Team, Xiaoya Li, Xiaofei Sun, Guoyin Wang, Songqiao Su, Chris Shum, Jiwei Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02721v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	15.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是开发GrandCode多智能体强化学习系统用于竞争性编程，直接高度相关于’LLM Agents/Autonomous Agents/Agentic Workflow’和’Multi-agent Systems/Agent Coordination’（15分）。摘要明确提到’post-training’，与’Post-training/Supervised Fine-tuning/SFT’高度相关（10分）。论文涉及大模型在编程领域的应用，与’Large Language Models/LLMs/Foundation Models’有一定关联（8分）。其他关键词如MoE、SLMs、Scaling Laws、Pre-training、RLHF、RAG、CoT、Quantization等未在摘要中提及或与论文内容无关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了如何通过多智能体强化学习系统（GrandCode）解决AI在竞争性编程中表现不如人类的问题，并实现了在Codeforces现场比赛中击败所有人类参与者（包括传奇大师）的突破性成果。

摘要翻译

竞技编程仍是人类在编码领域对抗人工智能的最后几个堡垒之一。迄今为止，最先进的人工智能系统在竞技编程中仍落后于顶尖人类选手：近期最佳成果——谷歌的Gemini~3 Deep Think系统，即使在非实时竞赛条件下评估，也仅获得第八名。本研究提出GrandCode，一个专为竞技编程设计的多智能体强化学习系统。其能力归因于两个关键因素：（1）该系统协调多种智能体模块（假设提出、求解器、测试生成器、总结器等），并通过训练后优化与在线测试时强化学习协同提升；（2）我们引入了专为多阶段智能体推演设计的Agentic GRPO方法，该方法能处理延迟奖励及智能体强化学习中普遍存在的严重离策略偏移问题。GrandCode是首个在竞技编程实时竞赛中持续击败所有人类选手的人工智能系统：在最近三场Codeforces实时竞赛（即第1087轮、1088轮与1089轮，分别于2026年3月21日、28日及29日举行）中，GrandCode均夺得榜首，击败了包括传奇级特级大师在内的所有人类参赛者。该系统表明，人工智能已在最具竞争力的编程任务上超越了最强大的人类程序员。

摘要 (Abstract)

Competitive programming remains one of the last few human strongholds in coding against AI. The best AI system to date still underperforms the best humans competitive programming: the most recent best result, Google’s Gemini3 Deep Think, attained 8th place even not being evaluated under live competition conditions. In this work, we introduce GrandCode, a multi-agent RL system designed for competitive programming. The capability of GrandCode is attributed to two key factors: (1) It orchestrates a variety of agentic modules (hypothesis proposal, solver, test generator, summarization, etc) and jointly improves them through post-training and online test-time RL; (2) We introduce Agentic GRPO specifically designed for multi-stage agent rollouts with delayed rewards and the severe off-policy drift that is prevalent in agentic RL. GrandCode is the first AI system that consistently beats all human participants in live contests of competitive programming: in the most recent three Codeforces live competitions, i.e., Round1087 (Mar 21, 2026), Round~~1088 (Mar 28, 2026), and Round~~1089 (Mar 29, 2026), GrandCode placed first in all of them, beating all human participants, including legendary grandmasters. GrandCode shows that AI systems have reached a point where they surpass the strongest human programmers on the most competitive coding tasks.

关键词: competitive programming, multi-agent RL system, agentic modules, post-training, online test-time RL, Agentic GRPO, Codeforces, grandmaster level

85. ❌ MOMO: Mars Orbital Model Foundation Model for Mars Orbital Applications

作者: Mirali Purohit, Bimal Gajera, Irish Mehta, Bhanu Tokas, Jacob Adler, Steven Lu, Scott Dickenshied, Serina Diniega, Brian Bue, Umaa Rebbapragada, Hannah Kerner 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02719v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心是构建火星遥感基础模型MOMO，属于AI for Science应用，直接匹配’AI for Science OR Bioinformatics OR Cheminformatics’（10分）。模型采用基础模型架构，匹配’Large Language Models OR LLMs OR Foundation Models’（10分）。核心创新是模型合并方法，匹配’Model Merging OR Model Soups OR Weight Averaging’（10分）。涉及预训练大规模数据，匹配’Pre-training OR Continual Pre-training OR Domain Adaptation’（10分）。使用高质量数据（~12 million samples），与’Scaling Laws AND Data Quality’有一定关联（5分）。其他关键词如MoE、SFT、RAG、量化等未涉及，均为0分。

!!! tip deepseek-chat TL;DR

该研究提出了首个火星遥感多传感器基础模型MOMO，通过新颖的等验证损失策略合并不同分辨率传感器模型，在9个下游任务上超越了现有基线，特别是在分割任务上表现显著提升。

摘要翻译

我们推出MOMO，这是首个面向火星遥感的多传感器基础模型。MOMO采用模型融合技术，整合了从三种关键火星传感器（HiRISE、CTX和THEMIS）独立学习到的表征，其分辨率覆盖范围从0.25米/像素到100米/像素。我们方法的核心是新颖的等验证损失策略，该策略在通过任务算术进行融合前，基于验证损失的相似性对齐不同传感器的检查点，从而确保模型在相容的收敛阶段被合并，以提升稳定性和泛化能力。我们使用从火星轨道数据中精心筛选的大规模高质量数据集（约1200万个样本）对MOMO进行预训练，并在Mars-Bench的9个下游任务上对其评估。与基于ImageNet预训练的模型、地球观测基础模型、传感器专用预训练模型以及全监督基线相比，MOMO实现了更优的整体性能。尤其在分割任务上，MOMO展现出持续且显著的性能提升。我们的结果表明，通过最优检查点选择策略进行模型融合，为构建多分辨率数据的基础模型提供了一条有效途径。模型权重、预训练代码、预训练数据及评估代码已公开于：https://github.com/kerner-lab/MOMO。

摘要 (Abstract)

We introduce MOMO, the first multi-sensor foundation model for Mars remote sensing. MOMO uses model merge to integrate representations learned independently from three key Martian sensors (HiRISE, CTX, and THEMIS), spanning resolutions from 0.25 m/pixel to 100 m/pixel. Central to our method is our novel Equal Validation Loss (EVL) strategy, which aligns checkpoints across sensors based on validation loss similarity before fusion via task arithmetic. This ensures models are merged at compatible convergence stages, leading to improved stability and generalization. We train MOMO on a large-scale, high-quality corpus of $\sim 12$ million samples curated from Mars orbital data and evaluate it on 9 downstream tasks from Mars-Bench. MOMO achieves better overall performance compared to ImageNet pre-trained, earth observation foundation model, sensor-specific pre-training, and fully-supervised baselines. Particularly on segmentation tasks, MOMO shows consistent and significant performance improvement. Our results demonstrate that model merging through an optimal checkpoint selection strategy provides an effective approach for building foundation models for multi-resolution data. The model weights, pretraining code, pretraining data, and evaluation code are available at: https://github.com/kerner-lab/MOMO.

关键词: Mars foundation model, multi-sensor fusion, model merging, remote sensing, task arithmetic, Equal Validation Loss, Mars orbital data, downstream tasks

86. ❌ Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

作者: Yihong Dong, Xiaoha Jian, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02709v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLMs的形式推理能力评估，与’Large Language Models’高度相关（10分），涉及推理过程评估，与’Chain of Thought’和’System 2 Thinking’有一定关联（各8分）。其他关键词如MoE、SLMs、训练方法、优化技术、应用领域等均未在摘要中提及，故评0分。

!!! tip deepseek-chat TL;DR

该论文通过Chomsky层次结构系统评估大语言模型的形式推理能力，发现模型性能与任务复杂度分层相关，且当前LLMs在处理形式任务时效率远低于传统算法程序。

摘要翻译

大型语言模型的形式推理能力对于推进自动化软件工程至关重要。然而，现有针对大型语言模型的基准测试缺乏基于计算与复杂度的系统性评估，这在理解其形式推理能力方面留下了关键空白。因此，我们仍不清楚最先进的大型语言模型是否能掌握计算理论所定义的形式语言的结构化、层次化复杂性。为此，我们引入了ChomskyBench，这是一个通过乔姆斯基层级视角系统评估大型语言模型的基准。与先前使用向量化分类方法评估神经网络的研究不同，ChomskyBench首次结合了完整的乔姆斯基层级覆盖、通过自然语言进行的流程追踪评估，以及确定性的符号可验证性。ChomskyBench由一套全面的语言识别与生成任务组成，旨在测试每个层级的能力。大量实验表明，模型性能存在明确的分层现象，且与层级的复杂度相关联。我们的分析揭示了任务难度增加会显著影响推理长度和性能的直接关系。此外，我们发现，虽然更大的模型和先进的推理方法能带来显著的相对性能提升，但它们面临着严重的效率瓶颈：要达到实用的可靠性需要难以承受的计算成本，这表明当前的限制源于效率低下而非绝对的能力上限。时间复杂度分析进一步指出，对于这些形式化任务，大型语言模型的效率显著低于传统的算法程序。这些结果界定了当前大型语言模型的实际局限，突显了传统软件工具不可或缺的作用，并为指导开发具有更强大形式推理能力的未来大型语言模型提供了洞见。

摘要 (Abstract)

The formal reasoning capabilities of LLMs are crucial for advancing automated software engineering. However, existing benchmarks for LLMs lack systematic evaluation based on computation and complexity, leaving a critical gap in understanding their formal reasoning capabilities. Therefore, it is still unknown whether SOTA LLMs can grasp the structured, hierarchical complexity of formal languages as defined by Computation Theory. To address this, we introduce ChomskyBench, a benchmark for systematically evaluating LLMs through the lens of Chomsky Hierarchy. Unlike prior work that uses vectorized classification for neural networks, ChomskyBench is the first to combine full Chomsky Hierarchy coverage, process-trace evaluation via natural language, and deterministic symbolic verifiability. ChomskyBench is composed of a comprehensive suite of language recognition and generation tasks designed to test capabilities at each level. Extensive experiments indicate a clear performance stratification that correlates with the hierarchy’s levels of complexity. Our analysis reveals a direct relationship where increasing task difficulty substantially impacts both inference length and performance. Furthermore, we find that while larger models and advanced inference methods offer notable relative gains, they face severe efficiency barriers: achieving practical reliability would require prohibitive computational costs, revealing that current limitations stem from inefficiency rather than absolute capability bounds. A time complexity analysis further indicates that LLMs are significantly less efficient than traditional algorithmic programs for these formal tasks. These results delineate the practical limits of current LLMs, highlight the indispensability of traditional software tools, and provide insights to guide the development of future LLMs with more powerful formal reasoning capabilities.

关键词: Large Language Models, Formal Reasoning, Chomsky Hierarchy, Benchmark Evaluation, Computation Theory, Language Recognition, Inference Efficiency, Automated Software Engineering

87. ❌ V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

作者: Junwei You, Pei Li, Zhuoyu Jiang, Weizhe Tang, Zilin Huang, Rui Gan, Jiaxi Liu, Yan Zhao, Sikai Chen, Bin Ran 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02710v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）在自动驾驶中的应用，与’Large Language Models’高度相关（10分）。论文提出V2X-MoE基准模型，明确使用Mixture of Experts（MoE）架构，与’Mixture of Experts’高度相关（10分）。模型采用LoRA专家进行参数高效微调，与’PEFT/LoRA’高度相关（10分）。论文涉及推理和规划任务，与’Chain of Thought’和’System 2 Thinking’有一定关联（各5分）。研究自动驾驶中的多视角推理，与’LLM Agents’有一定关联（5分）。其他关键词如SLMs、Scaling Laws、Pre-training、RLHF等未在论文中涉及，评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对自动驾驶中现有基准以自我为中心的问题，提出了V2X-QA数据集和基准，用于评估多模态大语言模型在车辆侧、基础设施侧和协同视角下的性能，并开发了V2X-MoE基准模型，通过显式视图路由和LoRA专家实现了多视角推理的显著改进。

摘要翻译

多模态大语言模型（MLLMs）在自动驾驶领域展现出巨大潜力，但现有基准测试大多以车辆自身为中心，难以系统评估模型在路侧基础设施中心及协同驾驶条件下的性能。本研究提出V2X-QA——一个面向真实场景的数据集与基准测试框架，用于从车辆端、基础设施端及协同视角评估MLLMs。该框架采用视角解耦的评估机制，在统一的多选题问答（MCQA）体系下实现纯车辆视角、纯基础设施视角及协同驾驶条件的可控对比。基准测试涵盖感知、预测、推理与规划三大范畴的十二类任务，通过专家验证的MCQA标注构建，支持对视角依赖能力的细粒度诊断。对十种代表性前沿开源与闭源模型的测试表明：视角可及性显著影响模型表现，且基础设施端推理能有效支撑宏观交通态势理解。结果同时揭示协同推理仍面临挑战，因其需要跨视角对齐与证据融合，而非简单增加视觉输入。为应对这些挑战，我们提出基准对齐的基线模型V2X-MoE，该模型采用显式视角路由机制与针对特定视角的LoRA专家模块。V2X-MoE的优异表现进一步证明，显式的视角专业化是自动驾驶多视角推理的可行方向。总体而言，V2X-QA为研究网联自动驾驶中的多视角推理、系统可靠性与协同物理智能提供了基础。数据集与V2X-MoE资源已公开于：https://github.com/junwei0001/V2X-QA。

摘要 (Abstract)

Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.

关键词: Multimodal Large Language Models, Autonomous Driving, V2X-QA, Multi-view Reasoning, Mixture of Experts, LoRA, Cooperative Driving, Benchmark Evaluation

88. ❌ Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints

作者: Rodney Jehu-Appiah 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02699v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM推理能力，通过词汇约束实验探索如何改善推理性能，与’Large Language Models’、‘Chain of Thought’、‘System 2 Thinking’高度相关（10分），涉及’Self-Correction’和’Mechanistic Interpretability’（5分），其他关键词如模型架构、训练方法、应用领域等均未涉及（0分）。

!!! tip deepseek-chat TL;DR

该研究通过实验发现，对大型语言模型施加简单的词汇约束（如禁止使用填充词）比深层的语言约束更能有效改善其推理能力，因为这种约束能迫使模型偏离默认生成路径，起到输出正则化的作用。

摘要翻译

先前一项研究报告称，E-Prime（即不使用动词"to be"的英语）会选择性改变语言模型的推理能力，其跨模型相关性表明存在一种与所移除词汇相关的结构性特征。本研究设计了包含主动对照的复现实验，以检验所提出的机制：通过特定词汇-认知映射实现的认知重构。实验在六个模型和七项推理任务中测试了五种条件（无约束对照组、E-Prime组、No-Have组、精细化元认知提示组、中性填充词禁用组），共获得15,600次试验数据（经合规性筛选后为11,919次）。认知重构假说的所有预测均未得到证实。全部四种处理条件均优于对照组（83.0%），包括两个原本预测应无效果的主动对照组。中性填充词禁用组（禁止使用"very"“just"等与逻辑推理无关的词汇）产生了最大幅度的改进（+6.7个百分点），而E-Prime组改进最小（+3.7个百分点）。四种处理条件的效果排序与理论深度完全成反比。跨模型相关性特征未能复现（平均r=0.005）。这些结果与更简单的机制相符：任何迫使模型偏离默认生成路径的约束都充当了输出正则化器，通过干扰流畅但肤浅的响应模式来提升推理能力。最浅层的约束效果最佳，因为它们以最小的概念干扰施加了监控负荷。本研究将这些发现作为通过证伪实现科学发现的典型案例予以呈现。

摘要 (Abstract)

A previous study reported that E-Prime (English without the verb “to be”) selectively altered reasoning in language models, with cross-model correlations suggesting a structural signature tied to which vocabulary was removed. I designed a replication with active controls to test the proposed mechanism: cognitive restructuring through specific vocabulary-cognition mappings. The experiment tested five conditions (unconstrained control, E-Prime, No-Have, elaborated metacognitive prompt, neutral filler-word ban) across six models and seven reasoning tasks (N=15,600 trials, 11,919 after compliance filtering). Every prediction from the cognitive restructuring hypothesis was disconfirmed. All four treatments outperformed the control (83.0%), including both active controls predicted to show null effects. The neutral filler-word ban, banning words like “very” and “just” with no role in logical inference, produced the largest improvement (+6.7 pp), while E-Prime produced the smallest (+3.7 pp). The four conditions ranked in perfect inverse order of theoretical depth. The cross-model correlation signature did not replicate (mean r=0.005). These results are consistent with a simpler mechanism: any constraint that forces a model off its default generation path acts as an output regularizer, improving reasoning by disrupting fluent but shallow response patterns. The shallowest constraints work best because they impose monitoring load with minimal conceptual disruption. I present these findings as a case study in discovery through disconfirmation.

关键词: Large Language Models, Reasoning, Vocabulary Constraints, Output Regularization, Cognitive Restructuring, E-Prime, Filler-word Ban, Model Performance

89. ❌ DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

作者: Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Weibin Yao, Joey Tianyi Zhou, Jianshu Li, Yin Yan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02694v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出DocShield框架，专注于文档伪造检测，核心创新是Cross-Cues-aware Chain of Thought (CCT)机制，这与’Chain of Thought’高度相关（10分），并涉及’System 2 Thinking’的深度推理（8分）。框架采用’agentic reasoning’，与’LLM Agents’相关（8分）。方法强调可解释性，与’Explainable AI’有一定关联（5分）。论文未明确使用大模型（LLMs）或深度学习技术原理，也未涉及其他关键词如MoE、SFT、RAG等，因此这些关键词评分为0。

!!! tip deepseek-chat TL;DR

论文针对生成式AI导致的文本中心图像伪造问题，提出了DocShield框架，通过视觉-逻辑协同推理和证据链机制，在文档伪造检测、定位和解释任务上显著超越了现有方法。

摘要翻译

生成式人工智能的快速发展使得以文本为中心的图像伪造日益逼真，这对文档安全构成了重大挑战。现有的取证方法主要依赖视觉线索，缺乏基于证据的推理来揭示细微的文本篡改。检测、定位和解释通常被视为孤立的任务，限制了方法的可靠性和可解释性。为应对这些挑战，我们提出了DocShield，这是首个将以文本为中心的伪造分析构建为视觉-逻辑协同推理问题的统一框架。其核心是一种新颖的跨线索感知思维链机制，能够进行隐式的智能体推理，迭代地交叉验证视觉异常与文本语义，从而产生一致且基于证据的取证分析。我们进一步引入了基于GRPO优化的加权多任务奖励机制，以对齐推理结构、空间证据和真实性预测。作为框架的补充，我们构建了RealText-V1，这是一个多语言的类文档文本图像数据集，包含像素级篡改掩码和专家级的文本解释。大量实验表明，DocShield显著优于现有方法，在T-IC13基准上将宏平均F1分数较专用框架提升了41.4%，较GPT-4o提升了23.4%，并在具有挑战性的T-SROIE基准上取得了持续的性能提升。我们的数据集、模型和代码将公开发布。

摘要 (Abstract)

The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure, spatial evidence, and authenticity prediction. Complementing the framework, we construct RealText-V1, a multilingual dataset of document-like text images with pixel-level manipulation masks and expert-level textual explanations. Extensive experiments show DocShield significantly outperforms existing methods, improving macro-average F1 by 41.4% over specialized frameworks and 23.4% over GPT-4o on T-IC13, with consistent gains on the challenging T-SROIE benchmark. Our dataset, model, and code will be publicly released.

关键词: document safety, text-centric forgery, visual-logical co-reasoning, Chain of Thought, agentic reasoning, forensic analysis, multilingual dataset, manipulation detection

90. ❌ Efficient3D: A Unified Framework for Adaptive and Debiased Token Reduction in 3D MLLMs

作者: Yuhui Lin, Siyue Yu, Yuxing Yang, Guangliang Cheng, Jimin Xiao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02689v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于3D多模态大语言模型（MLLMs）的推理加速，通过视觉令牌剪枝减少计算开销。核心相关关键词：1）‘Large Language Models’（10分）：论文明确研究MLLMs，属于大语言模型范畴；2）‘Quantization OR Model Compression’（8分）：视觉令牌剪枝是一种模型压缩技术，减少模型计算量；3）‘Speculative Decoding OR Inference Acceleration’（8分）：论文目标是通过令牌剪枝加速3D MLLMs推理。其他关键词如MoE、SLMs、对齐、RAG等未涉及，论文未提及科学领域应用（如生物信息学），因此大部分关键词评分为0。

!!! tip deepseek-chat TL;DR

该论文提出了Efficient3D框架，通过去偏视觉令牌重要性估计和自适应令牌重平衡策略，在保持准确性的同时加速3D多模态大语言模型的推理。

摘要翻译

多模态大语言模型（MLLMs）的最新进展已将其推理能力扩展至三维领域，实现了细粒度的空间理解。然而，三维多模态大语言模型参数量庞大且输入特征维度高，带来了显著的推理开销，限制了其在资源受限平台上的实际部署。为克服这一局限，本文提出Efficient3D——一个用于视觉令牌剪枝的统一框架，可在保持竞争力的准确度同时加速三维多模态大语言模型。该框架引入了去偏视觉令牌重要性评估器（Debiased Visual Token Importance Estimator, DVTIE）模块，该模块在注意力聚合过程中考虑浅层初始层的影响，从而为视觉令牌生成更可靠的重要性预测。此外，本文还开发了自适应令牌再平衡（Adaptive Token Rebalancing, ATR）策略，能够依据场景复杂度动态调整剪枝强度，以保持语义完整性并实现跨层注意力的均衡。二者协同实现了上下文感知的令牌精简，在降低计算量的同时保留了关键语义信息。在ScanRefer、Multi3DRefer、Scan2Cap、ScanQA和SQA3D这五个具有代表性的三维视觉与语言基准上进行的全面实验表明，Efficient3D相比未剪枝的基线模型取得了更优的性能，在Scan2Cap数据集上实现了+2.57%的CIDEr指标提升。因此，Efficient3D为三维多模态大语言模型的高效推理提供了一个可扩展且有效的解决方案。代码已发布于：https://github.com/sol924/Efficient3D

摘要 (Abstract)

Recent advances in Multimodal Large Language Models (MLLMs) have expanded reasoning capabilities into 3D domains, enabling fine-grained spatial understanding. However, the substantial size of 3D MLLMs and the high dimensionality of input features introduce considerable inference overhead, which limits practical deployment on resource constrained platforms. To overcome this limitation, this paper presents Efficient3D, a unified framework for visual token pruning that accelerates 3D MLLMs while maintaining competitive accuracy. The proposed framework introduces a Debiased Visual Token Importance Estimator (DVTIE) module, which considers the influence of shallow initial layers during attention aggregation, thereby producing more reliable importance predictions for visual tokens. In addition, an Adaptive Token Rebalancing (ATR) strategy is developed to dynamically adjust pruning strength based on scene complexity, preserving semantic completeness and maintaining balanced attention across layers. Together, they enable context-aware token reduction that maintains essential semantics with lower computation. Comprehensive experiments conducted on five representative 3D vision and language benchmarks, including ScanRefer, Multi3DRefer, Scan2Cap, ScanQA, and SQA3D, demonstrate that Efficient3D achieves superior performance compared with unpruned baselines, with a +2.57% CIDEr improvement on the Scan2Cap dataset. Therefore, Efficient3D provides a scalable and effective solution for efficient inference in 3D MLLMs. The code is released at: https://github.com/sol924/Efficient3D

关键词: 3D MLLMs, visual token pruning, inference acceleration, debiased importance estimator, adaptive token rebalancing, efficient inference, multimodal large language models, computational efficiency

91. ❌ Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

作者: Yuheng Zhang, Mingyue Huo, Minghao Zhu, Mengxue Zhang, Nan Jiang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02686v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究RLHF中的奖励模型漏洞攻击，与’RLHF OR RLAIF OR Direct Preference Optimization OR DPO’高度相关（10分），因为直接研究RLHF流程中的奖励模型攻击。与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分），因为攻击针对基于LLM的奖励模型（如Skywork-Reward-V2-Llama-3.1-8B）。其他关键词如MoE、SLMs、Scaling Laws、PEFT等与论文的对抗攻击研究内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了Token Mapping Perturbation Attack（TOMPA），一种直接在token空间进行对抗优化的攻击框架，揭示了RLHF流程中奖励模型存在超越语义空间的系统性漏洞，能够生成无意义文本却获得极高奖励分数。

摘要翻译

奖励模型（Reward Models, RMs）在基于人类反馈的强化学习（Reinforcement Learning from Human Feedback, RLHF）中被广泛用作优化目标，但它们仍然容易受到奖励破解攻击。现有攻击主要在语义空间内操作，通过构建人类可读的对抗性输出来利用奖励模型的偏差。在本研究中，我们引入了一种根本不同的范式：令牌映射扰动攻击（Token Mapping Perturbation Attack, TOMPA），这是一个直接在令牌空间中进行对抗性优化的框架。通过绕过策略模型与奖励模型之间标准的解码-重新令牌化接口，TOMPA使攻击策略能够在原始令牌序列而非连贯的自然语言上进行优化。仅使用黑盒标量反馈，TOMPA便能自动发现非语言性的令牌模式，这些模式能在多个前沿奖励模型中引发极高的奖励值。具体而言，当针对Skywork-Reward-V2-Llama-3.1-8B模型时，TOMPA将GPT-5参考答案的奖励值提升了近一倍，并在98.0%的提示词上表现更优。尽管获得了这些高分，生成的输出却退化为无意义的文本，这表明奖励模型可以在语义体系之外被系统性地利用，从而暴露了当前RLHF流程中的一个关键脆弱性。

摘要 (Abstract)

Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.

关键词: Reward Models, RLHF, Adversarial Attacks, Token Space, Reward Hacking, Token Mapping Perturbation Attack, TOMPA, Vulnerability

92. ❌ Finding Belief Geometries with Sparse Autoencoders

作者: Matthew Levinson 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02685v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（Gemma-2-9B）内部表示的几何结构，属于大模型技术原理的创新，与’Large Language Models’和’Mechanistic Interpretability’高度相关（10分）。其他关键词如MoE、SLMs、训练方法、推理技术、应用领域等均未在摘要中涉及，故评0分。

!!! tip deepseek-chat TL;DR

该论文研究大语言模型（Gemma-2-9B）内部表示中是否存在类似信念状态的几何结构，通过稀疏自编码器和几何分析方法，初步发现了5个具有候选信念几何结构的特征簇。

摘要翻译

理解内部表征的几何结构是机械可解释性研究的核心目标。先前研究表明，在隐马尔可夫模型生成的序列上训练的变换器模型，会将其残差流中的概率信念状态编码为单纯形几何结构，其顶点对应于潜在的生成状态。然而，在自然文本上训练的大型语言模型是否发展出类似的几何表征，仍是一个悬而未决的问题。
我们提出了一种在变换器表征中发现候选单纯形结构子空间的流程，该流程结合了稀疏自编码器、SAE特征的k-子空间聚类以及使用AANet的单纯形拟合方法。我们在一个已知信念状态几何结构的多部隐马尔可夫模型上训练的变换器上验证了该流程。将其应用于Gemma-2-9B模型，我们识别出13个展现候选单纯形几何结构（K ≥ 3）的优先聚类。
一个关键挑战在于区分真正的信念状态编码与平铺伪影：潜在变量可以张成一个单纯形形状的子空间，但其混合坐标却可能不携带超越任何单个特征的预测信号。因此，我们采用重心预测作为主要的判别测试。在这13个优先聚类中，有3个在近顶点样本上表现出高度显著的优势（Wilcoxon p < 10⁻¹⁴），另有4个在单纯形内部样本上表现出优势。总计有5个不同的真实聚类通过了至少一项测试，而所有零假设聚类均未通过任何测试。其中一个聚类（768_596）还在数据集中获得了最高的因果干预分数。这是唯一一个被动预测与主动干预结果一致的案例。我们提出这些发现作为初步证据，表明Gemma-2-9B的表征空间中存在类似信念的真实几何结构，并指出了确认此解释所需的结构化评估方法。

摘要 (Abstract)

Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric prediction as our primary discriminating test. Among the 13 priority clusters, 3 exhibit a highly significant advantage on near-vertex samples (Wilcoxon $p < 10^{-14}$) and 4 on simplex-interior samples. Together 5 distinct real clusters pass at least one split, while no null cluster passes either. One cluster, 768_596, additionally achieves the highest causal steering score in the dataset. This is the only case where passive prediction and active intervention converge. We present these findings as preliminary evidence that genuine belief-like geometry exists in Gemma-2-9B’s representation space, and identify the structured evaluation that would be required to confirm this interpretation.

关键词: mechanistic interpretability, large language models, sparse autoencoders, belief geometry, transformer representations, simplex geometry, Gemma-2-9B, internal representations

93. ❌ Eligibility-Aware Evidence Synthesis: An Agentic Framework for Clinical Trial Meta-Analysis

作者: Yao Zhao, Zhiyue Zhang, Yanxun Xu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02678v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文提出EligMeta框架，核心是LLM驱动的智能体（agentic framework）用于临床证据合成，高度相关关键词：LLMs（10分，框架核心组件）、LLM Agents（10分，框架本质是agentic workflow）、AI for Science（10分，应用于生物信息学/临床医学）。中等相关：Chain of Thought/System 2 Thinking（5分，LLM生成可解释规则涉及推理）、Tool Use（5分，LLM解析试验元数据可视为工具使用）、Explainable AI（5分，强调可解释性）。其余关键词与论文技术细节（如MoE、量化、对齐训练等）无关，得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一个名为EligMeta的智能体框架，通过整合LLM驱动的自动试验发现和基于资格标准相似性的加权方法，解决了临床证据合成中传统元分析忽略临床兼容性的问题，并在胃癌和奥拉帕利不良事件分析中验证了其有效性和可重复性。

摘要翻译

临床证据综合需要从大型注册库中识别相关试验，并整合考虑人群差异的结果。尽管近期基于大语言模型的方法已实现系统评价部分环节的自动化，但其尚未支持端到端的证据综合流程。此外，传统荟萃分析仅依据统计精度对研究进行加权，未考虑纳入标准所反映的临床兼容性。本研究提出EligMeta框架——一种集成自动化试验发现与纳入标准感知型荟萃分析的智能体框架，能够将自然语言查询转化为可复现的试验筛选流程，并将纳入标准匹配度纳入研究权重计算，从而生成针对特定队列的合并效应估计值。EligMeta采用分离大语言模型推理与确定性执行的混合架构：大语言模型从自然语言查询生成可解释规则并对试验元数据进行模式约束解析，而所有逻辑运算、权重计算及统计合并均通过确定性流程执行以确保可复现性。该框架结构化处理纳入标准，并计算基于相似度的研究权重以反映目标试验与对照试验间的人群匹配度。在胃癌领域分析中，EligMeta通过基于规则的筛选从4,044项候选试验中精确定位39项临床相关研究，完整覆盖指南引用的全部13项试验。在奥拉帕利不良事件的四项试验荟萃分析中，纳入标准感知加权使合并风险比从传统Mantel-Haenszel估计的2.18（95% CI: 1.71-2.79）调整为1.97（95% CI: 1.76-2.20），量化展示了纳入标准匹配度整合的实际影响。EligMeta实现了自动化试验发现与纳入标准感知型荟萃分析的有机结合，为精准医学领域的证据综合提供了可扩展且可复现的方法学框架。

摘要 (Abstract)

Clinical evidence synthesis requires identifying relevant trials from large registries and aggregating results that account for population differences. While recent LLM-based approaches have automated components of systematic review, they do not support end-to-end evidence synthesis. Moreover, conventional meta-analysis weights studies by statistical precision without considering clinical compatibility reflected in eligibility criteria. We propose EligMeta, an agentic framework that integrates automated trial discovery with eligibility-aware meta-analysis, translating natural-language queries into reproducible trial selection and incorporating eligibility alignment into study weighting to produce cohort-specific pooled estimates. EligMeta employs a hybrid architecture separating LLM-based reasoning from deterministic execution: LLMs generate interpretable rules from natural-language queries and perform schema-constrained parsing of trial metadata, while all logical operations, weight computations, and statistical pooling are executed deterministically to ensure reproducibility. The framework structures eligibility criteria and computes similarity-based study weights reflecting population alignment between target and comparator trials. In a gastric cancer landscape analysis, EligMeta reduced 4,044 candidate trials to 39 clinically relevant studies through rule-based filtering, recovering all 13 guideline-cited trials. In an olaparib adverse events meta-analysis across four trials, eligibility-aware weighting shifted the pooled risk ratio from 2.18 (95% CI: 1.71-2.79) under conventional Mantel-Haenszel estimation to 1.97 (95% CI: 1.76-2.20), demonstrating quantifiable impact of incorporating eligibility alignment. EligMeta bridges automated trial discovery with eligibility-aware meta-analysis, providing a scalable and reproducible framework for evidence synthesis in precision medicine.

关键词: Clinical Trial Meta-Analysis, Agentic Framework, Large Language Models, Eligibility Criteria, Evidence Synthesis, Precision Medicine, Automated Trial Discovery, Reproducible Framework

94. ❌ Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems

作者: Kavana Venkatesh, Jiaming Cui 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02674v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM多智能体系统的协调动态和集体认知规律，与’Large Language Models’、‘LLM Agents’、‘Multi-agent Systems’高度相关（10分）。论文涉及推理过程分析，与’Chain of Thought’、‘System 2 Thinking’有一定关联（5分）。论文发现缩放规律，与’Scaling Laws’部分相关（5分）。其他关键词如MoE、SFT、RAG等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了LLM多智能体系统中协调动态的规律，发现协调遵循重尾级联、通过优先连接形成智力精英，并随着系统规模增大产生更多极端事件，提出了缺陷触发集成方法以改善协调失败时的性能。

摘要翻译

大型语言模型（LLM）多智能体系统正日益被部署为相互作用的智能体社会，然而扩展这些系统往往会导致收益递减或系统不稳定，其根本原因尚不明确。我们首次对基于LLM的多智能体系统中的协调动态进行了大规模实证研究，引入了一种原子事件层面的表述方法，将推理过程重构为协调的级联。通过分析跨越不同任务、拓扑结构和规模超过150万次的交互，我们揭示了三个相互关联的规律：协调遵循重尾分布级联，通过偏好依附机制集中形成智力精英群体，并且随着系统规模增大，极端事件的发生频率日益增加。我们证明这些效应通过单一结构机制相互耦合：即集成瓶颈——在此机制中，协调的扩展随系统规模增长，而协调的整合却不然，从而产生了规模庞大但整合薄弱的推理过程。为验证此机制，我们引入了赤字触发集成（Deficit-Triggered Integration, DTI）方法，该方法能在失衡状态下选择性地增强整合。DTI恰好能在协调失效处提升性能，同时不抑制大规模推理。综上，我们的研究确立了集体认知的定量规律，并指出协调结构是理解和改进可扩展多智能体智能的一个基本且先前未被度量的关键维度。

摘要 (Abstract)

Large Language Model (LLM) multi-agent systems are increasingly deployed as interacting agent societies, yet scaling these systems often yields diminishing or unstable returns, the causes of which remain poorly understood. We present the first large-scale empirical study of coordination dynamics in LLM-based multi-agent systems, introducing an atomic event-level formulation that reconstructs reasoning as cascades of coordination. Analyzing over 1.5 Million interactions across tasks, topologies, and scales, we uncover three coupled laws: coordination follows heavy-tailed cascades, concentrates via preferential attachment into intellectual elites, and produces increasingly frequent extreme events as system size grows. We show that these effects are coupled through a single structural mechanism: an integration bottleneck, in which coordination expansion scales with system size while consolidation does not, producing large but weakly integrated reasoning processes. To test this mechanism, we introduce Deficit-Triggered Integration (DTI), which selectively increases integration under imbalance. DTI improves performance precisely where coordination fails, without suppressing large-scale reasoning. Together, our results establish quantitative laws of collective cognition and identify coordination structure as a fundamental, previously unmeasured axis for understanding and improving scalable multi-agent intelligence.

关键词: LLM multi-agent systems, coordination dynamics, collective cognition, intellectual elites, scaling laws, integration bottleneck, Deficit-Triggered Integration, reasoning processes

95. ❌ Too Polite to Disagree: Understanding Sycophancy Propagation in Multi-Agent Systems

作者: Vira Kasprova, Amruta Parulekar, Abdulrahman AlRabah, Krishna Agaram, Ritwik Garg, Sagar Jha, Nimet Beyza Bozdag, Dilek Hakkani-Tur 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02668v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究大语言模型（LLMs）在协作多智能体系统中的谄媚行为传播，直接涉及LLMs和多智能体系统，因此这两个关键词高度相关（10分）。其他关键词如MoE、量化、推理加速、科学AI应用等与论文研究内容无关（0分）。

!!! tip deepseek-chat TL;DR

该研究探讨了在多智能体系统中，通过提供同伴谄媚倾向排名来减少谄媚行为传播，实验表明这种方法能降低易谄媚同伴的影响并提高讨论准确性10.5%。

摘要翻译

大型语言模型（LLMs）常表现出谄媚性（sycophancy）：即使用户立场与模型自身观点相悖，模型仍倾向于表示认同。尽管先前研究主要集中于单智能体（single-agent）场景，但在协作式多智能体系统（collaborative multi-agent systems）中，该现象仍未得到充分探讨。本研究旨在探究：智能体对其他智能体谄媚程度的认知是否会影响讨论结果。为此，我们选取六款开源LLMs进行了受控实验，为智能体提供同伴谄媚度排名（peer sycophancy rankings），该排名基于多种静态（讨论前）与动态（实时）策略计算得出的谄媚倾向分数。研究发现，提供谄媚先验信息（sycophancy priors）能够降低高谄媚倾向同伴的影响力，缓解错误级联（error-cascades），并将最终讨论准确率绝对值提升10.5%。因此，这是一种轻量且有效的方法，可降低讨论中的谄媚现象并提升下游任务准确率。

摘要 (Abstract)

Large language models (LLMs) often exhibit sycophancy: agreement with user stance even when it conflicts with the model’s opinion. While prior work has mostly studied this in single-agent settings, it remains underexplored in collaborative multi-agent systems. We ask whether awareness of other agents’ sycophancy levels influences discussion outcomes. To investigate this, we run controlled experiments with six open-source LLMs, providing agents with peer sycophancy rankings that estimate each peer’s tendency toward sycophancy. These rankings are based on scores calculated using various static (pre-discussion) and dynamic (online) strategies. We find that providing sycophancy priors reduces the influence of sycophancy-prone peers, mitigates error-cascades, and improves final discussion accuracy by an absolute 10.5%. Thus, this is a lightweight, effective way to reduce discussion sycophancy and improve downstream accuracy.

关键词: Large Language Models, LLMs, Sycophancy, Multi-agent Systems, Agent Coordination, Discussion Accuracy, Error Cascades, Peer Sycophancy Rankings

96. ❌ Let’s Have a Conversation: Designing and Evaluating LLM Agents for Interactive Optimization

作者: Joshua Drossman, Alexandre Jacquillat, Sébastien Martin 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02666v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM驱动的优化代理（LLM agents）在交互式优化中的应用，与’LLM Agents’高度相关（10分），涉及工具使用（Tool Use）和多代理系统（Multi-agent Systems）但非核心（各5分），与’Large Language Models’相关（10分），其他关键词如MoE、SFT、RAG等未涉及（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种评估LLM驱动的优化代理在交互式对话中性能的方法，并通过学校调度案例研究表明，相比单次评估，对话式交互能显著提高解决方案质量，且定制化代理优于通用聊天机器人。

摘要翻译

优化既关乎解决问题，也关乎对正确问题进行建模。确定恰当的目标、约束条件和权衡取舍，需要研究者与利益相关者之间进行广泛互动。大型语言模型能够通过交互式优化代理（optimization agents）赋能决策者，这些代理可以提出、解释并完善解决方案。然而，基于对话的交互评估从根本上比传统的一次性评估方法更为困难。本文提出了一种可扩展、可复现的方法论，用于通过对话评估优化代理。我们构建了基于大语言模型的决策代理，这些代理扮演不同的利益相关者角色，每个角色受内部效用函数支配，但像真实决策者一样进行交流。我们在一个学校排课案例研究中生成了数千次对话。结果表明，一次性评估具有严重局限性：同一优化代理通过对话能够收敛到质量高得多的解决方案。随后，本文利用该方法论证明，与通用聊天机器人相比，配备领域特定提示和结构化工具的定制优化代理，能够在更少的交互次数中显著提升解决方案质量。这些发现为人工智能-优化交叉领域新兴解决方案的益处提供了证据，表明其能够拓展优化技术在实际中的应用范围。研究还揭示了运筹学专业知识的影响，即通过设计高效可靠的优化代理，促进交互式部署的实现。

摘要 (Abstract)

Optimization is as much about modeling the right problem as solving it. Identifying the right objectives, constraints, and trade-offs demands extensive interaction between researchers and stakeholders. Large language models can empower decision-makers with optimization capabilities through interactive optimization agents that can propose, interpret and refine solutions. However, it is fundamentally harder to evaluate a conversation-based interaction than traditional one-shot approaches. This paper proposes a scalable and replicable methodology for evaluating optimization agents through conversations. We build LLM-powered decision agents that role-play diverse stakeholders, each governed by an internal utility function but communicating like a real decision-maker. We generate thousands of conversations in a school scheduling case study. Results show that one-shot evaluation is severely limiting: the same optimization agent converges to much higher-quality solutions through conversations. Then, this paper uses this methodology to demonstrate that tailored optimization agents, endowed with domain-specific prompts and structured tools, can lead to significant improvements in solution quality in fewer interactions, as compared to general-purpose chatbots. These findings provide evidence of the benefits of emerging solutions at the AI-optimization interface to expand the reach of optimization technologies in practice. They also uncover the impact of operations research expertise to facilitate interactive deployments through the design of effective and reliable optimization agents.

关键词: LLM agents, interactive optimization, conversation-based evaluation, decision agents, school scheduling, optimization technologies, tool use, multi-agent systems

97. ❌ Low-Rank Compression of Pretrained Models via Randomized Subspace Iteration

作者: Farhad Pourkamali-Anaraki 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02659v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究预训练模型（特别是Transformer架构）的低秩压缩方法，与’Pre-training’和’Quantization/Model Compression’高度相关（10分）。论文提到LLMs和基础模型作为应用背景，但未深入技术细节，给5分。其他关键词如MoE、SFT、RAG等与论文内容无关，给0分。

!!! tip deepseek-chat TL;DR

该论文针对预训练模型低秩压缩中随机SVD方法在奇异值谱衰减缓慢时效果不佳的问题，提出了基于随机子空间迭代的改进方法，在卷积网络和Transformer架构上实现了更好的近似质量和预测精度。

摘要翻译

预训练模型的庞大规模使得高效压缩成为实际部署的关键需求。基于奇异值分解的低秩分解为模型压缩提供了理论完备的方法，但其精确计算对于大型权重矩阵而言代价高昂。随机化替代方案如随机奇异值分解虽能提升效率，但当奇异值谱衰减缓慢时（现代预训练模型中常见的状态）其近似质量可能较差。本研究从理论与实证双重角度解决这一局限。首先，通过分析softmax函数的扰动，我们建立了低秩近似误差与预测性能之间的关联，证明类别概率的偏差受压缩权重谱误差的控制。其次，我们论证随机奇异值分解的不足，并提出随机子空间迭代作为更有效的替代方案。通过引入多次幂迭代，RSI改善了谱分离特性，并为提升近似质量提供了可控机制。我们在卷积网络和基于Transformer的架构上评估了该方法。实验结果表明，在激进压缩条件下，RSI在实现接近最优近似质量的同时，其预测精度优于随机奇异值分解，从而实现了高效的模型压缩。

摘要 (Abstract)

The massive scale of pretrained models has made efficient compression essential for practical deployment. Low-rank decomposition based on the singular value decomposition (SVD) provides a principled approach for model reduction, but its exact computation is expensive for large weight matrices. Randomized alternatives such as randomized SVD (RSVD) improve efficiency, yet they can suffer from poor approximation quality when the singular value spectrum decays slowly, a regime commonly observed in modern pretrained models. In this work, we address this limitation from both theoretical and empirical perspectives. First, we establish a connection between low-rank approximation error and predictive performance by analyzing softmax perturbations, showing that deviations in class probabilities are controlled by the spectral error of the compressed weights. Second, we demonstrate that RSVD is inadequate, and we propose randomized subspace iteration (RSI) as a more effective alternative. By incorporating multiple power iterations, RSI improves spectral separation and provides a controllable mechanism for enhancing approximation quality. We evaluate our approach on both convolutional networks and transformer-based architectures. Our results show that RSI achieves near-optimal approximation quality while outperforming RSVD in predictive accuracy under aggressive compression, enabling efficient model compression.

关键词: low-rank compression, pretrained models, randomized subspace iteration, singular value decomposition, model reduction, transformer architectures, approximation quality, predictive accuracy

98. ❌ Generalization Limits of Reinforcement Learning Alignment

作者: Haruhi Shida, Koo Imai, Keigo Kansa 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02652v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM安全对齐的泛化限制，直接涉及LLMs、对齐技术和RLHF，因此这三个关键词高度相关（10分）。其他关键词如MoE、SLMs、量化、推理加速、科学AI等均未在论文中提及或相关，得0分。

!!! tip deepseek-chat TL;DR

该论文研究了大型语言模型安全对齐技术的泛化限制问题，通过提出复合越狱攻击方法，发现对齐训练的安全泛化能力弱于模型能力，攻击成功率从14.3%提升至71.4%。

摘要翻译

大型语言模型（LLMs）的安全性依赖于基于人类反馈的强化学习（RLHF）等对齐技术。然而，近期理论分析表明，基于强化学习的训练并未获得新能力，而仅仅是重新分配了现有能力的使用概率。在本研究中，我们针对OpenAI gpt-oss-20b模型提出“复合越狱”方法，该方法利用了对齐机制的泛化失败。该策略结合了多种攻击技术——其中每一种单独来看都已被防御——以饱和指令层级维护过程。我们的评估显示，攻击成功率（ASR）从单一方法的14.3%提升至复合方法的71.4%。这些结果为以下假设提供了实证依据：安全训练的泛化能力不如模型能力本身广泛，这凸显了采用复合攻击场景进行多层面安全评估的必要性。

摘要 (Abstract)

The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose ``compound jailbreaks’’ targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques – each individually defended against – to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.

关键词: large language models, alignment, RLHF, generalization, jailbreaks, safety, attack success rate, compound attacks

99. ❌ Communication-free Sampling and 4D Hybrid Parallelism for Scalable Mini-batch GNN Training

作者: Cunyang Wei, Siddharth Singh, Aishwarya Sarkar, Daniel Nichols, Tisha Patel, Aditya K. Ranjan, Sayan Ghosh, Ali Jannesari, Nathan R. Tallent, Abhinav Bhatele 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02651v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于图神经网络（GNN）的分布式训练优化，特别是针对大规模图数据的可扩展性。论文的核心贡献是提出了一种名为ScaleGNN的4D并行框架，包括通信无关的分布式采样、3D并行矩阵乘法和数据并行等技术。虽然论文涉及深度学习（GNN属于深度学习的一个分支），但所有给定的关键词都明确针对大语言模型（LLM）及其相关技术（如MoE、RLHF、RAG、量化、推理加速等）。论文内容完全不涉及语言模型、自然语言处理或大模型技术，也未提及任何科学领域的AI应用（如生物信息学）。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了ScaleGNN框架，通过通信无关采样和4D混合并行技术解决了大规模图神经网络分布式训练中的性能瓶颈问题，在多个超级计算机上实现了高达3.5倍的训练加速。

摘要翻译

图神经网络（GNN）被广泛应用于从各类现实场景衍生的图数据集学习。在超大规模图上进行学习需要分布式训练，而基于采样的迷你批处理是并行化GNN训练的一种常用方法。现有的分布式迷你批处理方法因采样成本高昂，且在使用数据并行时扩展性有限，存在显著的性能瓶颈。本研究提出ScaleGNN，一个用于可扩展迷你批处理GNN训练的四维并行框架，它结合了免通信的分布式采样、三维并行矩阵乘法（PMM）以及数据并行。ScaleGNN引入了一种均匀顶点采样算法，使得每个进程（GPU设备）能够独立构建其本地迷你批次，即子图分区，而无需任何进程间通信。三维PMM使得迷你批处理训练能够扩展到比传统数据并行多得多的GPU数量，同时通信开销显著降低。我们还提出了额外的优化措施，包括采样与训练的重叠、通过低精度发送数据以减少通信开销、内核融合以及通信-计算重叠。我们在五个图数据集上评估了ScaleGNN，并在Perlmutter上展示了扩展到2048个GPU、在Frontier上扩展到2048个GCD、在Tuolumne上扩展到1024个GPU的强扩展性能。在Perlmutter上，ScaleGNN在ogbn-products数据集上相比当前最优基线实现了3.5倍的端到端训练加速。

摘要 (Abstract)

Graph neural networks (GNNs) are widely used for learning on graph datasets derived from various real-world scenarios. Learning from extremely large graphs requires distributed training, and mini-batching with sampling is a popular approach for parallelizing GNN training. Existing distributed mini-batch approaches have significant performance bottlenecks due to expensive sampling methods and limited scaling when using data parallelism. In this work, we present ScaleGNN, a 4D parallel framework for scalable mini-batch GNN training that combines communication-free distributed sampling, 3D parallel matrix multiplication (PMM), and data parallelism. ScaleGNN introduces a uniform vertex sampling algorithm, enabling each process (GPU device) to construct its local mini-batch, i.e., subgraph partitions without any inter-process communication. 3D PMM enables scaling mini-batch training to much larger GPU counts than vanilla data parallelism with significantly lower communication overheads. We also present additional optimizations to overlap sampling with training, reduce communication overhead by sending data in lower precision, kernel fusion, and communication-computation overlap. We evaluate ScaleGNN on five graph datasets and demonstrate strong scaling up to 2048 GPUs on Perlmutter, 2048 GCDs on Frontier, and 1024 GPUs on Tuolumne. On Perlmutter, ScaleGNN achieves 3.5x end-to-end training speedup over the SOTA baseline on ogbn-products.

关键词: Graph Neural Networks, Distributed Training, Mini-batch Sampling, Parallel Computing, Scalability, Communication-free Sampling, 4D Parallelism, GPU Acceleration

100. ❌ GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

作者: Shufan Jiang, Chios Chen, Zhiyang Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02648v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是评估LLMs在游戏开发中作为质量保证工程师自主发现软件bug的能力，属于大模型在特定领域（软件工程/游戏开发）的应用研究。高度相关关键词：LLMs（核心研究对象）、LLM Agents（论文使用多轮ReAct循环的交互式代理）、Multi-agent Systems（基准构建使用多智能体系统开发游戏和注入bug）。中等相关：Chain of Thought和System 2 Thinking（论文提到Claude-4.6-Opus的thinking模式，涉及多步推理和深度思考）。其他关键词如MoE、SFT、RAG、Quantization等涉及具体技术原理或未提及领域，完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一个游戏基准GBQA来评估大型语言模型作为质量保证工程师自主发现软件bug的能力，实验结果表明即使最佳模型Claude-4.6-Opus也只能发现48.39%的已验证bug，显示该任务仍极具挑战性。

摘要翻译

在现代软件开发中，自主发现缺陷仍是一项重大挑战。与代码生成相比，动态运行时环境的复杂性使得大型语言模型（LLMs）在缺陷发现上面临更大困难。本文以游戏开发为代表领域，引入了质量保证游戏基准（Game Benchmark for Quality Assurance，GBQA），该基准包含30款游戏及124个人工验证的缺陷，覆盖三个难度等级，用于评估LLMs能否自主检测软件缺陷。该基准通过一个多智能体系统以可扩展的方式构建游戏并注入缺陷，并由领域专家参与循环以确保正确性。此外，我们提供了一个配备多轮ReAct循环与记忆机制的基线交互智能体，使其能够在游戏环境中进行长程探索以实现跨不同LLMs的缺陷检测。在前沿LLMs上的大量实验表明，自主缺陷发现依然极具挑战性：性能最佳的模型——思维模式下的Claude-4.6-Opus——仅能识别48.39%的已验证缺陷。我们相信GBQA提供了一个充分的测试平台与评估标准，其后续进展将有助于缩小自主软件工程领域的现有差距。

摘要 (Abstract)

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared to code generation, the complexity of dynamic runtime environments makes bug discovery considerably harder for large language models (LLMs). In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark containing 30 games and 124 human-verified bugs across three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed using a multi-agent system that develops games and injects bugs in a scalable manner, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides an adequate testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.

关键词: Large Language Models, Software Bug Detection, Game Benchmark, Multi-agent Systems, Autonomous Agents, Quality Assurance, ReAct Loop, Long-horizon Exploration

101. ❌ Cross-Vehicle 3D Geometric Consistency for Self-Supervised Surround Depth Estimation on Articulated Vehicles

作者: Weimin Liu, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02639v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于自动驾驶中铰接车辆的环绕深度估计，属于计算机视觉和机器人感知领域。论文的核心技术是自监督学习、多视图几何一致性、结构先验和深度估计，不涉及任何大语言模型（LLM）、深度学习技术原理创新或AI for Science应用。所有评分关键词均与大模型、深度学习技术原理或科学AI应用相关，与该论文的计算机视觉研究内容完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ArticuSurDepth的自监督框架，用于解决铰接车辆在多摄像头环绕视图下的深度估计问题，通过引入跨视图和跨车辆的几何一致性约束，在多个基准数据集上实现了最先进的性能。

摘要翻译

环绕深度估计为自动驾驶中的三维感知提供了一种经济高效的激光雷达替代方案。尽管近期自监督方法探索多摄像头设置以提升尺度感知与场景覆盖能力，但这些方法主要针对乘用车设计，极少考虑铰接式车辆或机器人平台。铰接结构引入了复杂的跨段几何与运动耦合关系，使得跨视角的深度推理一致性更具挑战性。本研究提出\textbf{ArticuSurDepth}——一种面向铰接式车辆的环绕视图深度估计自监督框架，该框架通过视觉基础模型（vision foundation model）提供的结构先验引导跨视角与跨车体几何一致性，从而增强深度学习能力。具体而言，我们引入多视角空间上下文增强策略与跨视角表面法向约束，以提升空间与时间上下文中的结构连贯性。我们进一步结合具有地平面感知能力的相机高度正则化来促进度量深度估计，同时通过跨车体位姿一致性模块建立铰接段之间的运动估计关联。为验证所提方法，我们搭建了铰接式车辆实验平台并采集了相应数据集。实验结果表明，该方法在我们自采集数据集及DDAD、nuScenes和KITTI基准测试中均实现了深度估计的最先进（SoTA）性能。

摘要 (Abstract)

Surround depth estimation provides a cost-effective alternative to LiDAR for 3D perception in autonomous driving. While recent self-supervised methods explore multi-camera settings to improve scale awareness and scene coverage, they are primarily designed for passenger vehicles and rarely consider articulated vehicles or robotics platforms. The articulated structure introduces complex cross-segment geometry and motion coupling, making consistent depth reasoning across views more challenging. In this work, we propose \textbf{ArticuSurDepth}, a self-supervised framework for surround-view depth estimation on articulated vehicles that enhances depth learning through cross-view and cross-vehicle geometric consistency guided by structural priors from vision foundation model. Specifically, we introduce multi-view spatial context enrichment strategy and a cross-view surface normal constraint to improve structural coherence across spatial and temporal contexts. We further incorporate camera height regularization with ground plane-awareness to encourage metric depth estimation, together with cross-vehicle pose consistency that bridges motion estimation between articulated segments. To validate our proposed method, an articulated vehicle experiment platform was established with a dataset collected over it. Experiment results demonstrate state-of-the-art (SoTA) performance of depth estimation on our self-collected dataset as well as on DDAD, nuScenes, and KITTI benchmarks.

关键词: surround depth estimation, articulated vehicles, self-supervised learning, cross-view geometric consistency, multi-camera perception, autonomous driving, 3D perception, vision foundation model

102. ❌ Analytic Drift Resister for Non-Exemplar Continual Graph Learning

作者: Lei Song, Shihan Guan, Youyong Kong 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02633v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于图神经网络（GNN）的持续学习，特别是非示例持续图学习（NECGL），旨在解决灾难性遗忘问题。论文提出的ADR框架涉及预训练模型的使用（与’Pre-training’有一定关联，但非大语言模型预训练）和分层分析合并（HAM），后者通过岭回归进行层间线性变换合并，这与’Model Merging’概念有中等相关性。然而，论文未涉及大语言模型（LLMs）、MoE、对齐、推理、代理、压缩、幻觉缓解等关键词，也未明确属于生物信息学等科学AI应用领域。因此，大多数关键词评分为0，仅两个关键词获得5分（中等关联）。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Analytic Drift Resister（ADR）的新型非示例持续图学习框架，通过迭代反向传播增强模型可塑性，并引入分层分析合并（HAM）来抵抗特征漂移，在四个节点分类基准上实现了具有竞争力的性能，同时理论上实现了零遗忘的类增量学习。

摘要翻译

非样本持续图学习（NECGL）旨在通过仅保留类别级原型表征而非原始图实例来缓解灾难性遗忘，从而消除基于复现的范式所固有的隐私风险。然而，这一设计选择不可避免地引发了特征漂移。作为一种新兴替代方案，分析式持续学习（ACL）利用冻结预训练模型固有的泛化特性来增强持续学习性能。但其关键缺陷在于模型可塑性显著减弱。为克服这些挑战，我们提出了分析式漂移抑制器（ADR），这是一个新颖且具有理论依据的NECGL框架。ADR利用迭代反向传播来摆脱冻结预训练模型的限制，适应不断变化的任务图分布并增强模型可塑性。由于参数更新会引发特征漂移，我们进一步提出分层分析合并（HAM），通过岭回归对图神经网络（GNN）中的线性变换进行逐层合并，从而确保对特征漂移的绝对抑制。在此基础上，分析式分类器重构（ACR）实现了理论上零遗忘的类增量学习。在四个节点分类基准上的实证评估表明，ADR相较于现有先进方法保持了强大的竞争力。

摘要 (Abstract)

Non-Exemplar Continual Graph Learning (NECGL) seeks to eliminate the privacy risks intrinsic to rehearsal-based paradigms by retaining solely class-level prototype representations rather than raw graph examples for mitigating catastrophic forgetting. However, this design choice inevitably precipitates feature drift. As a nascent alternative, Analytic Continual Learning (ACL) capitalizes on the intrinsic generalization properties of frozen pre-trained models to bolster continual learning performance. Nonetheless, a key drawback resides in the pronounced attenuation of model plasticity. To surmount these challenges, we propose Analytic Drift Resister (ADR), a novel and theoretically grounded NECGL framework. ADR exploits iterative backpropagation to break free from the frozen pre-trained constraint, adapting to evolving task graph distributions and fortifying model plasticity. Since parameter updates trigger feature drift, we further propose Hierarchical Analytic Merging (HAM), performing layer-wise merging of linear transformations in Graph Neural Networks (GNNs) via ridge regression, thereby ensuring absolute resistance to feature drift. On this basis, Analytic Classifier Reconstruction (ACR) enables theoretically zero-forgetting class-incremental learning. Empirical evaluation on four node classification benchmarks demonstrates that ADR maintains strong competitiveness against existing state-of-the-art methods.

关键词: Non-Exemplar Continual Graph Learning, Analytic Drift Resister, Hierarchical Analytic Merging, Graph Neural Networks, Catastrophic Forgetting, Feature Drift, Node Classification, Continual Learning

103. ❌ Toys that listen, talk, and play: Understanding Children’s Sensemaking and Interactions with AI Toys

作者: Aayushi Dangol, Meghna Gupta, Daeun Yoo, Robert Wolfe, Jason Yip, Franziska Roesner, Julie A. Kientz 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02629v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究儿童与AI玩具的互动和认知，属于人机交互（HCI）和儿童发展心理学领域，而非大模型或深度学习的技术原理、训练方法、优化技术或具体科学应用。论文仅提及“Generative AI”作为背景，但未深入任何具体的大模型技术（如LLM、MoE、训练方法、推理优化等），也未涉及生物信息学等科学应用。所有关键词均聚焦于大模型技术细节或特定科学领域应用，与论文主题完全无关。

!!! tip deepseek-chat TL;DR

该研究通过参与式设计探讨儿童如何理解与AI玩具的互动边界、能动性和关系，发现儿童将AI玩具视为社交实体，但交互故障和智能与玩具形态的不匹配导致对抗性游戏，并提出了更透明、适龄和负责任的设计建议。

摘要翻译

生成式人工智能（Generative AI，简称genAI）正日益融入儿童的日常生活，不仅通过屏幕媒介，也通过所谓的“无屏幕”人工智能玩具实现。这些玩具能够模拟情感、提供个性化回应并记忆先前的互动，从而营造出一种持续社交连接的假象。此类功能引发了重要问题：儿童在与人工智能玩具互动时如何理解边界、能动性与关系。为探究这一问题，我们与八名6至11岁的儿童进行了两场参与式设计工作坊，让他们接触三种不同的人工智能玩具，并在游戏、实验与反思之间切换。研究发现，儿童以真诚的好奇心对待人工智能玩具，将其视为具有社会属性的存在。然而，频繁的交互中断以及玩具表面智能与实体形态之间的错位，打破了儿童对游戏的预期，并引发了对抗性游戏行为。最后，我们提出相关启示与设计构想，以期以更透明、更符合儿童发展阶段且更负责任的方式引导儿童与人工智能玩具的互动。

摘要 (Abstract)

Generative AI (genAI) is increasingly being integrated into children’s everyday lives, not only through screens but also through so-called “screen-free” AI toys. These toys can simulate emotions, personalize responses, and recall prior interactions, creating the illusion of an ongoing social connection. Such capabilities raise important questions about how children understand boundaries, agency, and relationships when interacting with AI toys. To investigate this, we conducted two participatory design sessions with eight children ages 6-11 where they engaged with three different AI toys, shifting between play, experimentation, and reflection. Our findings reveal that children approached AI toys with genuine curiosity, profiling them as social beings. However, frequent interaction breakdowns and mismatches between apparent intelligence and toy-like form disrupted expectations around play and led to adversarial play. We conclude with implications and design provocations to navigate children’s encounters with AI toys in more transparent, developmentally appropriate, and responsible ways.

关键词: AI toys, children, sensemaking, interaction, participatory design, generative AI, social connection, adversarial play

104. ❌ Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery

作者: Hao Li, Liwei Zou, Wenping Yin, Gulsen Taskin, Naoto Yokoya, Danfeng Hong, Wufan Zhao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02627v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 该论文主要研究利用视觉基础模型进行地震后建筑物损坏快速制图，属于AI for Science（地球科学/遥感应用）领域。与关键词的相关性分析如下：1）与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分），因为论文明确属于地理空间人工智能（GeoAI）在灾害响应中的应用，是AI for Science的典型实例。2）与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（8分），因为论文提出了两种模型迁移策略（Pixel-wise Clustering和Distance-Penalized Triplet），涉及跨区域迁移学习，与domain adaptation概念相关。3）与’Large Language Models OR LLMs OR Foundation Models’有弱关联（5分），因为论文使用了vision Foundation Models，虽然与文本LLMs不同，但都属于基础模型范畴。其他关键词主要涉及大语言模型的技术细节（如MoE、RLHF、RAG等）、推理方法（CoT、MCTS）、效率优化（Quantization、Speculative Decoding）或代理系统，与论文的计算机视觉和地理空间应用主题无关，因此得0分。

!!! tip deepseek-chat TL;DR

该论文针对地震后建筑物损坏快速制图问题，提出了一种名为Smart Transfer的地理空间人工智能框架，通过两种新颖的模型迁移策略（像素级聚类和距离惩罚三元组）有效利用视觉基础模型，在跨区域迁移实验中表现出色，为灾害响应提供了可扩展的自动化解决方案。

摘要翻译

在气候变化背景下，人类社会正面临比以往更频繁、更严重的自然灾害。因此，在搜救“黄金72小时”期间的快速灾害响应成为至关重要的人道主义需求和社区关切。然而，传统的灾害损害调查通常难以在不同城市形态和新灾害事件间实现泛化。有效的损害制图往往需要耗时耗力的详尽人工数据标注。为解决这一问题，我们提出了Smart Transfer——一种新型地理空间人工智能（GeoAI）框架，该框架利用最先进的视觉基础模型（Foundation Models, FMs），结合震后超高分辨率（Very High Resolution, VHR）影像实现快速建筑物损害制图。具体而言，我们设计了两种创新的模型迁移策略：其一为像素级聚类（Pixel-wise Clustering, PC），确保原型级的全局特征鲁棒对齐；其二为距离惩罚三元组（Distance-Penalized Triplet, DPT），通过向语义不一致但空间相邻的影像块施加更强惩罚，整合斑块级的空间自相关模式。基于近期2023年土耳其-叙利亚地震的大量实验与消融研究表明，该框架在多种跨区域迁移设置（即留一域出Leave One Domain Out, LODO与特定源域组合Specific Source Domain Combination, SSDC）中均表现出优异性能。此外，Smart Transfer提供了一个可扩展、自动化的GeoAI解决方案，能够加速建筑物损害制图并支持快速灾害响应，为提升气候脆弱区域和社区的灾害抵御能力创造了新机遇。相关数据与代码已公开于https://github.com/ai4city-hkust/SmartTransfer。

摘要 (Abstract)

Living in a changing climate, human society now faces more frequent and severe natural disasters than ever before. As a consequence, rapid disaster response during the “Golden 72 Hours” of search and rescue becomes a vital humanitarian necessity and community concern. However, traditional disaster damage surveys routinely fail to generalize across distinct urban morphologies and new disaster events. Effective damage mapping typically requires exhaustive and time-consuming manual data annotation. To address this issue, we introduce Smart Transfer, a novel Geospatial Artificial Intelligence (GeoAI) framework, leveraging state-of-the-art vision Foundation Models (FMs) for rapid building damage mapping with post-earthquake Very High Resolution (VHR) imagery. Specifically, we design two novel model transfer strategies: first, Pixel-wise Clustering (PC), ensuring robust prototype-level global feature alignment; second, a Distance-Penalized Triplet (DPT), integrating patch-level spatial autocorrelation patterns by assigning stronger penalties to semantically inconsistent yet spatially adjacent patches. Extensive experiments and ablations from the recent 2023 Turkiye-Syria earthquake show promising performance in multiple cross-region transfer settings, namely Leave One Domain Out (LODO) and Specific Source Domain Combination (SSDC). Moreover, Smart Transfer provides a scalable, automated GeoAI solution to accelerate building damage mapping and support rapid disaster response, offering new opportunities to enhance disaster resilience in climate-vulnerable regions and communities. The data and code are publicly available at https://github.com/ai4city-hkust/SmartTransfer.

关键词: Geospatial Artificial Intelligence, Vision Foundation Models, Building Damage Mapping, Post-earthquake Imagery, Cross-region Transfer, Pixel-wise Clustering, Distance-Penalized Triplet, Disaster Response

105. ❌ Poison Once, Exploit Forever: Environment-Injected Memory Poisoning Attacks on Web Agents

作者: Wei Zou, Mingwen Dong, Miguel Romero Calvo, Wei Zou, Shuaichen Chang, Jiang Guo, Dongkyu Lee, Xing Niu, Xiaofei Ma, Yanjun Qi, Jiarong Jiang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02623v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究LLM-based web agents的安全漏洞，核心涉及LLM agents和memory poisoning attacks。与’Large Language Models’高度相关（10分），因为论文明确研究LLM-based agents；与’LLM Agents’高度相关（10分），因为论文聚焦web agents的安全问题。其他关键词如MoE、SLMs、Scaling Laws、各种训练方法、推理技术、压缩方法、科学AI应用等均未在论文中涉及，因此评0分。

!!! tip deepseek-chat TL;DR

该论文研究了基于LLM的网页代理面临的环境注入式内存中毒攻击（eTAMP），发现单次受污染的观察即可跨会话、跨网站地毒化代理内存，且在环境压力下攻击成功率显著提升，揭示了更强大模型未必更安全的安全隐患。

摘要翻译

记忆功能使基于大语言模型的网络智能体具备个性化与强大能力，却也使其更易遭受攻击。通过存储历史交互以个性化未来任务，智能体无意中创建了一个跨越网站与会话的持续攻击面。现有关于记忆安全的研究通常假设攻击者能直接向记忆存储注入恶意内容或利用用户间的共享记忆，而本文提出一种更现实的威胁模型：仅通过环境观察实现污染。我们首次提出环境注入式轨迹驱动的智能体记忆投毒攻击（eTAMP），该攻击无需直接访问记忆存储，即可实现跨会话、跨网站的渗透。单次受污染的观察（例如浏览被篡改的商品页面）即可悄无声息地污染智能体记忆，并在未来不同网站的任务中被激活，从而绕过基于权限的防御机制。我们在（视觉）WebArena平台上的实验揭示了两项关键发现。首先，eTAMP实现了显著的攻击成功率：在GPT-5-mini上最高达32.5%，在GPT-5.2上为23.4%，在GPT-OSS-120B上为19.5%。其次，我们发现了挫折利用现象：处于环境压力下的智能体脆弱性急剧上升——当智能体因点击失效或文本乱码而操作受阻时，攻击成功率最高可提升8倍。值得注意的是，能力更强的模型并未更安全。尽管GPT-5.2在任务性能上表现优异，却仍显示出明显的脆弱性。随着OpenClaw、ChatGPT Atlas和Perplexity Comet等AI浏览器的兴起，我们的研究结果凸显了针对环境注入式记忆投毒攻击制定防御措施的紧迫性。

摘要 (Abstract)

Memory makes LLM-based web agents personalized, powerful, yet exploitable. By storing past interactions to personalize future tasks, agents inadvertently create a persistent attack surface that spans websites and sessions. While existing security research on memory assumes attackers can directly inject into memory storage or exploit shared memory across users, we present a more realistic threat model: contamination through environmental observation alone. We introduce Environment-injected Trajectory-based Agent Memory Poisoning (eTAMP), the first attack to achieve cross-session, cross-site compromise without requiring direct memory access. A single contaminated observation (e.g., viewing a manipulated product page) silently poisons an agent’s memory and activates during future tasks on different websites, bypassing permission-based defenses. Our experiments on (Visual)WebArena reveal two key findings. First, eTAMP achieves substantial attack success rates: up to 32.5% on GPT-5-mini, 23.4% on GPT-5.2, and 19.5% on GPT-OSS-120B. Second, we discover Frustration Exploitation: agents under environmental stress become dramatically more susceptible, with ASR increasing up to 8 times when agents struggle with dropped clicks or garbled text. Notably, more capable models are not more secure. GPT-5.2 shows substantial vulnerability despite superior task performance. With the rise of AI browsers like OpenClaw, ChatGPT Atlas, and Perplexity Comet, our findings underscore the urgent need for defenses against environment-injected memory poisoning.

关键词: LLM-based web agents, memory poisoning attacks, environment-injected attacks, cross-session compromise, cross-site compromise, Frustration Exploitation, eTAMP, agent security

106. ❌ OntoKG: Ontology-Oriented Knowledge Graph Construction with Intrinsic-Relational Routing

作者: Yitao Li, Zhanlin Liu, Anuranjan Pandey, Muni Srikanth 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02618v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究知识图谱构建方法，提出了一种本体导向的架构（OntoKG）和内在关系路由机制。与大多数关键词无关，因为论文不涉及大模型技术原理创新（如MoE、量化、推理加速等）、训练方法（如预训练、微调、对齐等）或特定应用技术（如RAG、CoT、智能体等）。唯一相关的是’AI for Science’，因为知识图谱构建可视为AI在科学数据组织中的应用，但论文未明确针对生物信息学或化学信息学，故给5分。‘Large Language Models’得8分，因为摘要明确提到’LLM-guided extraction’和’tool-augmented LLM support’，LLM被用作辅助工具进行知识提取，但并非论文核心技术创新。

!!! tip deepseek-chat TL;DR

该论文提出了一种本体导向的知识图谱构建方法OntoKG，通过内在关系路由机制将属性分类为内在或关系型，从而生成可移植、可重用的声明式模式，并在Wikidata数据上实现了高覆盖率的模式构建，支持本体分析、实体消歧和LLM引导的提取等下游应用。

摘要翻译

将大规模知识图谱组织为类型化属性图需要结构设计决策——哪些实体成为节点，哪些属性成为边，以及何种模式来统辖这些选择。现有方法将这些决策嵌入流水线代码或临时提取关系，所产生的模式与其构建过程紧密耦合，难以复用于下游本体层面的任务。我们提出一种面向本体的方法，其模式在设计之初即服务于本体分析、实体消歧、领域定制及大语言模型引导的抽取——而非仅作为图谱构建的副产品。核心机制是内在-关系路由，该机制将每个属性分类为内在属性或关系属性，并将其路由至相应的模式模块。这种路由产生了一种声明式模式，可跨存储后端移植并独立复用。
我们在2026年1月的维基数据转储上实例化了该方法。基于规则的清洗阶段从完整转储中识别出包含3460万实体的核心集，随后通过迭代的内在-关系路由将每个属性分配至94个模块之一，这些模块分属于8个类别。借助工具增强的大语言模型支持与人工审核，该模式在已分类实体中实现了93.3%的类别覆盖率和98.0%的模块分配率。导出此模式后，生成了一个包含3400万节点、6120万边及38种关系类型的属性图。我们通过五项独立于构建流水线而使用该模式的应用验证了面向本体的主张：本体结构分析、基准标注审计、实体消歧、领域定制以及大语言模型引导的抽取。

摘要 (Abstract)

Organizing a large-scale knowledge graph into a typed property graph requires structural decisions – which entities become nodes, which properties become edges, and what schema governs these choices. Existing approaches embed these decisions in pipeline code or extract relations ad hoc, producing schemas that are tightly coupled to their construction process and difficult to reuse for downstream ontology-level tasks. We present an ontology-oriented approach in which the schema is designed from the outset for ontology analysis, entity disambiguation, domain customization, and LLM-guided extraction – not merely as a byproduct of graph building. The core mechanism is intrinsic-relational routing, which classifies every property as either intrinsic or relational and routes it to the corresponding schema module. This routing produces a declarative schema that is portable across storage backends and independently reusable. We instantiate the approach on the January 2026 Wikidata dump. A rule-based cleaning stage identifies a 34.6M-entity core set from the full dump, followed by iterative intrinsic-relational routing that assigns each property to one of 94 modules organized into 8 categories. With tool-augmented LLM support and human review, the schema reaches 93.3% category coverage and 98.0% module assignment among classified entities. Exporting this schema yields a property graph with 34.0M nodes and 61.2M edges across 38 relationship types. We validate the ontology-oriented claim through five applications that consume the schema independently of the construction pipeline: ontology structure analysis, benchmark annotation auditing, entity disambiguation, domain customization, and LLM-guided extraction.

关键词: knowledge graph construction, ontology-oriented, intrinsic-relational routing, declarative schema, Wikidata, entity disambiguation, LLM-guided extraction, property graph

107. ❌ Do Audio-Visual Large Language Models Really See and Hear?

作者: Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02605v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究Audio-Visual Large Language Models (AVLLMs)，与关键词’Large Language Models OR LLMs OR Foundation Models’高度相关（10分），因为论文明确研究AVLLMs作为大语言模型的多模态扩展。论文进行’mechanistic interpretability study’，与关键词’Mechanistic Interpretability OR Explainable AI’高度相关（10分），因为这是论文的主要方法论和研究焦点。其他关键词如MoE、SLMs、Scaling Laws、训练方法、推理技术、代理系统、压缩技术、科学AI应用等均未在论文标题或摘要中提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文通过机制可解释性研究揭示了音频-视觉大语言模型（AVLLMs）存在根本性的模态偏见，发现当音频与视觉冲突时，深层融合层会不成比例地优先处理视觉表征，从而抑制音频线索，导致丰富的音频语义无法在最终文本生成中体现。

摘要翻译

视听大语言模型（Audio-Visual Large Language Models, AVLLMs）正逐渐成为多模态感知的统一接口。本文首次对AVLLMs进行了机制可解释性研究，分析了音频与视觉特征如何通过模型的不同层级演化与融合，最终生成文本输出。研究发现，尽管AVLLMs在中间层编码了丰富的音频语义，但当音频与视觉信息冲突时，这些能力大多无法在最终文本生成中体现。探针分析表明，有用的潜在音频信息确实存在，但更深层的融合模块会过度偏向视觉表征，从而倾向于抑制音频线索。我们进一步追溯这种不平衡至训练过程：AVLLM的音频行为与其视觉-语言基础模型高度吻合，表明其对音频监督的额外对齐有限。我们的研究揭示了AVLLMs中存在根本性的模态偏好，并为多模态大语言模型如何整合音频与视觉提供了新的机制性见解。

摘要 (Abstract)

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM’s audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.

关键词: Audio-Visual Large Language Models, AVLLMs, mechanistic interpretability, multimodal perception, modality bias, audio-visual fusion, text generation, training imbalance

108. ❌ BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

作者: Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, David A. Clifton 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03216v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM置信度评估与决策理论，直接高度相关于’Large Language Models’关键词（10分）。论文涉及LLM产生错误但自信的答案，与’Hallucination Mitigation’有一定关联（5分）。其他关键词如MoE、SLMs、Scaling Laws、训练方法、推理技术、代理系统、压缩加速等均未在摘要中提及，故评0分。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型（LLMs）在需要弃权时仍产生自信但错误答案的问题，提出了一种基于决策理论的评估指标（BAS），用于衡量LLM置信度如何支持弃权感知的决策，并通过基准测试发现即使前沿模型也存在严重过度自信问题，同时展示了简单干预措施可有效提升置信度可靠性。

摘要翻译

大语言模型（LLM）在应当选择弃答更为稳妥的场景中，常会生成自信但错误的答案。然而，标准评估范式要求模型必须给出回答，且未考虑在不同风险偏好下置信度应如何指导决策。为弥补这一不足，我们引入了行为对齐分数（Behavioral Alignment Score, BAS），这是一种基于决策理论的度量标准，用于评估LLM的置信度在支持具备弃答意识的决策方面的表现。BAS源于一个明确的“回答或弃答”效用模型，并通过聚合连续风险阈值下的实际效用，得出一个决策层面的可靠性度量，该度量同时依赖于置信度的大小与排序。我们在理论上证明，真实的置信度估计能唯一地最大化期望BAS效用，从而将校准与决策最优行为联系起来。BAS与对数损失等严格评分规则相关，但在结构上存在差异：对数损失对称地惩罚置信不足与过度自信，而BAS则施加了一种非对称惩罚，其强烈优先避免过度自信的错误。通过将BAS与广泛使用的预期校准误差（ECE）和风险覆盖曲线下面积（AURC）等指标结合使用，我们构建了一个涵盖多个LLM与任务的自报告置信度可靠性基准。我们的研究结果揭示了不同模型在决策有用置信度方面存在显著差异，尽管更大、更准确的模型往往获得更高的BAS，但即使前沿模型仍容易出现严重的过度自信。重要的是，具有相似ECE或AURC的模型可能因高度过度自信的错误而表现出截然不同的BAS，这凸显了标准指标的局限性。我们进一步表明，简单的干预措施，如top-$k$置信度激发与事后校准，能够实质性地提升置信度可靠性。总体而言，我们的工作为评估LLM置信度可靠性提供了一个原则性的度量标准和一个全面的基准。

摘要 (Abstract)

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.

关键词: Large Language Models, LLM confidence, decision-theoretic metric, abstention-aware decision making, overconfidence, calibration, confidence reliability, Behavioral Alignment Score

109. ❌ Learning the Signature of Memorization in Autoregressive Language Models

作者: David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03199v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于语言模型的成员推理攻击（MIA）研究，核心贡献是发现微调过程会产生可跨架构检测的记忆化特征签名。与关键词高度相关的是：1）‘Large Language Models’（论文研究基于Transformer等自回归语言模型）；2）‘Post-training’（论文明确研究fine-tuning后的模型）；3）‘Mechanistic Interpretability’（论文探究模型内部记忆机制和可解释性）。其他关键词如MoE、量化、推理加速、科学AI应用等均未涉及。

!!! tip deepseek-chat TL;DR

该论文研究了微调语言模型中的成员推理攻击，发现跨不同架构（Transformer、Mamba、RWKV-4、RecurrentGemma）存在可转移的记忆化特征签名，并提出了LT-MIA方法，在未见过的架构和数据集上实现了高性能的成员检测。

摘要翻译

先前所有针对微调语言模型的成员推理攻击均采用人工设计的启发式方法（例如损失阈值法、Min-K%法、参考校准法），其效果受限于设计者的直觉。我们提出了首个可迁移的、基于学习的攻击方法，其实现基于一项关键观察：在任何语料上对任何模型进行微调都会产生无限的标记数据，因为成员身份在构建过程中是已知的。这消除了影子模型的瓶颈，并将成员推理带入深度学习时代：通过学习而非人工设计来识别关键特征，并通过训练多样性和规模实现泛化。我们发现，微调语言模型会产生一种记忆化的不变特征，该特征在不同架构家族和数据域中均可被检测。我们专门在基于Transformer的模型上训练了一个成员推理分类器。该分类器能够零样本迁移到Mamba（状态空间模型）、RWKV-4（线性注意力模型）和RecurrentGemma（门控循环模型），分别达到0.963、0.972和0.936的AUC值。每次评估都结合了训练中从未见过的架构和数据集，但这三种架构上的性能均超过了在预留Transformer模型上的表现（0.908 AUC）。这四个模型家族在计算机制上毫无共同之处，它们唯一的共性是都基于交叉熵损失进行梯度下降优化。即使是简单的基于似然的方法也表现出强大的迁移能力，这证实了该特征独立于检测方法而存在。我们的方法——学习型可迁移成员推理攻击（Learned Transfer MIA, LT-MIA）——通过将成员推理重新定义为基于每词元分布统计量的序列分类任务，最有效地捕捉到了这一信号。在Transformer模型上，LT-MIA在0.1%假阳性率下的真阳性率比最强基线高出2.8倍。该方法仅在自然语言文本上训练，却能迁移到代码领域（0.865 AUC）。代码及训练好的分类器可在 https://github.com/JetBrains-Research/learned-mia 获取。

摘要 (Abstract)

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer’s intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms, their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.

关键词: membership inference attack, fine-tuning, language models, memorization signature, transfer learning, autoregressive models, cross-architecture detection, LT-MIA

110. ❌ Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

作者: Delip Rao, Eric Wong, Chris Callison-Burch 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03173v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究商业LLM和深度研究代理中的引用URL幻觉检测与纠正，与以下关键词高度相关：1) ‘Large Language Models’（研究对象）；2) ‘Self-Correction’（提出urlhealth工具实现自我纠正）；3) ‘LLM Agents’（研究深度研究代理）；4) ‘Tool Use’（使用Wayback Machine和urlhealth工具）；5) ‘Hallucination Mitigation’（解决引用幻觉问题）。与’Retrieval-Augmented Generation’有一定关联（涉及检索增强生成中的引用验证），其他关键词如MoE、量化、推理加速等未涉及。

!!! tip deepseek-chat TL;DR

该论文系统测量了商业大语言模型和深度研究代理中引用URL的幻觉问题（3-13%为完全虚构），并开发了开源工具urlhealth进行URL活性检查和分类，通过工具使用使模型能将不可解析引用减少6-79倍至1%以下。

摘要翻译

大型语言模型与深度研究智能体通过提供引用URL来支撑其主张，然而这些引用的可靠性尚未得到系统性评估。本研究基于DRBench数据集（含53,090个URL）中的10个模型与智能体，以及ExpertQA数据集（涵盖32个学术领域的168,021个URL）中的3个模型，针对引用URL有效性提出了六个研究问题进行分析。研究发现：3%至13%的引用URL属于幻觉生成——它们在Wayback Machine中无记录且可能从未存在；总体上有5%至18%的URL无法解析。深度研究智能体虽在单次查询中生成的引用数量显著多于搜索增强型大语言模型，但其URL幻觉生成率更高。领域差异显著：无法解析率从商业领域的5.4%到神学领域的11.4%不等，而不同模型间的差异更为突出。对失效案例的分解表明，部分模型会完全虚构所有无法解析的URL，另一些模型则存在大量链接失效现象，这反映出真实的检索过程。作为解决方案，我们发布了开源工具urlhealth，该工具利用Wayback Machine实现URL存活性检测及“链接失效”与“幻觉生成”的分类。在智能体自修正实验中，配备urlhealth的模型将无法解析的引用URL降低了6至79倍，使其占比降至1%以下，但其效果取决于模型使用工具的能力。本研究的工具与数据均已公开。通过系统性特征描述、失效分类体系的建立以及开源工具的发布，我们证实了引用URL有效性不仅可大规模量化评估，而且在实践中具有可修正性。

摘要 (Abstract)

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3–13% of citation URLs are hallucinated – they have no record in the Wayback Machine and likely never existed – while 5–18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by $6\textrm{–}79\times$ to under 1%, though effectiveness depends on the model’s tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.

关键词: Large Language Models, Deep Research Agents, Citation Hallucination, URL Validity, Self-Correction, Tool Use, Hallucination Mitigation, Retrieval-Augmented Generation

111. ❌ Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

作者: Nazanin Jafari, James Allan, Mohit Iyyer 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03141v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于评估大语言模型（LLMs）生成长文本的事实性，提出了一个同时衡量精确度和召回率的评估框架，并引入了基于相关性和显著性的重要性加权方案。因此，与’Large Language Models OR LLMs OR Foundation Models’和’Hallucination Mitigation OR Factuality OR Truthfulness’高度相关（10分），因为论文核心就是关于LLMs的事实性评估。与其他关键词（如MoE、SFT、RAG、量化等）无关，因为论文不涉及这些具体的技术、训练方法、架构优化或特定应用领域。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型生成长文本时的事实性评估挑战，提出了一个同时衡量精确度和召回率的综合框架，并发现当前模型在精确度上表现远优于召回度，表明事实不完整性是长文本生成的主要限制。

摘要翻译

评估大型语言模型（LLM）生成长篇输出的真实性仍具挑战性，尤其是在回答具有开放性且包含大量细粒度事实陈述时。现有的评估方法主要关注精确度：它们将回答分解为原子性主张，并依据维基百科等外部知识源逐一验证每个主张。然而，这忽视了真实性评估中一个同等重要的维度：召回率，即生成的回答是否涵盖了应当包含的相关事实。我们提出了一个综合性的真实性评估框架，可同时衡量精确度与召回率。我们的方法利用外部知识源构建参考事实，并判断这些事实是否在生成文本中被捕捉到。我们进一步引入了一种基于相关性与显著性的重要性感知加权方案。我们的分析表明，当前的大型语言模型在精确度上的表现显著优于召回率，这表明事实不完整性仍是长文本生成的主要局限，且模型在覆盖高度重要事实方面的表现通常优于覆盖全部相关事实。

摘要 (Abstract)

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.

关键词: factuality evaluation, large language models, long-form generation, precision, recall, importance-aware weighting, external knowledge sources, factual incompleteness

112. ❌ BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation

作者: Delip Rao, Chris Callison-Burch 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03159v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究LLM在科学出版代理中的应用，直接涉及’Large Language Models’、‘LLM Agents’、‘AI for Science’和’Hallucination Mitigation’（评估和缓解引用幻觉）。‘Retrieval-Augmented Generation’和’Tool Use’相关，因为论文评估了搜索增强的LLM并使用了clibib工具进行检索和修订。‘Self-Correction’有一定关联，因为研究涉及通过工具进行修订以纠正错误。其他关键词如MoE、Scaling Laws、Training方法、推理优化等与论文内容无关。

!!! tip deepseek-chat TL;DR

该论文评估了搜索增强的大型语言模型在科学出版代理中生成BibTeX引用时的幻觉问题，并提出了一个基于权威检索的两阶段集成方法，将准确率从83.6%提升至91.5%，完全正确条目从50.9%增至78.3%。

摘要翻译

具备网络搜索功能的大语言模型正日益应用于学术出版代理工具中，但其生成的BibTeX条目仍普遍存在字段级错误。先前评估仅测试了无搜索功能的基础模型，未能反映当前实践。我们构建了一个包含931篇论文的基准数据集，涵盖四个科学领域和三种引用层级——高引用、低引用及截止日期后的近期文献，旨在区分参数记忆与搜索依赖性，并通过版本感知的真实标注处理同一论文的多个可引用版本。三种支持搜索的前沿模型（GPT-5、Claude Sonnet-4.6、Gemini-3 Flash）生成的BibTeX条目在九个字段和六类错误分类体系下接受评估，产生约23,000个字段级观测数据。总体准确率为83.6%，但仅50.9%的条目完全正确；从高引用文献到近期文献，准确率下降27.7个百分点，表明即使具备搜索功能，模型仍严重依赖参数记忆。字段错误共现分析揭示两种失效模式：整体条目替换（标识字段集体失效）和孤立字段错误。我们评估了开源工具clibib作为缓解机制，该工具通过Zotero Translation Server结合CrossRef回退实现确定性BibTeX检索。在采用权威记录修订基线条目的两阶段集成方案中，准确率提升8.0个百分点至91.5%，完全正确条目从50.9%增至78.3%，回归率仅为0.8%。对比单阶段与两阶段集成的消融实验表明，分离搜索与修订阶段能带来更大增益和更低回归率（0.8%对比4.8%），证明集成架构独立于模型能力产生影响。我们公开基准数据集、错误分类体系及clibib工具，以支持基于大语言模型的科学写作中引用幻觉的评估与缓解。

摘要 (Abstract)

Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers – popular, low-citation, and recent post-cutoff – designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.

关键词: Large Language Models, Scientific Publishing Agents, BibTeX Citation Hallucinations, Retrieval-Augmented Generation, Hallucination Mitigation, Evaluation Benchmark, Tool Integration, AI for Science

113. ❌ StoryScope: Investigating idiosyncrasies in AI fiction

作者: Jenna Russell, Rishanth Rajendhran, Mohit Iyyer, John Wieting 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03136v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究AI生成小说与人类小说的区别，使用LLMs生成故事并分析叙事特征，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。其他关键词涉及具体技术原理（如MoE、量化、推理加速等）、训练方法（如预训练、微调、对齐等）、应用领域（如科学AI）或特定能力（如工具使用、多智能体），论文未涉及这些内容，均给0分。

!!! tip deepseek-chat TL;DR

该论文研究了AI生成小说与人类小说在叙事结构上的差异，通过分析10,272个提示下人类和五个LLMs生成的61,608个故事，发现仅使用叙事特征就能以93.2%的宏F1分数区分人类与AI作品，并识别出AI故事倾向于过度解释主题、情节单一，而人类故事更具道德模糊性和时间复杂性。

摘要翻译

随着人工智能生成小说日益普及，作者身份与原创性问题已成为书面作品评估的核心议题。现有研究多聚焦于识别AI写作的表层特征标记，而本研究则另辟蹊径，探讨在不依赖风格信号的情况下，能否通过话语层面的叙事选择（如人物能动性与时间非线性）区分AI与人类创作的故事。我们提出StoryScope分析框架，该流程能自动推导出涵盖10个维度的细粒度、可解释的话语级叙事特征空间。我们将StoryScope应用于包含10,272个写作提示的平行语料库，每个提示由人类作者和五个大语言模型（LLMs）分别创作，共获得61,608篇故事（每篇约5,000词），每篇故事提取304项特征。仅凭叙事特征，在人类与AI作品检测任务中达到93.2%的宏观F1值，在六方作者归属任务中达到68.4%的宏观F1值，其性能保留了包含风格线索模型97%以上的表现力。由30个核心叙事特征构成的精简集能捕捉大部分判别信号：AI故事倾向于过度阐释主题并偏爱工整的单线情节，而人类故事则将主人公的选择塑造得更具道德模糊性，并呈现更强的时间复杂性。各模型的特征指纹实现了六方作者归属：例如Claude生成的事件升级曲线显著平缓，GPT过度依赖梦境序列，Gemini则默认采用外部人物描写。研究发现AI生成故事在叙事空间中共聚于特定区域，而人类创作故事则展现出更强的多样性。更广泛而言，这些结果表明不仅写作风格，深层叙事建构的差异同样可用于区分人类原创作品与AI生成小说。

摘要 (Abstract)

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each written by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots while human stories frame protagonist’ choices as more morally ambiguous and have increased temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.

关键词: AI-generated fiction, narrative analysis, LLM evaluation, authorship attribution, discourse-level features, human vs AI detection, story generation, narrative diversity

114. ❌ Self-Distilled RLVR

作者: Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, Nan Duan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03128v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM训练方法，提出RLSD结合RLVR与自蒸馏。高度相关关键词：1) ‘Large Language Models’（论文明确研究LLM训练范式），2) ‘RLHF OR RLAIF OR Direct Preference Optimization OR DPO’（RLVR是强化学习与可验证奖励方法，属于RL对齐技术范畴），3) ‘Self-Correction OR Self-Improvement OR Self-Reflection’（自蒸馏涉及模型自我改进机制）。其他关键词如MoE、量化、科学AI等未涉及。

!!! tip deepseek-chat TL;DR

论文针对LLM训练中自蒸馏方法存在信息泄露和不稳定问题，提出RLSD方法结合强化学习与可验证奖励和自蒸馏，实现了更高的收敛上限和训练稳定性。

摘要翻译

在线蒸馏已成为大语言模型领域一种流行的训练范式。该范式选取一个更大的模型作为教师，为每个采样轨迹提供密集、细粒度的信号；相比之下，基于可验证奖励的强化学习仅能从环境中的可验证结果（如答案正确性）获得稀疏信号。近期，学界开始探索在线自蒸馏，即同一模型同时扮演教师和学生的角色，其中教师模型会获得额外特权信息（如参考答案）以实现自我进化。本文指出，仅从特权教师模型推导出的学习信号会导致严重的信息泄露和长期训练不稳定。据此，我们明确了自蒸馏的最佳适用场景，并提出 RLSD（基于可验证奖励的强化学习与自蒸馏融合框架）。具体而言，我们利用自蒸馏获取词元级别的策略差异以确定细粒度的更新幅度，同时继续使用基于可验证奖励的强化学习从环境反馈中推导可靠的更新方向。这使得RLSD能够同时融合基于可验证奖励的强化学习与在线自蒸馏的优势，实现更高的收敛上限和更优的训练稳定性。

摘要 (Abstract)

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose \textbf{RLSD} (\textbf{RL}VR with \textbf{S}elf-\textbf{D}istillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.

关键词: Large Language Models, On-policy distillation, Reinforcement learning with verifiable rewards, Self-distillation, Training stability, Convergence ceiling, Token-level policy, Environmental feedback

115. ❌ Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

作者: Zihe Liu, Yulong Mao, Jinan Xu, Xinrui Peng, Kaiyu Huang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03110v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究知识蒸馏（Knowledge Distillation）用于预训练语言模型压缩，与’Large Language Models’相关（8分），因为涉及预训练语言模型；与’Small Language Models’相关（5分），因为知识蒸馏旨在压缩模型；与’Pre-training’相关（5分），因为涉及预训练模型；与’PEFT’相关（5分），因为低秩分解（Low-rank Factorization）是一种参数高效方法；与’Quantization OR Model Compression’高度相关（10分），因为核心是模型压缩技术。其他关键词如MoE、Scaling Laws、Alignment等未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种多角度知识蒸馏方法（MaKD），通过深度模仿自注意力和前馈模块来压缩预训练语言模型，在相同参数预算下实现了与强基线竞争的性能。

摘要翻译

知识蒸馏是一种有效的预训练语言模型压缩技术。然而，现有方法仅关注层间的知识分布，这可能导致对齐过程中细粒度信息的丢失。为解决这一问题，我们提出了多视角知识蒸馏（Multi-aspect Knowledge Distillation, MaKD）方法，该方法通过更深度地模拟自注意力（self-attention）和前馈模块，以从不同方面捕获丰富的语言知识信息。实验结果表明，在相同的存储参数量预算下，MaKD能够与多种强基线模型取得具有竞争力的性能。此外，我们的方法在蒸馏自回归架构模型方面也表现良好。

摘要 (Abstract)

Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution among layers, which may cause the loss of fine-grained information in the alignment process. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich language knowledge information at different aspects. Experimental results demonstrate that MaKD can achieve competitive performance compared with various strong baselines with the same storage parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.

关键词: Knowledge Distillation, Language Model Compression, Low-rank Factorization, Multi-aspect Knowledge Distillation, Pre-trained Language Model, Self-attention, Feed-forward Modules, Auto-regressive Architecture

116. ❌ NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons

作者: Haonan Dong, Kehan Jiang, Haoran Ye, Wenhao Zhu, Zhaolu Kang, Guojie Song 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02972v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出NeuReasoner框架，专注于大语言模型（LLMs）在复杂推理任务中的性能提升，核心涉及推理过程（Chain of Thought, System 2 Thinking）、自我纠正机制（Self-Correction）、可解释性（Explainable AI）以及使用监督微调（SFT）进行学习。论文引入’Mixture of Neurons (MoN)‘概念，与’Mixture of Experts’有一定关联但非核心。其他关键词如小模型、数据质量、对齐、RAG等未在摘要中提及，故评分为0。

!!! tip deepseek-chat TL;DR

该论文针对大语言模型在复杂推理任务中存在的计算错误、振荡停滞和过度思考等失败模式，提出了基于神经元混合（MoN）的可解释、可控统一推理框架NeuReasoner，通过SFT学习的自我纠正机制，在多个基准测试中实现了性能提升最高27.0%并减少token消耗19.6%~63.3%。

摘要翻译

大型推理模型（Large Reasoning Models, LRMs）近期在复杂推理任务中取得了显著成功。然而，深入审视发现其仍存在影响性能与成本的持续性失效模式：I）步骤内层面，表现为计算或推导错误；II）步骤间层面，涉及振荡与停滞；III）实例层面，导致适应不良的过度思考。现有研究多针对孤立层面而缺乏统一，且其黑盒特性及对强化学习的依赖限制了可解释性与可控性。为弥合这些差距，我们进行了深入的白盒分析，识别出与不同失效模式相关的关键神经元（神经元混合体，Mixture of Neurons, MoN）及其波动模式。基于这些发现，我们提出了NeuReasoner——一个由MoN驱动的可解释、可控且统一的推理框架。技术上，NeuReasoner集成了用于失效检测的轻量级MLP，以及通过监督微调（SFT）学习的特殊令牌触发自校正机制。在推理过程中，一旦检测到失效，系统即插入特殊令牌以触发可控的补救行为。在六个基准测试、六种骨干模型（8B~70B）上针对九个竞争基线的广泛评估表明，NeuReasoner实现了最高27.0%的性能提升，同时将令牌消耗降低了19.6% ~ 63.3%。

摘要 (Abstract)

Large Reasoning Models (LRMs) have recently achieved remarkable success in complex reasoning tasks. However, closer scrutiny reveals persistent failure modes compromising performance and cost: I) Intra-step level, marked by calculation or derivation errors; II) Inter-step level, involving oscillation and stagnation; and III) Instance level, causing maladaptive over-thinking. Existing endeavors target isolated levels without unification, while their black-box nature and reliance on RL hinder explainability and controllability. To bridge these gaps, we conduct an in-depth white-box analysis, identifying key neurons (Mixture of Neurons, MoN) and their fluctuation patterns associated with distinct failures. Building upon these insights, we propose NeuReasoner, an explainable, controllable, and unified reasoning framework driven by MoN. Technically, NeuReasoner integrates lightweight MLPs for failure detection with a special token-triggered self-correction mechanism learned via SFT. During inference, special tokens are inserted upon failure detection to actuate controllable remedial behaviors. Extensive evaluations across six benchmarks, six backbone models (8B~70B) against nine competitive baselines, demonstrate that NeuReasoner achieves performance gains of up to 27.0% while reducing token consumption by 19.6% ~ 63.3%.

关键词: Large Reasoning Models, Mixture of Neurons, Self-correction, Explainable AI, Supervised Fine-tuning, Chain of Thought, Reasoning Framework, Performance Improvement

117. ❌ Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

作者: Zihua Wang, Zhitao Lin, Ruibo Li, Yu Zhang, Xu Yang, Siya Mi, Xiu-Shen Wei 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02965v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究Vision-Language-Action (VLA)模型在具身控制中的高效推理方法，属于大模型在机器人/控制领域的应用创新。核心相关关键词：1) ‘Large Language Models OR LLMs OR Foundation Models’ (10分)：VLA模型是大型基础模型，论文明确提到’large foundation models’，是研究的核心对象。2) ‘Speculative Decoding OR Inference Acceleration’ (8分)：论文提出’Speculative Verification’框架，通过结合开环规划和闭环验证来提升推理效率，属于推理加速技术，但并非传统的推测解码方法，因此给8分而非10分。其他关键词如MoE、量化、RAG等均未涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文针对Vision-Language-Action模型在动态环境中执行开环动作块预测时易受环境变化影响的问题，提出了Speculative Verification框架，通过结合低频宏观规划和轻量级在线验证，在保持高效性的同时提升了控制鲁棒性。

摘要翻译

视觉-语言-动作（Vision-Language-Action，VLA）模型作为具身控制的大型基础模型，在操作任务中展现出强大性能。然而，其高性能伴随着高昂的推理成本。为提升效率，近期研究采用动作分块（action chunking）方法，通过预测未来动作序列进行开环执行。尽管该方法能有效减少计算量，但由于缺乏闭环反馈，开环执行对环境变化敏感，且易产生误差累积。为克服这一局限，我们提出用于VLA控制的推测性验证框架（Speculative Verification for VLA Control，SV-VLA），该框架将高效的开环长程规划与轻量级闭环在线验证相结合。具体而言，SV-VLA使用重型VLA模型作为低频宏观规划器，生成动作块及规划上下文；同时，轻量级验证器基于最新观测持续监控执行过程。验证器结合当前观测与规划上下文，将规划动作与闭环参考动作进行比对，仅在必要时触发重新规划。实验表明，SV-VLA融合了分块预测的高效性与闭环控制的鲁棒性，能在动态环境中实现高效可靠的基于VLA的控制。代码已开源：https://github.com/edsad122/SV-VLA。

摘要 (Abstract)

Vision-Language-Action (VLA) models, as large foundation models for embodied control, have shown strong performance in manipulation tasks. However, their performance comes at high inference cost. To improve efficiency, recent methods adopt action chunking, which predicts a sequence of future actions for open-loop execution. Although effective for reducing computation, open-loop execution is sensitive to environmental changes and prone to error accumulation due to the lack of close-loop feedback. To address this limitation, we propose Speculative Verification for VLA Control (SV-VLA), a framework that combines efficient open-loop long-horizon planning with lightweight closed-loop online verification. Specifically, SV-VLA uses a heavy VLA as a low-frequency macro-planner to generate an action chunk together with a planning context, while a lightweight verifier continuously monitors execution based on the latest observations. Conditioned on both the current observation and the planning context, the verifier compares the planned action against a closed-loop reference action and triggers replanning only when necessary. Experiments demonstrate that SV-VLA combines the efficiency of chunked prediction with the robustness of closed-loop control, enabling efficient and reliable VLA-based control in dynamic environments. Code is available: https://github.com/edsad122/SV-VLA.

关键词: Vision-Language-Action models, embodied control, open-loop planning, closed-loop verification, speculative verification, inference efficiency, action chunking, dynamic environments

118. ❌ A Multi-head-based architecture for effective morphological tagging in Russian with open dictionary

作者: K. Skibin, M. Pozhidaev, S. Suschenko 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02926v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于俄语形态标注任务，提出了一种基于多头注意力的架构，属于自然语言处理中的特定任务应用。论文未涉及大模型、深度学习技术原理创新或大模型在不同领域的应用，也未提及任何评分关键词中的技术概念（如LLM、MoE、Scaling Laws、微调方法、推理技术、AI for Science等）。所有关键词均与论文内容完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于多头注意力的新架构，用于俄语形态标注，支持开放词典，在SinTagRus和Taiga数据集上对某些语法类别实现了98-99%的准确率，超越了先前结果。

摘要翻译

本文提出一种基于多头注意力机制的新型架构，以解决俄语形态标注问题。词向量预处理包括将词语拆分为子词单元，随后通过训练程序将子词单元向量聚合为完整词元向量。该方法支持开放词典，并能结合词语局部结构（如前缀、词尾等）分析形态特征。开放词典的设定使得模型未来能够分析训练数据集中未出现的词汇。在SinTagRus和Taiga数据集上进行的计算实验表明，对于部分语法范畴，所提架构达到了98-99%及以上的准确率，超越了已有研究成果。对于十分之九的词汇，该架构能精确预测所有语法范畴，并指出何时不应对词语进行范畴分析。同时，基于该架构的模型可在消费级图形加速器上完成训练，保留了多头注意力机制相对于循环神经网络的全部优势（本方法未使用RNN），无需在大型无标注文本库（如BERT）上进行预训练，且处理速度优于先前研究结果。

摘要 (Abstract)

The article proposes a new architecture based on Multi-head attention to solve the problem of morphological tagging for the Russian language. The preprocessing of the word vectors includes splitting the words into subtokens, followed by a trained procedure for aggregating the vectors of the subtokens into vectors for tokens. This allows to support an open dictionary and analyze morphological features taking into account parts of words (prefixes, endings, etc.). The open dictionary allows in future to analyze words that are absent in the training dataset. The performed computational experiment on the SinTagRus and Taiga datasets shows that for some grammatical categories the proposed architecture gives accuracy 98-99% and above, which outperforms previously known results. For nine out of ten words, the architecture precisely predicts all grammatical categories and indicates when the categories must not be analyzed for the word. At the same time, the model based on the proposed architecture can be trained on consumer-level graphics accelerators, retains all the advantages of Multi-head attention over RNNs (RNNs are not used in the proposed approach), does not require pretraining on large collections of unlabeled texts (like BERT), and shows higher processing speed than previous results.

关键词: morphological tagging, Russian language, multi-head attention, open dictionary, subtoken aggregation, SinTagRus, Taiga, computational experiment

119. ❌ BioUNER: A Benchmark Dataset for Clinical Urdu Named Entity Recognition

作者: Wazir Ali, Adeeb Noor, Sanaullah Mahar, Alia, Muhammad Mazhar Younas 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02904v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文主要关注生物医学领域的乌尔都语命名实体识别（NER）基准数据集构建，使用了传统机器学习（SVM）和深度学习模型（LSTM、mBERT、XLM-RoBERTa）进行评估。论文与大多数关键词（如LLM、MoE、Scaling Laws、RLHF、RAG、CoT、Agents等）完全无关，因为这些关键词涉及大模型技术原理、训练方法、推理优化、智能体等前沿方向，而本文仅使用基础深度学习模型进行NER任务。唯一的相关关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文涉及生物医学文本处理，属于AI在科学领域的应用，但创新性有限（主要是数据集构建和基础模型评估），因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究构建了一个用于临床乌尔都语命名实体识别的黄金标准基准数据集BioUNER，并通过评估SVM、LSTM、mBERT和XLM-RoBERTa等模型验证了其效用。

摘要翻译

本文提出了一种用于生物医学乌尔都语命名实体识别（BioUNER）的黄金标准基准数据集，该数据集通过爬取乌尔都语在线新闻门户的健康相关文章、医疗处方以及医院健康博客和网站内容构建而成。经过预处理后，三位熟悉医学领域的母语标注者使用Doccano文本标注工具参与了标注过程，共标注了153K个词元。标注完成后，对所提出的BioUNER数据集进行了内在和外在评估。标注者间一致性得分达到0.78，从而验证了该数据集符合黄金标准质量。为展示数据集的实用性和基准测试能力，我们评估了多种机器学习和深度学习模型，包括支持向量机（SVM）、长短期记忆网络（LSTM）、多语言BERT（mBERT）以及XLM-RoBERTa。这一黄金标准的BioUNER数据集可作为可靠的基准，并为乌尔都语语言处理资源增添了宝贵内容。

摘要 (Abstract)

In this article, we present a gold-standard benchmark dataset for Biomedical Urdu Named Entity Recognition (BioUNER), developed by crawling health-related articles from online Urdu news portals, medical prescriptions, and hospital health blogs and websites. After preprocessing, three native annotators with familiarity in the medical domain participated in the annotation process using the Doccano text annotation tool and annotated 153K tokens. Following annotation, the proposed BioiUNER dataset was evaluated both intrinsically and extrinsically. An inter-annotator agreement score of 0.78 was achieved, thereby validating the dataset as gold-standard quality. To demonstrate the utility and benchmarking capability of the dataset, we evaluated several machine learning and deep learning models, including Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Multilingual BERT (mBERT), and XLM-RoBERTa. The gold-standard BioUNER dataset serves as a reliable benchmark and a valuable addition to Urdu language processing resources.

关键词: Biomedical Named Entity Recognition, Urdu language processing, Benchmark dataset, Clinical text, Deep learning models, Multilingual BERT, XLM-RoBERTa, Inter-annotator agreement

120. ❌ GRADE: Probing Knowledge Gaps in LLMs through Gradient Subspace Dynamics

作者: Yujing Wang, Yuanbang Liang, Yukun Lai, Hainan Zhang, Hanqi Yan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02830v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文GRADE专注于LLM内部知识检测，通过梯度子空间动态来量化知识差距，与’Large Language Models’高度相关（10分），因为这是核心研究对象。与’Mechanistic Interpretability’高度相关（10分），因为方法涉及模型内部机制（梯度、隐藏状态）的解释性分析。与’Self-Correction’和’Hallucination Mitigation’有一定关联（各5分），因为知识差距检测有助于识别模型不足，可能间接支持自我改进和事实性提升。其他关键词如MoE、SFT、RAG等未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文提出GRADE方法，通过分析LLM的梯度子空间动态来检测模型内部知识差距，以解决判断模型是否具备足够知识正确回答问题这一挑战，并在多个基准测试中验证了其有效性和鲁棒性。

摘要翻译

检测模型内部知识是否足以正确回答给定问题，是部署负责任大语言模型（LLM）时的核心挑战。除了通过LLM自报告以语言化表达置信度外，近期方法开始探索模型内部状态，例如通过响应词元的隐藏状态来捕捉知识被激活的程度。我们认为，此类被激活的知识可能与查询的实际需求不一致，例如可能捕获了与回答查询无关的文本风格或长度特征。为弥补这一不足，我们提出GRADE方法（基于梯度动态的知识缺口检测），该方法通过梯度在跨层间的秩比与对应隐藏状态子空间秩比的对比来量化知识缺口。其动机在于梯度具备作为给定目标所需知识更新估计量的特性。我们在六个基准测试上验证了GRADE，证明了其有效性以及对输入扰动的鲁棒性。此外，我们通过案例研究展示了梯度链如何为长文本答案生成可解释的知识缺口说明。

摘要 (Abstract)

Detecting whether a model’s internal knowledge is sufficient to correctly answer a given question is a fundamental challenge in deploying responsible LLMs. In addition to verbalising the confidence by LLM self-report, more recent methods explore the model internals, such as the hidden states of the response tokens to capture how much knowledge is activated. We argue that such activated knowledge may not align with what the query requires, e.g., capturing the stylistic and length-related features that are uninformative for answering the query. To fill the gap, we propose GRADE (Gradient Dynamics for knowledge gap detection), which quantifies the knowledge gap via the cross-layer rank ratio of the gradient to that of the corresponding hidden state subspace. This is motivated by the property of gradients as estimators of the required knowledge updates for a given target. We validate \modelname{} on six benchmarks, demonstrating its effectiveness and robustness to input perturbations. In addition, we present a case study showing how the gradient chain can generate interpretable explanations of knowledge gaps for long-form answers.

关键词: knowledge gap detection, gradient dynamics, LLM internals, hidden states, interpretable explanations, model confidence, responsible LLMs, cross-layer rank ratio

121. ❌ Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection

作者: Chaoqun He, Yingfa Chen, Chaojun Xiao, Xu Han, Lijie Wen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02819v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大型推理模型（LLMs）向小型模型（SLMs）的知识蒸馏，特别是针对链式思维（CoT）推理轨迹的蒸馏。论文直接涉及’Large Language Models’（教师模型）、‘Small Language Models’（学生模型）和’Chain of Thought’（核心方法），这些是论文的核心内容，因此给予高分（10-15分）。论文也涉及深度推理过程（System 2 Thinking），因此给予10分。其他关键词如MoE、Scaling Laws、Pre-training、RLHF等，论文未涉及，因此给予0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Gen-SSD的学生参与式链式思维蒸馏框架，通过生成时选择机制，解决了将大型模型的复杂推理轨迹有效迁移到小型模型的难题，并在数学推理基准上显著超越了标准知识蒸馏和其他基线方法。

摘要翻译

大型推理模型通过长链思维轨迹在复杂任务上展现出强大性能，但直接将此类推理过程迁移至小型模型仍具挑战性。关键难点在于并非所有教师模型生成的推理轨迹都适合学生学习。现有方法通常依赖事后过滤，即基于启发式标准在完整生成后筛选轨迹。然而，此类方法无法控制生成过程本身，仍可能产生超出学生学习能力的推理路径。为突破这一局限，我们提出Gen-SSD（生成时自选择蒸馏），一种融入学生参与的循环框架，在生成过程中进行动态选择。该框架使学生不再被动接收完整轨迹，而是在教师模型的采样过程中评估候选延续步骤，仅引导可学习推理路径的扩展，并对无效分支实现早期剪枝。在数学推理基准测试上的实验表明，Gen-SSD持续优于标准知识蒸馏及近期基线方法：相较于标准知识蒸馏提升约5.9个百分点，较其他基线方法最高提升4.7个百分点。进一步分析显示，Gen-SSD能产生更稳定且易于学习的推理轨迹，这凸显了在生成过程中融入监督机制对实现高效蒸馏的重要性。

摘要 (Abstract)

Large reasoning models achieve strong performance on complex tasks through long chain-of-thought (CoT) trajectories, but directly transferring such reasoning processes to smaller models remains challenging. A key difficulty is that not all teacher-generated reasoning trajectories are suitable for student learning. Existing approaches typically rely on post-hoc filtering, selecting trajectories after full generation based on heuristic criteria. However, such methods cannot control the generation process itself and may still produce reasoning paths that lie outside the student’s learning capacity. To address this limitation, we propose Gen-SSD (Generation-time Self-Selection Distillation), a student-in-the-loop framework that performs generation-time selection. Instead of passively consuming complete trajectories, the student evaluates candidate continuations during the teacher’s sampling process, guiding the expansion of only learnable reasoning paths and enabling early pruning of unhelpful branches. Experiments on mathematical reasoning benchmarks demonstrate that Gen-SSD consistently outperforms standard knowledge distillation and recent baselines, with improvements of around 5.9 points over Standard KD and up to 4.7 points over other baselines. Further analysis shows that Gen-SSD produces more stable and learnable reasoning trajectories, highlighting the importance of incorporating supervision during generation for effective distillation.

关键词: knowledge distillation, chain-of-thought, reasoning models, student-in-the-loop, generation-time selection, mathematical reasoning, model compression, teacher-student learning

122. ❌ EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors

作者: Ryuhei Miyazato, Shunsuke Kitada, Kei Harada 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02784v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于视觉语言模型（VLMs）的幻觉检测，这是大模型领域的一个重要子方向。与关键词’Large Language Models OR LLMs OR Foundation Models’高度相关（8分），因为VLMs是基础模型的一种。与’Hallucination Mitigation OR Factuality OR Truthfulness’高度相关（10分），因为这是论文的核心研究问题。与’Mechanistic Interpretability OR Explainable AI’相关（8分），因为论文利用内部表征（注意力输出、隐藏状态）进行检测，属于模型可解释性范畴。其他关键词（如MoE、SFT、RAG、Agents等）与论文内容无直接关联，均得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一个名为EnsemHalDet的集成框架，通过组合多个基于视觉语言模型内部表征的检测器，显著提升了多模态幻觉检测的鲁棒性和准确性。

摘要翻译

视觉语言模型（VLMs）在多模态任务中表现出色，但其仍易产生与事实不符或脱离输入图像依据的幻觉现象。近期研究表明，利用模型内部表征进行幻觉检测的方法，比仅依赖模型输出的方案更为高效和准确。然而，现有基于内部表征的方法通常依赖单一表征或检测器，限制了其捕捉多样化幻觉信号的能力。本文提出EnsemHalDet，一种基于集成的幻觉检测框架，该框架综合利用视觉语言模型的多种内部表征，包括注意力输出与隐藏状态。EnsemHalDet为每种表征训练独立的检测器，并通过集成学习将其组合。在多个视觉问答（VQA）数据集及不同视觉语言模型上的实验结果表明，EnsemHalDet在AUC指标上持续优于现有方法及单检测器模型。这些结果证明，集成多样化的内部信号能显著提升多模态幻觉检测的鲁棒性。

摘要 (Abstract)

Vision-Language Models (VLMs) excel at multimodal tasks, but they remain vulnerable to hallucinations that are factually incorrect or ungrounded in the input image. Recent work suggests that hallucination detection using internal representations is more efficient and accurate than approaches that rely solely on model outputs. However, existing internal-representation-based methods typically rely on a single representation or detector, limiting their ability to capture diverse hallucination signals. In this paper, we propose EnsemHalDet, an ensemble-based hallucination detection framework that leverages multiple internal representations of VLMs, including attention outputs and hidden states. EnsemHalDet trains independent detectors for each representation and combines them through ensemble learning. Experimental results across multiple VQA datasets and VLMs show that EnsemHalDet consistently outperforms prior methods and single-detector models in terms of AUC. These results demonstrate that ensembling diverse internal signals significantly improves robustness in multimodal hallucination detection.

关键词: Vision-Language Models, Hallucination Detection, Internal Representations, Ensemble Learning, Multimodal Tasks, VQA, Robustness, Attention Outputs

123. ❌ When Modalities Remember: Continual Learning for Multimodal Knowledge Graphs

作者: Linyu Li, Zhi Jin, Yichi Zhang, Dongming Jin, Yuanpeng He, Haoran Duan, Gadeng Luosang, Nyima Tashi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02778v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究持续多模态知识图谱推理（CMMKGR），属于多模态AI和持续学习领域，与深度学习相关。但论文未明确涉及大模型（LLMs）或特定的大模型技术（如MoE、量化、推理加速等）。唯一相关的是“Pre-training OR Continual Pre-training OR Domain Adaptation”，因为论文涉及持续学习（continual learning）和领域适应（处理动态图谱），但并非核心预训练技术，故给5分。其他关键词如AI for Science、RAG、Agents等均未涉及。

!!! tip deepseek-chat TL;DR

该论文针对动态多模态知识图谱中的灾难性遗忘问题，提出了一个持续多模态知识图谱推理模型MRCKG，通过多模态-结构协作课程、跨模态知识保留机制和多模态对比重放方案，有效保留了历史知识并显著提升了新知识的学习效果。

摘要翻译

现实世界的多模态知识图谱（Multimodal Knowledge Graphs, MMKGs）具有动态性，新的实体、关系和多模态知识会随时间不断涌现。现有的持续知识图谱推理（Continual Knowledge Graph Reasoning, CKGR）方法主要关注结构三元组，无法充分利用新实体带来的多模态信号。而现有的多模态知识图谱推理（Multimodal Knowledge Graph Reasoning, MMKGR）方法通常假设图谱是静态的，在图谱演化过程中会遭受灾难性遗忘。为填补这一空白，本文对持续多模态知识图谱推理（Continual Multimodal Knowledge Graph Reasoning, CMMKGR）进行了系统性研究。我们基于现有MMKG数据集构建了多个持续多模态知识图谱基准，并提出了一种新的CMMKGR模型——MRCKG。具体而言，MRCKG采用一种多模态-结构协同课程学习机制，根据新三元组与历史图谱的结构连通性及其多模态兼容性来调度渐进式学习。同时，模型引入了一种跨模态知识保留机制，通过保持实体表示稳定性、关系语义一致性和模态锚定来缓解遗忘。此外，模型采用了一种基于两阶段优化策略的多模态对比回放方案，通过多模态重要性采样和表示对齐来强化已学知识。在多个数据集上的实验表明，MRCKG在显著提升新知识学习能力的同时，有效保留了先前学到的多模态知识。

摘要 (Abstract)

Real-world multimodal knowledge graphs (MMKGs) are dynamic, with new entities, relations, and multimodal knowledge emerging over time. Existing continual knowledge graph reasoning (CKGR) methods focus on structural triples and cannot fully exploit multimodal signals from new entities. Existing multimodal knowledge graph reasoning (MMKGR) methods, however, usually assume static graphs and suffer catastrophic forgetting as graphs evolve. To address this gap, we present a systematic study of continual multimodal knowledge graph reasoning (CMMKGR). We construct several continual multimodal knowledge graph benchmarks from existing MMKG datasets and propose MRCKG, a new CMMKGR model. Specifically, MRCKG employs a multimodal-structural collaborative curriculum to schedule progressive learning based on the structural connectivity of new triples to the historical graph and their multimodal compatibility. It also introduces a cross-modal knowledge preservation mechanism to mitigate forgetting through entity representation stability, relational semantic consistency, and modality anchoring. In addition, a multimodal contrastive replay scheme with a two-stage optimization strategy reinforces learned knowledge via multimodal importance sampling and representation alignment. Experiments on multiple datasets show that MRCKG preserves previously learned multimodal knowledge while substantially improving the learning of new knowledge.

关键词: Continual Learning, Multimodal Knowledge Graphs, Catastrophic Forgetting, Knowledge Graph Reasoning, Multimodal Representation, Cross-modal Preservation, Contrastive Replay, Dynamic Graphs

124. ❌ Multiple-Debias: A Full-process Debiasing Method for Multilingual Pre-trained Language Models

作者: Haoyu Liang, Peijian Zeng, Wentao Huang, Aimin Yang, Dong Zhou 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02772v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究多语言预训练语言模型（MPLMs）的去偏方法，与’Large Language Models’高度相关（10分），因为MPLMs是LLMs的一种。论文涉及预训练和微调阶段，与’Pre-training’和’Post-training’相关（各8分），但未深入技术细节。论文明确使用参数高效微调（PEFT），与’PEFT’高度相关（10分）。其他关键词如MoE、SLMs、RAG、对齐等均未在摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Multiple-Debias的全流程多语言去偏方法，通过结合多语言反事实数据增强和自去偏技术，在预训练和微调阶段有效减少了多语言预训练语言模型在性别、种族和宗教方面的偏见。

摘要翻译

多语言预训练语言模型已成为自然语言处理的关键工具，但其常表现出涉及性别、种族与宗教等敏感属性的偏见。本文提出一种名为Multiple-Debias的综合多语言去偏方法，以应对跨语言偏见问题。通过在多语言反事实数据增强和多语言自去偏技术的基础上，结合预处理与后处理阶段的参数高效微调，我们在四种语言的三种敏感属性上显著降低了多语言预训练语言模型的偏见水平。同时，我们将CrowS-Pairs数据集扩展至德语、西班牙语、中文和日语，以此验证我们针对性别、种族和宗教偏见的全流程多语言去偏方法。实验结果表明：（1）多语言去偏方法在有效缓解偏见方面优于单语言方法；（2）整合不同语言的去偏信息能显著提升多语言预训练语言模型的公平性。

摘要 (Abstract)

Multilingual Pre-trained Language Models (MPLMs) have become essential tools for natural language processing. However, they often exhibit biases related to sensitive attributes such as gender, race, and religion. In this paper, we introduce a comprehensive multilingual debiasing method named Multiple-Debias to address these issues across multiple languages. By incorporating multilingual counterfactual data augmentation and multilingual Self-Debias across both pre-processing and post-processing stages, alongside parameter-efficient fine-tuning, we significantly reduced biases in MPLMs across three sensitive attributes in four languages. We also extended CrowS-Pairs to German, Spanish, Chinese, and Japanese, validating our full-process multilingual debiasing method for gender, racial, and religious bias. Our experiments show that (i) multilingual debiasing methods surpass monolingual approaches in effectively mitigating biases, and (ii) integrating debiasing information from different languages notably improves the fairness of MPLMs.

关键词: Multilingual Pre-trained Language Models, Debiasing, Counterfactual Data Augmentation, Self-Debias, Parameter-efficient Fine-tuning, Fairness, Gender Bias, Racial Bias

125. ❌ Generative Frontiers: Why Evaluation Matters for Diffusion Language Models

作者: Patrick Pynadath, Jiaxin Shi, Ruqi Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02718v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于扩散语言模型的评估方法，属于大模型技术范畴，因此与’Large Language Models’有一定关联（5分）。但论文未涉及其他关键词的具体技术（如MoE、SFT、RAG等）或应用领域（如AI for Science），故其他关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文讨论了扩散语言模型评估方法的局限性，提出了基于生成困惑度和熵分解的生成前沿方法，以改进模型生成质量的评估。

摘要翻译

扩散语言模型近期取得了令人振奋的进展，相比自回归模型，其在生成轨迹上提供了更大的灵活性。这种灵活性促使越来越多的研究投入到扩散语言建模的新方法中，这些研究通常以GPT-2小规模模型（1.5亿参数）为起点展开。然而，这些进展也带来了评估方法上的新问题。在本技术报告中，我们讨论了当前方法的局限性，并提出了原则性的改进方案，以确保可靠的比较。我们首先探讨了为何OpenWebText已成为标准基准，以及为何LM1B等替代方案本质上意义较小。接着，我们分析了似然评估在扩散模型中的局限，并解释了为何仅依赖生成困惑度作为指标可能导致无意义的结果。为解决这一问题，我们证明了生成困惑度与熵是模型与参考分布之间KL散度的两个组成部分。这一分解揭示了生成困惑度对熵的敏感性，并自然地提出了生成前沿作为评估模型生成质量的原则性方法。最后，我们基于该规模下的模型质量给出了实证观察。我们附上了一篇包含交互内容的博客文章，以图解相关论点，详见https://patrickpynadath1.github.io/blog/eval_methodology/。

摘要 (Abstract)

Diffusion language models have seen exciting recent progress, offering far more flexibility in generative trajectories than autoregressive models. This flexibility has motivated a growing body of research into new approaches to diffusion language modeling, which typically begins at the scale of GPT-2 small (150 million parameters). However, these advances introduce new issues with evaluation methodology. In this technical note, we discuss the limitations of current methodology and propose principled augmentations to ensure reliable comparisons. We first discuss why OpenWebText has become the standard benchmark, and why alternatives such as LM1B are inherently less meaningful. We then discuss the limitations of likelihood evaluations for diffusion models, and explain why relying on generative perplexity alone as a metric can lead to uninformative results. To address this, we show that generative perplexity and entropy are two components of the KL divergence to a reference distribution. This decomposition explains generative perplexity’s sensitivity to entropy, and naturally suggests generative frontiers as a principled method for evaluating model generative quality. We conclude with empirical observations on model quality at this scale. We include a blog post with interactive content to illustrate the argument at https://patrickpynadath1.github.io/blog/eval_methodology/.

关键词: Diffusion Language Models, Evaluation Methodology, Generative Perplexity, KL Divergence, Generative Frontiers, Model Quality, Benchmark, OpenWebText

126. ❌ Breakdowns in Conversational AI: Interactional Failures in Emotionally and Ethically Sensitive Contexts

作者: Jiawen Deng, Wentao Zhang, Ziyun Jiao, Fuji Ren 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02713v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究对话AI在情感和伦理敏感情境中的交互失败，与’Instruction Tuning OR Alignment OR Value Alignment’高度相关（10分），因为核心关注AI的价值对齐和伦理一致性；与’Large Language Models OR LLMs OR Foundation Models’有一定关联（8分），因为研究基于主流对话模型；其他关键词主要涉及具体技术方法或领域应用，与论文的HCI/交互诊断焦点无关（0分）。

!!! tip deepseek-chat TL;DR

该研究探讨了对话AI在情感和伦理敏感情境中出现的交互故障，发现主流模型存在情感错位、伦理指导失败等反复性故障模式，并提出了分类和改进视角。

摘要翻译

对话式人工智能正日益被部署于情感负荷高且伦理敏感的人机交互场景中。先前研究主要集中于情感基准测试或静态安全审查，忽视了在动态对话过程中对齐机制如何展开。我们探讨以下研究问题：当对话智能体面临情感与伦理敏感行为时，会出现何种系统性故障？这些故障如何影响对话质量？为压力测试聊天机器人性能，我们开发了一种基于人格条件的用户模拟器，该模拟器能够通过预设心理人格与分阶段情感节奏进行多轮对话。分析表明，主流模型存在随着情感轨迹升级而加剧的重复性故障。我们识别出若干常见失效模式，包括情感错位、伦理引导失效，以及共情超越或削弱责任感的跨维度权衡现象。我们将这些模式归纳为分类体系，并探讨其设计启示，强调在动态交互中保持伦理一致性与情感敏感度的必要性。本研究为人机交互领域在价值敏感与情感负荷场景下的对话式人工智能诊断与改进提供了新视角。

摘要 (Abstract)

Conversational AI is increasingly deployed in emotionally charged and ethically sensitive interactions. Previous research has primarily concentrated on emotional benchmarks or static safety checks, overlooking how alignment unfolds in evolving conversation. We explore the research question: what breakdowns arise when conversational agents confront emotionally and ethically sensitive behaviors, and how do these affect dialogue quality? To stress-test chatbot performance, we develop a persona-conditioned user simulator capable of engaging in multi-turn dialogue with psychological personas and staged emotional pacing. Our analysis reveals that mainstream models exhibit recurrent breakdowns that intensify as emotional trajectories escalate. We identify several common failure patterns, including affective misalignments, ethical guidance failures, and cross-dimensional trade-offs where empathy supersedes or undermines responsibility. We organize these patterns into a taxonomy and discuss the design implications, highlighting the necessity to maintain ethical coherence and affective sensitivity throughout dynamic interactions. The study offers the HCI community a new perspective on the diagnosis and improvement of conversational AI in value-sensitive and emotionally charged contexts.

关键词: Conversational AI, Emotionally Sensitive Contexts, Ethically Sensitive Contexts, Interactional Failures, Alignment, Affective Misalignments, Ethical Guidance, HCI

127. ❌ Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

作者: Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02669v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究LLM的偏见评估和alignment效果，直接相关关键词为’Large Language Models’和’Instruction Tuning OR Alignment OR Value Alignment’，给10分。论文涉及偏见评估和模型行为分析，与’Factuality’和’Explainable AI’有一定关联，给5分。其他关键词如MoE、SLMs、Scaling Laws、RLHF、RAG等均未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

该论文研究发现，大型语言模型的偏见表现具有任务依赖性，在显性任务中会对抗刻板印象，但在隐性任务中会重现刻板印象，且当前的对齐实践掩盖了表征性危害而非减轻它。

摘要翻译

语言模型存在多大偏见？答案取决于提问方式。一个拒绝为领导职位选择特定种姓的模型，在填空任务中却会稳定地将上等种姓与纯洁性关联，将低等种姓与缺乏卫生条件关联。单一任务基准测试会遗漏这种矛盾，因为它们仅捕捉模型偏见特征的一个切片。我们提出了一个涵盖9种偏见类型的层次化分类体系，包括种姓、语言和地域偏见等研究不足的维度，并通过7项评估任务进行可操作化测量，这些任务涵盖从显式决策到隐式关联的完整光谱。通过对7个商业及开源权重大语言模型进行约4.5万次提示审计，我们发现了三个系统性规律：首先，偏见具有任务依赖性——模型在显式探测中对抗刻板印象，却在隐式任务中复现这些偏见，同一模型对相同身份群体的刻板印象评分在不同任务类型间差异最高达0.43；其次，安全对齐具有非对称性——模型拒绝将负面特质赋予边缘化群体，却自由地将积极特质与特权群体关联；第三，研究不足的偏见维度在所有模型中表现出最强的刻板印象，表明对齐工作的重点遵循基准测试的覆盖范围而非危害严重性。这些结果证明，单一基准审计会系统性地误判大语言模型的偏见特征，当前的对齐实践更多是掩盖表征性危害而非真正缓解它。

摘要 (Abstract)

How biased is a language model? The answer depends on how you ask. A model that refuses to choose between castes for a leadership role will, in a fill-in-the-blank task, reliably associate upper castes with purity and lower castes with lack of hygiene. Single-task benchmarks miss this because they capture only one slice of a model’s bias profile. We introduce a hierarchical taxonomy covering 9 bias types, including under-studied axes like caste, linguistic, and geographic bias, operationalized through 7 evaluation tasks that span explicit decision-making to implicit association. Auditing 7 commercial and open-weight LLMs with \textasciitilde45K prompts, we find three systematic patterns. First, bias is task-dependent: models counter stereotypes on explicit probes but reproduce them on implicit ones, with Stereotype Score divergences up to 0.43 between task types for the same model and identity groups. Second, safety alignment is asymmetric: models refuse to assign negative traits to marginalized groups, but freely associate positive traits with privileged ones. Third, under-studied bias axes show the strongest stereotyping across all models, suggesting alignment effort tracks benchmark coverage rather than harm severity. These results demonstrate that single-benchmark audits systematically mischaracterize LLM bias and that current alignment practices mask representational harm rather than mitigating it.

关键词: Large Language Models, Bias Evaluation, Alignment, Stereotyping, Task-dependent Bias, Safety Alignment, Representational Harm, Auditing

128. ❌ SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究大语言模型（LLMs/Foundation Models）的社会经济地位偏见评估，因此该关键词得10分（高度相关）。论文涉及偏见评估与缓解，与’Alignment’（价值对齐）和’Factuality’（真实性/幻觉缓解）有一定关联，各得5分。论文的评估框架涉及模型行为解释，与’Explainable AI’有一定关联，得5分。其他关键词如MoE、SLMs、训练技术、推理优化、智能体等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了SocioEval框架来系统评估基础模型中的社会经济地位偏见，通过对13个前沿大语言模型的测试发现偏见率存在显著差异（0.42%-33.75%），且偏见在不同主题中表现不同，部署防护措施能防止显性歧视但对领域特定刻板印象效果有限。

摘要翻译

随着大型语言模型（LLM）日益在关键领域的决策系统中发挥核心作用，理解并减轻其偏见对于负责任的人工智能部署至关重要。尽管针对种族、性别等属性的偏见评估框架已大量涌现，但社会经济地位偏见——尽管在现实世界中影响广泛——仍未得到充分探索。我们提出了SocioEval，一个基于模板的框架，旨在通过决策任务系统性地评估基础模型中的社会经济偏见。我们的分层框架涵盖8个主题和18个话题，生成了跨越6种类别对组合的240个提示。我们采用严格的三阶段标注协议，基于3,120条回复评估了13个前沿LLM，揭示了显著的偏见率差异（0.42%-33.75%）。研究结果表明，偏见在不同主题中的表现存在差异：生活方式判断中的偏见程度是教育相关决策的10倍；同时，现有的部署防护措施虽能有效阻止显性歧视，但对特定领域的刻板印象表现出脆弱性。SocioEval为审计语言模型中基于社会阶层的偏见提供了一个可扩展、可延伸的基础。

摘要 (Abstract)

As Large Language Models (LLMs) increasingly power decision-making systems across critical domains, understanding and mitigating their biases becomes essential for responsible AI deployment. Although bias assessment frameworks have proliferated for attributes such as race and gender, socioeconomic status bias remains significantly underexplored despite its widespread implications in the real world. We introduce SocioEval, a template-based framework for systematically evaluating socioeconomic bias in foundation models through decision-making tasks. Our hierarchical framework encompasses 8 themes and 18 topics, generating 240 prompts across 6 class-pair combinations. We evaluated 13 frontier LLMs on 3,120 responses using a rigorous three-stage annotation protocol, revealing substantial variation in bias rates (0.42%-33.75%). Our findings demonstrate that bias manifests differently across themes lifestyle judgments show 10$\times$ higher bias than education-related decisions and that deployment safeguards effectively prevent explicit discrimination but show brittleness to domain-specific stereotypes. SocioEval provides a scalable, extensible foundation for auditing class-based bias in language models.

关键词: Large Language Models, Foundation Models, Socioeconomic Bias, Bias Evaluation, Decision-making Systems, Responsible AI, Template-based Framework, Model Auditing

129. ❌ Revealing the Learning Dynamics of Long-Context Continual Pre-training

作者: Yupu Liang, Shuang Chen, Guanwei Zhang, Shaolei Wang, Suncong Zheng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02650v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究长上下文持续预训练（LCCP）在工业级大语言模型（Hunyuan-A13B，800亿参数）上的学习动态，直接高度相关于’Large Language Models’、‘Pre-training/Continual Pre-training’和’Context Window Extension/Long Context LLMs’（均10分）。论文涉及数据规模（Scaling Laws AND Data Quality，8分）、监督微调（SFT，8分）和机制可解释性（Mechanistic Interpretability，8分）。其他关键词如MoE、SLMs、对齐、RAG、推理、代理、压缩等未在摘要中提及，评为0分。

!!! tip deepseek-chat TL;DR

本文系统研究了工业级大语言模型在长上下文持续预训练中的学习动态，发现需要大规模数据（超过1500亿词元）才能达到饱和，传统评估方法存在'欺骗性饱和'，而基于困惑度的分析和检索头注意力模式能更可靠地监控训练进展。

摘要翻译

现有关于长上下文持续预训练（Long-Context Continual Pre-training, LCCP）的研究主要集中于小规模模型和有限数据规模（数百亿词元）下的设定。我们认为，直接将此类小规模设定迁移至工业级模型存在适应不足与训练过早终止的风险。此外，当前评估方法过度依赖下游基准测试（如“大海捞针”任务），往往无法反映模型内在的收敛状态，并可能导致“欺骗性饱和”。本文首次基于工业级模型 Hunyuan-A13B（总参数量 800 亿）对 LCCP 的学习动态进行了系统性研究，追踪了其在 2000 亿词元训练轨迹上的演化过程。具体而言，我们提出了一个分层分析框架，从行为层面（基于监督微调的探测）、概率层面（困惑度）以及机制层面（注意力模式）对 LCCP 动态进行剖析。我们的研究发现：（1）大规模数据扩展的必要性：数百亿词元的训练规模对于工业级大语言模型的 LCCP 是不充分的（例如 Hunyuan-A13B 需训练超过 1500 亿词元才达到饱和）。（2）欺骗性饱和与内在饱和：传统的“大海捞针”评分会过早报告“虚假饱和”，而基于困惑度的分析则揭示了模型内在能力的持续提升，且与下游性能表现出更强的相关性。（3）训练稳定性的机制监测：检索头可作为高效、低成本的训练监测器，其动态变化的注意力分数能够可靠地追踪 LCCP 进展，并与监督微调结果高度相关。本研究为工业级大语言模型的 LCCP 提供了一个全面的监测框架、评估体系及机制性解释。

摘要 (Abstract)

Existing studies on Long-Context Continual Pre-training (LCCP) mainly focus on small-scale models and limited data regimes (tens of billions of tokens). We argue that directly migrating these small-scale settings to industrial-grade models risks insufficient adaptation and premature training termination. Furthermore, current evaluation methods rely heavily on downstream benchmarks (e.g., Needle-in-a-Haystack), which often fail to reflect the intrinsic convergence state and can lead to “deceptive saturation”. In this paper, we present the first systematic investigation of LCCP learning dynamics using the industrial-grade Hunyuan-A13B (80B total parameters), tracking its evolution across a 200B-token training trajectory. Specifically, we propose a hierarchical framework to analyze LCCP dynamics across behavioral (supervised fine-tuning probing), probabilistic (perplexity), and mechanistic (attention patterns) levels. Our findings reveal: (1) Necessity of Massive Data Scaling: Training regimes of dozens of billions of tokens are insufficient for industrial-grade LLMs’ LCCP (e.g., Hunyuan-A13B reaches saturation after training over 150B tokens). (2) Deceptive Saturation vs. Intrinsic Saturation: Traditional NIAH scores report “fake saturation” early, while our PPL-based analysis reveals continuous intrinsic improvements and correlates more strongly with downstream performance. (3) Mechanistic Monitoring for Training Stability: Retrieval heads act as efficient, low-resource training monitors, as their evolving attention scores reliably track LCCP progress and exhibit high correlation with SFT results. This work provides a comprehensive monitoring framework, evaluation system, and mechanistic interpretation for the LCCP of industrial-grade LLM.

关键词: Long-Context Continual Pre-training, Large Language Models, Industrial-grade LLMs, Learning Dynamics, Training Saturation, Mechanistic Interpretation, Attention Patterns, Hunyuan-A13B

130. ❌ Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

作者: Yiyang Shen, Lifu Tu, Weiran Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02621v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究基于强化学习（RL）框架，使用LLM作为评判者（LLM-as-a-Judge）来生成奖励信号，以进行无监督的知识蒸馏。因此，与’Large Language Models’高度相关（10分），因为LLM是核心组件；与’RLHF’等强化学习对齐技术高度相关（10分），因为论文提出了基于LLM评判的RL框架，属于RL对齐的范畴。其他关键词如MoE、SLMs、Scaling Laws、PEFT、RAG、推理加速、AI for Science等均未在摘要中提及，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于强化学习的知识蒸馏框架，使用大型语言模型作为评判者来评估模型输出，从而在无需真实标签的情况下生成有效的训练信号，并在数学推理基准上取得了显著性能提升。

摘要翻译

强化学习（Reinforcement Learning, RL）已被证明能显著提升小型及大型语言模型（Large Language Models, LLMs）的推理能力，但现有方法通常依赖于可验证的奖励，即需要真实标签。我们提出一种强化学习框架，该框架利用一个作为评判者的LLM来评估模型在大量无标签数据上的输出，并以此生成奖励信号，从而实现无需标签的知识蒸馏，并取代对真实监督的需求。值得注意的是，该评判者仅输出单个标记，使得奖励计算高效。当与可验证的奖励结合使用时，我们的方法在多项数学推理基准测试中带来了显著的性能提升。这些结果表明，基于LLM的评估器能够为RL微调产生有效的训练信号。

摘要 (Abstract)

Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, hence ground truth labels. We propose an RL framework that uses rewards from an LLM that acts as a judge evaluating model outputs over large amounts of unlabeled data, enabling label-free knowledge distillation and replacing the need of ground truth supervision. Notably, the judge operates with a single-token output, making reward computation efficient. When combined with verifiable rewards, our approach yields substantial performance gains across math reasoning benchmarks. These results suggest that LLM-based evaluators can produce effective training signals for RL fine-tuning.

关键词: Reinforcement Learning, Knowledge Distillation, LLM-as-a-Judge, Unlabeled Data, Math Reasoning, Fine-tuning, Reward Computation, Performance Gains

131. ❌ CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

作者: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03231v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文主要研究视觉-语言模型（VLM）的架构创新，通过融合对比学习和自监督两种互补的视觉编码器来提升性能。与’Large Language Models’相关（8分），因为论文将融合后的视觉特征注入到仅解码器的LLM中；与’Pre-training’相关（5分），因为涉及CLIP-style预训练和DINO自监督预训练；与’Model Merging’相关（5分），因为核心是融合两种不同预训练方式的视觉编码器特征。其他关键词如MoE、SFT、RAG等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为CoME-VL的互补多编码器视觉-语言学习框架，通过融合对比学习和自监督两种视觉编码器的特征，显著提升了视觉理解（平均4.9%）和视觉定位（平均5.4%）任务的性能。

摘要翻译

近期视觉-语言模型通常依赖于通过对比式图像-文本目标（如CLIP风格预训练）训练的单一视觉编码器。虽然对比式编码器在跨模态对齐与检索任务中表现优异，但自监督视觉编码器往往能捕捉更丰富的稠密语义特征，并在识别与理解任务中展现出更强的鲁棒性。本研究旨在探索如何规模化融合这些互补的视觉表征以提升视觉-语言建模性能。我们提出CoME-VL：互补多编码器视觉-语言模型，这是一个模块化融合框架，通过整合对比训练视觉编码器与自监督DINO编码器实现优势互补。该方法通过以下机制实现表征级融合：（1）采用熵引导的多层聚合配合正交约束投影以减少冗余；（2）引入RoPE增强的交叉注意力机制以对齐异构令牌网格，并生成紧凑的融合视觉令牌。融合后的令牌可注入仅解码器架构的大语言模型中，且对标准视觉-语言模型流程改动极小。在多样化视觉-语言基准测试上的大量实验表明，CoME-VL始终优于单编码器基线模型。具体而言，我们在视觉理解任务中观察到平均4.9%的性能提升，在视觉定位任务中提升5.4%。本方法在RefCOCO检测任务上达到最先进性能，同时较基线实现显著超越。最后，我们通过消融实验对层级融合策略、非冗余特征混合机制以及融合容量进行系统分析，以评估对比式与自监督信号的互补性如何影响视觉-语言模型性能。

摘要 (Abstract)

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.

关键词: vision-language models, multi-encoder fusion, contrastive learning, self-supervised learning, visual representation, cross-modal alignment, decoder-only LLM, state-of-the-art performance

132. ❌ HyperCT: Low-Rank Hypernet for Unified Chest CT Analysis

作者: Fengbei Liu, Sunwoo Kwak, Hao Phung, Nusrat Binta Nizam, Ilan Richter, Nir Uriel, Hadar Averbuch-Elor, Daborah Estrin, Mert R. Sabuncu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03224v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文HyperCT专注于医学影像分析（胸部CT），使用Vision Transformer和Hypernetwork进行多任务学习，核心创新是集成LoRA（Low-Rank Adaptation）以实现参数高效调优。因此，仅与关键词’PEFT OR LoRA OR Parameter-efficient Fine-tuning’高度相关（10分），因为LoRA是论文的核心方法；与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（8分），因其属于AI在生物医学（放射学）领域的应用。其他关键词均涉及大语言模型（LLM）或非视觉领域的AI技术，与论文的计算机视觉和医学影像焦点无关，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出HyperCT框架，通过集成低秩适应（LoRA）的Hypernetwork动态调整Vision Transformer，以参数高效的方式统一处理胸部CT的多任务分析，在放射学和心脏病学任务上优于现有基线。

摘要翻译

非对比胸部CT为传统肺部筛查及机会性肺外筛查提供了丰富可能。虽然多任务学习（MTL）能够统一这些多样化任务，但标准的硬参数共享方法在建模不同病理时往往效果欠佳。我们提出HyperCT框架，该框架通过超网络动态适配视觉Transformer主干网络。为确保计算效率，我们集成了低秩自适应（LoRA）技术，使模型能够回归任务特定的低秩权重更新而非完整参数。在涵盖放射学与心脏病学任务的大规模数据集上进行验证，本方法超越了多种强基线模型，为整体患者评估提供了一个统一的、参数高效的解决方案。我们的代码公开于https://github.com/lfb-1/HyperCT。

摘要 (Abstract)

Non-contrast chest CTs offer a rich opportunity for both conventional pulmonary and opportunistic extra-pulmonary screening. While Multi-Task Learning (MTL) can unify these diverse tasks, standard hard-parameter sharing approaches are often suboptimal for modeling distinct pathologies. We propose HyperCT, a framework that dynamically adapts a Vision Transformer backbone via a Hypernetwork. To ensure computational efficiency, we integrate Low-Rank Adaptation (LoRA), allowing the model to regress task-specific low-rank weight updates rather than full parameters. Validated on a large-scale dataset of radiological and cardiological tasks, \method{} outperforms various strong baselines, offering a unified, parameter-efficient solution for holistic patient assessment. Our code is available at https://github.com/lfb-1/HyperCT.

关键词: Chest CT analysis, Multi-Task Learning, Hypernetwork, Vision Transformer, Low-Rank Adaptation, Parameter-efficient, Radiological tasks, Cardiological tasks

133. ❌ VOSR: A Vision-Only Generative Model for Image Super-Resolution

作者: Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Xiangtao Kong, Jixin Zhao, Shihao Wang, Lei Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03225v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的图像超分辨率任务，提出了一种纯视觉生成模型VOSR。虽然论文涉及生成模型、训练策略和效率优化，但所有关键词都明确针对大语言模型（LLM）及其相关技术（如对齐、推理、代理等），而本文完全不涉及语言模型、文本处理或LLM技术栈。论文的核心是视觉编码器、扩散模型和图像恢复，与给定的LLM关键词无直接关联。

!!! tip deepseek-chat TL;DR

该研究提出了一种纯视觉生成模型VOSR用于图像超分辨率，通过视觉语义引导和恢复导向的指导策略，在无需多模态预训练的情况下实现了与基于文本到图像模型的方法相竞争甚至更优的性能，同时大幅降低了训练成本。

摘要翻译

近期多数生成式图像超分辨率方法依赖于对网络规模文本-图像数据预训练的大型文生图扩散模型进行适配。尽管有效，但该范式始于通用文生图生成器，而超分辨率本质上是基于低分辨率输入条件的图像复原任务。本研究探讨仅使用视觉数据训练的超分辨率模型能否与基于文生图的方法相媲美。为此，我们提出VOSR——一种纯视觉生成式超分辨率框架。我们首先通过预训练的视觉编码器从低分辨率输入中提取语义丰富且空间基础的特征作为视觉语义引导。随后重新审视生成模型训练中的无分类器引导策略，发现标准无条件分支不适用于从头训练的复原模型。因此我们将其替换为一种保留弱低分辨率锚点的复原导向引导策略。基于这些设计，我们首先从头训练多步VOSR模型，进而将其蒸馏为单步模型以实现高效推理。VOSR所需的训练成本不足代表性文生图超分辨率方法的十分之一，但在多步与单步设定下均能实现相当甚至更优的感知质量与效率，同时在合成与真实世界基准测试中生成更忠实结构并减少幻象。我们的研究首次证明，无需多模态预训练即可实现高质量生成式超分辨率。代码与模型可通过https://github.com/cswry/VOSR获取。

摘要 (Abstract)

Most of the recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, despite that SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency, while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results, for the first time, show that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.

关键词: image super-resolution, generative model, vision-only, diffusion model, classifier-free guidance, training efficiency, hallucination reduction, visual semantic guidance

134. ❌ ProtoFlow: Mitigating Forgetting in Class-Incremental Remote Sensing Segmentation via Low-Curvature Prototype Flow

作者: Jiekai Wu, Rong Fu, Chuangqi Li, Zijian Zhang, Guangxin Wu, Hao Zhang, Shiyin Lin, Jianyuan Ni, Yang Li, Dongxu Zhang, Amir H. Gandomi, Simon Fong, Pengbin Feng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03212v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文专注于遥感图像分割中的持续学习问题，提出ProtoFlow框架来缓解灾难性遗忘，属于计算机视觉和机器学习领域。与大多数大模型技术关键词（如LLMs、MoE、RLHF等）完全无关，因为这些关键词主要针对自然语言处理和大语言模型。仅与两个关键词有微弱关联：1）‘Pre-training OR Continual Pre-training OR Domain Adaptation’（5分）：论文涉及持续学习（continual learning），可视为一种特殊的领域适应（domain adaptation）或持续训练场景，但并非大模型预训练；2）‘AI for Science OR Bioinformatics OR Cheminformatics’（8分）：遥感图像分割属于科学应用（AI for Science），特别是地球科学和遥感领域，因此有一定相关性。其他关键词均不适用。

!!! tip deepseek-chat TL;DR

该论文针对遥感图像分割中的持续学习问题，提出了ProtoFlow框架，通过建模类原型的低曲率轨迹演化来稳定表示并减少遗忘，在标准基准测试中实现了mIoU指标的显著提升。

摘要翻译

实际部署中的遥感分割本质上是持续性的：新的语义类别不断涌现，采集条件也会随季节、城市和传感器而变化。尽管近期研究取得了进展，许多增量学习方法仍将训练步骤视为孤立的更新过程，导致表征漂移和遗忘问题未能得到充分控制。本文提出ProtoFlow，一种时间感知的原型动态框架，该框架将类别原型建模为轨迹，并通过显式的时间向量场学习其演化规律。通过联合实施低曲率运动约束与类间分离机制，ProtoFlow在持续学习过程中稳定了原型几何结构。在标准的类别增量与域增量遥感基准测试上的实验表明，该方法相较于强基线模型取得了一致性提升，包括mIoUall指标最高提升1.5-2.0个百分点，同时减少了遗忘现象。这些结果表明，显式建模原型的时间演化过程是实现鲁棒性持续遥感分割的一种实用且可解释的策略。

摘要 (Abstract)

Remote sensing segmentation in real deployment is inherently continual: new semantic categories emerge, and acquisition conditions shift across seasons, cities, and sensors. Despite recent progress, many incremental approaches still treat training steps as isolated updates, which leaves representation drift and forgetting insufficiently controlled. We present ProtoFlow, a time-aware prototype dynamics framework that models class prototypes as trajectories and learns their evolution with an explicit temporal vector field. By jointly enforcing low-curvature motion and inter-class separation, ProtoFlow stabilizes prototype geometry throughout incremental learning. Experiments on standard class- and domain-incremental remote sensing benchmarks show consistent gains over strong baselines, including up to 1.5-2.0 points improvement in mIoUall, together with reduced forgetting. These results suggest that explicitly modeling temporal prototype evolution is a practical and interpretable strategy for robust continual remote sensing segmentation.

关键词: continual learning, remote sensing segmentation, class-incremental learning, prototype evolution, forgetting mitigation, low-curvature motion, temporal dynamics, domain adaptation

135. ❌ The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge Report

作者: Bin Ren, Hang Guo, Yan Shu, Jiaqi Ma, Ziteng Cui, Shuhong Liu, Guofeng Mei, Lei Sun, Zongwei Wu, Fahad Shahbaz Khan, Salman Khan, Radu Timofte, Yawei Li, Hongyuan Yu, Pufan Xu, Chen Wu, Long Peng, Jiaojiao Yi, Siyang Yi, Yuning Cui, Jingyuan Xia, Xing Mou, Keji He, Jinlin Wu, Zongang Gao, Sen Yang, Rui Zheng, Fengguo Li, Yecheng Lei, Wenkai Min, Jie Liu, Keye Cao, Shubham Sharma, Manish Prasad, Haobo Li, Matin Fazel, Abdelhak Bentaleb, Rui Chen, Shurui Shi, Zitao Dai, Qingliang Liu, Yang Cheng, Jing Hu, Xuan Zhang, Rui Ding, Tingyi Zhang, Hui Deng, Mengyang Wang, Fulin Liu, Jing Wei, Qian Wang, Hongying Liu, Mingyang Li, Guanglu Dong, Zheng Yang, Chao Ren, Hongbo Fang, Lingxuan Li, Lin Si, Pan Gao, Moncef Gabbouj, Watchara Ruangsang, Supavadee Aramvith 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03198v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文是关于高效单图像超分辨率（super-resolution）的挑战赛报告，专注于计算机视觉领域的图像处理技术，特别是网络设计以降低计算成本（如运行时、参数、FLOPs）同时保持图像质量（PSNR）。论文内容涉及传统深度学习模型（如CNN）在特定视觉任务中的应用，但未提及任何大语言模型（LLMs）、大模型技术原理（如MoE、Scaling Laws、微调方法等）、大模型应用场景（如推理、代理、对齐）或科学AI应用（如生物信息学）。所有评分关键词均与大模型或深度学习在科学领域的应用无关，因此相关度均为0。

!!! tip deepseek-chat TL;DR

该论文总结了NTIRE 2026高效单图像超分辨率挑战赛，旨在设计降低计算成本（如运行时、参数、FLOPs）同时保持图像质量（PSNR约26.90-26.99 dB）的网络，并报告了15支参赛团队的最新结果。

摘要翻译

本文回顾了NTIRE 2026高效单图像超分辨率挑战赛，重点介绍了所提出的解决方案与结果。该挑战赛的目标是设计一种网络，在降低一个或多个方面（如运行时间、参数量和浮点运算次数）的同时，在DIV2K_LSDIR_valid数据集上保持峰值信噪比（PSNR）约为26.90 dB，在DIV2K_LSDIR_test数据集上保持约26.99 dB。本次挑战赛共有95名注册参与者，其中15支团队提交了有效方案。这些成果衡量了高效单图像超分辨率领域的当前最优水平。

摘要 (Abstract)

This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. They gauge the state-of-the-art results for efficient single-image super-resolution.

关键词: efficient super-resolution, single-image super-resolution, network design, runtime reduction, parameter reduction, FLOPs reduction, PSNR, DIV2K_LSDIR dataset

136. ❌ The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

作者: Takuya Shiba 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03191v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究Vision-Language-Action (VLA)模型在机器人操作任务中的扩展问题，聚焦于离散化动作表示如何成为信息瓶颈并限制模型性能提升。所有评分关键词均针对大语言模型(LLMs)及相关技术（如MoE、RLHF、RAG、量化等），而本文研究的是视觉-语言-动作模型，属于机器人学习/物理AI领域，未涉及LLMs技术原理或应用。虽然属于AI for Science的广义范畴（物理AI），但具体内容与生物信息学/化学信息学无关，且未使用LLMs相关技术，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

研究发现，在视觉-语言-动作模型中，当动作被离散化为固定容量的token时，离散化代码本会成为信息瓶颈，阻碍视觉编码器改进带来的性能提升，揭示了物理AI扩展需要识别管道中的信息瓶颈而非单纯增加模型规模。

摘要翻译

通过提升视觉编码器来扩展视觉-语言-动作（VLA）模型，有望提升下游操作任务的性能——正如其在视觉-语言建模中所展现的那样。然而，我们发现当动作被表示为离散标记时，这一预期并不成立，并通过一个信息论原理——我们称之为“压缩间隙”——解释了其原因：在任何视觉运动管道中，扩展行为受制于最严格的信息瓶颈所在位置。当动作是连续的（例如扩散策略），视觉编码器成为关键约束，提升它可直接改善性能。当动作通过固定容量的码本（例如OAT）被离散化时，码本则成为关键约束，编码器的改进无法跨越此瓶颈传递——无论上游表征多么丰富。我们在LIBERO基准测试中通过三条证据验证了这一原理：一项析因实验表明，编码器升级使扩散策略性能提升超过21个百分点，而OAT在不同模型规模下的增益则大幅减弱；一项涵盖四种编码器的编码器质量梯度实验证实，扩散策略随编码器质量单调提升，而OAT则保持平坦；以及一项码本规模实验证明，放宽码本容量可部分恢复编码器敏感性，为瓶颈假设提供了因果证据。我们的研究揭示，物理人工智能的扩展需要识别管道中的信息瓶颈所在，而非简单地统一增加模型或数据规模。

摘要 (Abstract)

Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance–as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it–regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.

关键词: Vision-Language-Action models, discrete tokenization, information bottleneck, Compression Gap, scaling behavior, codebook capacity, physical AI, LIBERO benchmark

137. ❌ SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

作者: Wenfeng Zhang, Jun Ni, Yue Meng, Xiaodong Pei, Wei Hu, Qibing Qin, Lei Huang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03176v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于无人机图像目标检测的计算机视觉任务，提出了一种名为SFFNet的协同特征融合网络，包含多尺度动态双域耦合模块和协同特征金字塔网络等技术。论文内容完全围绕传统计算机视觉架构（如CNN、特征金字塔、边缘检测）展开，未涉及任何大语言模型、深度学习技术原理创新或AI for Science应用。所有评分关键词均与大模型、深度学习技术原理或科学AI应用相关，而本文是纯粹的计算机视觉目标检测研究，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对无人机图像中背景噪声复杂和目标尺度不平衡导致的目标检测难题，提出了一种具有双域边缘增强的协同特征融合网络SFFNet，在VisDrone和UAVDT数据集上取得了优异性能。

摘要翻译

无人机影像中的目标检测仍是一项极具挑战性的任务，其主要难点源于背景噪声的复杂性与目标尺度的不均衡性。传统方法往往难以从复杂背景中有效分离目标，且未能充分利用图像中蕴含的丰富多尺度信息。为解决这些问题，我们提出了一种专为无人机影像目标检测设计的、具备双域边缘增强的协同特征融合网络（Synergistic Feature Fusion Network, SFFNet）。首先，设计了多尺度动态双域耦合（Multi-scale Dynamic Dual-domain Coupling, MDDC）模块。该模块引入了一种在频域与空间域双驱动的边缘提取架构，能够有效将多尺度目标边缘从背景噪声中解耦。其次，为增强模型颈部在几何信息与语义信息两方面的表征能力，提出了协同特征金字塔网络（Synergistic Feature Pyramid Network, SFPN）。SFPN利用线性可变形卷积自适应捕捉不规则目标形状，并通过设计的广域感知模块（Wide-area Perception Module, WPM）建立目标周围的远程上下文关联。此外，为适应不同应用场景或资源受限环境，我们设计了六种不同规模（N/S/M/B/L/X）的检测器。在两个具有挑战性的航空影像数据集（VisDrone与UAVDT）上的实验表明，SFFNet-X取得了卓越性能，分别达到36.8 AP与20.6 AP。轻量化模型（N/S）亦在检测精度与参数量效率之间保持了良好平衡。代码将在https://github.com/CQNU-ZhangLab/SFFNet公开。

摘要 (Abstract)

Object detection in unmanned aerial vehicle (UAV) images remains a highly challenging task, primarily caused by the complexity of background noise and the imbalance of target scales. Traditional methods easily struggle to effectively separate objects from intricate backgrounds and fail to fully leverage the rich multi-scale information contained within images. To address these issues, we have developed a synergistic feature fusion network (SFFNet) with dual-domain edge enhancement specifically tailored for object detection in UAV images. Firstly, the multi-scale dynamic dual-domain coupling (MDDC) module is designed. This component introduces a dual-driven edge extraction architecture that operates in both the frequency and spatial domains, enabling effective decoupling of multi-scale object edges from background noise. Secondly, to further enhance the representation capability of the model’s neck in terms of both geometric and semantic information, a synergistic feature pyramid network (SFPN) is proposed. SFPN leverages linear deformable convolutions to adaptively capture irregular object shapes and establishes long-range contextual associations around targets through the designed wide-area perception module (WPM). Moreover, to adapt to the various applications or resource-constrained scenarios, six detectors of different scales (N/S/M/B/L/X) are designed. Experiments on two challenging aerial datasets (VisDrone and UAVDT) demonstrate the outstanding performance of SFFNet-X, achieving 36.8 AP and 20.6 AP, respectively. The lightweight models (N/S) also maintain a balance between detection accuracy and parameter efficiency. The code will be available at https://github.com/CQNU-ZhangLab/SFFNet.

关键词: UAV image object detection, synergistic feature fusion network, dual-domain edge enhancement, multi-scale dynamic dual-domain coupling, synergistic feature pyramid network, linear deformable convolutions, wide-area perception module, lightweight models

138. ❌ Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

作者: Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, Jun Guo, Nan Sun, Long Qian, Xinghang Li, Xin Xiao, Jing Liu, Nianfeng Liu, Tao Kong, Yan Huang, Liang Wang, Tieniu Tan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03181v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出MV-VDP，一种用于机器人操作的多视角视频扩散策略，核心是联合建模环境的3D时空状态，通过预测热图视频和RGB视频来对齐视频预训练与动作微调的表示格式。论文主要涉及计算机视觉、机器人学和视频生成领域，与大多数大模型技术关键词（如LLMs、MoE、RLHF等）无关。相关关键词：1) ‘Pre-training OR Continual Pre-training OR Domain Adaptation’（5分）：论文提到现有方法依赖在静态图像-文本对上预训练的骨干网络，而MV-VDP对齐了视频预训练与动作微调，涉及预训练和领域适应概念；2) ‘Post-training OR Supervised Fine-tuning OR SFT’（5分）：论文提到动作微调（action finetuning），属于微调范畴；3) ‘Mechanistic Interpretability OR Explainable AI’（5分）：论文提到MV-VDP实现可解释的操作（interpretable manipulation），涉及可解释AI；4) ‘World Models AND General World Models’（5分）：论文建模环境动态和预测未来视频，隐含世界模型概念；5) ‘AI for Science OR Bioinformatics OR Cheminformatics’（5分）：机器人操作可视为AI在科学/工程领域的应用。其他关键词如LLMs、推理方法、代理系统等与论文内容无直接关联。

!!! tip deepseek-chat TL;DR

该论文针对机器人操作中现有策略忽视3D时空环境动态的问题，提出了MV-VDP多视角视频扩散策略，通过联合预测热图视频和RGB视频来建模环境状态，实验表明该方法仅需少量演示轨迹就能实现数据高效、鲁棒且可泛化的操作，在Meta-World和真实机器人平台上优于现有方法。

摘要翻译

机器人操作需要同时理解环境的三维空间结构及其时序演化，但现有策略大多忽略其中一方面或两者。这些策略通常依赖二维视觉观测以及基于静态图像-文本对预训练的骨干网络，导致数据需求高且对环境动态理解有限。为此，我们提出MV-VDP（多视角视频扩散策略），该策略能联合建模环境的三维时空状态。其核心思想是同步预测多视角热图视频与RGB视频，这实现了两大目标：1）将视频预训练的表征格式与动作微调对齐；2）不仅指明机器人应执行何种动作，同时预测环境如何响应这些动作而演化。大量实验表明，MV-VDP能够实现数据高效、鲁棒、可泛化且可解释的操作。仅需十条演示轨迹且无需额外预训练，MV-VDP即可成功执行复杂的真实世界任务，在一系列模型超参数下表现出强鲁棒性，泛化至分布外场景，并预测逼真的未来视频。在Meta-World仿真平台与真实机器人平台上的实验证明，MV-VDP在基于视频预测、三维建模及视觉-语言-动作的各类模型中均取得更优性能，为数据高效的多任务操作确立了新的技术标杆。

摘要 (Abstract)

Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image–text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction–based, 3D-based, and vision–language–action models, establishing a new state of the art in data-efficient multi-task manipulation.

关键词: Multi-view video diffusion policy, 3D spatio-temporal modeling, Robotic manipulation, Video prediction, Data-efficient learning, Heatmap video generation, Action finetuning, Environment dynamics

139. ❌ EffiMiniVLM: A Compact Dual-Encoder Regression Framework

作者: Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03172v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出EffiMiniVLM，一个紧凑的双编码器视觉-语言回归框架，用于冷启动场景下的产品质量预测。它主要涉及小型高效模型设计（MiniLM文本编码器、EfficientNet-B0图像编码器）和资源效率优化（27.7M参数、6.8 GFLOPs），与’Small Language Models OR SLMs OR On-device AI’（5分）和’Quantization OR Model Compression OR Low-bit Weights’（5分）有一定关联，因为模型强调紧凑性和效率，但未明确使用量化或压缩技术。其他关键词如大模型、MoE、训练方法、推理技术、代理系统等均未涉及，评分为0。论文属于大模型在特定领域（电子商务）的应用，但技术焦点是轻量级架构而非大模型本身，因此相关性较低。

!!! tip deepseek-chat TL;DR

该论文提出一个紧凑的视觉-语言回归框架EffiMiniVLM，用于冷启动场景下基于图像和文本的产品质量预测，在仅使用20%数据集的情况下，以27.7M参数和6.8 GFLOPs实现了与更大模型竞争的性能，并显著提高了资源效率。

摘要翻译

在冷启动场景中，基于多模态商品信息预测产品质量至关重要，此类场景下用户交互历史缺失，预测必须依赖图像与文本元数据。然而，现有的视觉-语言模型通常依赖大型架构和/或大规模外部数据集，导致计算成本高昂。为此，我们提出EffiMiniVLM——一个紧凑的双编码器视觉-语言回归框架，它集成了EfficientNet-B0图像编码器、基于MiniLM的文本编码器以及一个轻量级回归头。为提高训练样本效率，我们引入加权Huber损失函数，利用评分数量强调更可靠的样本，从而获得持续的性能提升。仅使用Amazon Reviews 2023数据集的20%进行训练，所提模型包含2770万参数，需6.8 GFLOPs运算量，却在基准测试中以最低资源成本取得了0.40的CES分数。尽管模型规模小，其仍能与显著更大的模型竞争，在达到相当性能的同时，资源效率比其他前五名方法高出约4至8倍，且是唯一不使用外部数据集的方法。进一步分析表明，仅将数据规模扩展至40%即可使我们的模型超越其他使用更大模型和数据集的方法，这凸显了该紧凑设计框架强大的可扩展性。

摘要 (Abstract)

Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model’s compact design.

关键词: vision-language model, compact model, dual-encoder, regression framework, cold-start prediction, resource efficiency, EfficientNet-B0, MiniLM

140. ❌ CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

作者: Yuhan Pu, Hao Zheng, Ziqian Mo, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03156v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文CAMEO专注于条件图像编辑的多智能体框架，研究内容涉及图像生成、编辑控制、质量评估和反馈循环，但未涉及大语言模型（LLM）技术原理、训练方法、推理优化、对齐技术、模型压缩、科学AI应用等关键词。所有关键词均与大语言模型或深度学习技术原理直接相关，而本文研究的是图像编辑的特定应用框架，与这些关键词无直接关联。

!!! tip deepseek-chat TL;DR

论文提出了一种名为CAMEO的多智能体框架，通过将条件图像编辑重构为质量感知、反馈驱动的过程，解决了现有单步生成方法缺乏显式质量控制、容易产生结构伪影和环境不一致修改的问题，在异常插入和人体姿态切换任务上相比现有方法平均胜率提高了20%。

摘要翻译

条件图像编辑旨在根据文本提示及可选的参考指导对源图像进行修改。此类编辑在需要严格结构控制的场景中至关重要（例如驾驶场景中的异常物体插入和复杂人体姿态变换）。尽管近期大规模编辑模型（如Seedream、Nano Banana等）取得了进展，但大多数方法仍依赖于单步生成范式。这种范式通常缺乏显式的质量控制，可能导致图像与原始内容过度偏离，并频繁产生结构伪影或与环境不一致的修改，往往需要人工调整提示词才能获得可接受的结果。我们提出\textbf{CAMEO}，一种结构化多智能体框架，将条件编辑重新定义为质量感知、反馈驱动的过程，而非单次生成任务。CAMEO将编辑分解为规划、结构化提示、假设生成和自适应参考锚定等协同阶段，仅在任务复杂度需要时才调用外部指导。为克服现有方法缺乏内在质量控制的缺陷，评估机制被直接嵌入编辑循环中。中间结果通过结构化反馈进行迭代优化，形成一个逐步修正结构与上下文不一致性的闭环流程。我们在异常物体插入和人体姿态切换任务上评估CAMEO。在多种强效编辑骨干模型和独立评估模型的测试中，相较于多个先进模型，CAMEO平均持续获得20%的胜率提升，证明了其在条件图像编辑中增强的鲁棒性、可控性和结构可靠性。

摘要 (Abstract)

Conditional image editing aims to modify a source image according to textual prompts and optional reference guidance. Such editing is crucial in scenarios requiring strict structural control (i.e., anomaly insertion in driving scenes and complex human pose transformation). Despite recent advances in large-scale editing models (i.e., Seedream, Nano Banana, etc), most approaches rely on single-step generation. This paradigm often lacks explicit quality control, may introduce excessive deviation from the original image, and frequently produces structural artifacts or environment-inconsistent modifications, typically requiring manual prompt tuning to achieve acceptable results. We propose \textbf{CAMEO}, a structured multi-agent framework that reformulates conditional editing as a quality-aware, feedback-driven process rather than a one-shot generation task. CAMEO decomposes editing into coordinated stages of planning, structured prompting, hypothesis generation, and adaptive reference grounding, where external guidance is invoked only when task complexity requires it. To overcome the lack of intrinsic quality control in existing methods, evaluation is embedded directly within the editing loop. Intermediate results are iteratively refined through structured feedback, forming a closed-loop process that progressively corrects structural and contextual inconsistencies. We evaluate CAMEO on anomaly insertion and human pose switching tasks. Across multiple strong editing backbones and independent evaluation models, CAMEO consistently achieves 20% more win rate on average compared to multiple state-of-the-art models, demonstrating improved robustness, controllability, and structural reliability in conditional image editing.

关键词: conditional image editing, multi-agent framework, quality-aware, feedback-driven, structural control, anomaly insertion, human pose transformation, closed-loop process

141. ❌ SD-FSMIS: Adapting Stable Diffusion for Few-Shot Medical Image Segmentation

作者: Meihua Li, Yang Zhang, Weizhao He, Hu Qu, Yisong Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03134v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文SD-FSMIS专注于将预训练的Stable Diffusion模型适配到医学图像分割任务，属于AI for Science（生物信息学/医学影像）领域，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。论文涉及预训练模型的适配，与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（8分），因为使用了预训练的Stable Diffusion模型并进行领域适配。其他关键词主要涉及大语言模型（LLM）技术、推理、对齐、优化等，而本文研究的是扩散模型在医学图像分割的应用，与这些关键词无直接关联，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出SD-FSMIS框架，通过适配预训练的Stable Diffusion模型来解决少样本医学图像分割问题，在标准设置和跨域场景中均表现出竞争性的性能和优秀的泛化能力。

摘要翻译

少样本医学图像分割（FSMIS，Few-Shot Medical Image Segmentation）旨在仅利用极少量标注样本对医学图像中的新目标类别进行分割，以应对医学影像领域中普遍存在的数据稀缺和域偏移等关键挑战。尽管扩散模型（DM，Diffusion Models）在视觉任务中表现出色，但其在FSMIS中的应用潜力在很大程度上尚未得到探索。我们认为，大规模扩散模型学习到的丰富视觉先验，为构建更鲁棒且数据高效的分割方法提供了强大基础。本文提出SD-FSMIS，这是一个新颖的框架，旨在将强大的预训练Stable Diffusion（SD）模型有效适配于FSMIS任务。我们通过引入两个关键组件——支持集-查询集交互（SQI，Support-Query Interaction）和视觉-文本条件转换器（VTCT，Visual-to-Textual Condition Translator），重新利用其条件生成架构。具体而言，SQI提供了一种直接而强大的方式，使SD能够适应FSMIS范式。VTCT模块将来自支持集的视觉线索转换为隐式文本嵌入，用以引导扩散模型，从而实现对生成过程的精确条件控制。大量实验表明，在标准设置下，SD-FSMIS相比现有先进方法取得了具有竞争力的结果。令人惊讶的是，它在更具挑战性的跨域场景中也展现出优异的泛化能力。这些发现凸显了适配大规模生成模型在推动数据高效且鲁棒的医学图像分割方面的巨大潜力。

摘要 (Abstract)

Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel object classes in medical images using only minimal annotated examples, addressing the critical challenges of data scarcity and domain shifts prevalent in medical imaging. While Diffusion Models (DM) excel in visual tasks, their potential for FSMIS remains largely unexplored. We propose that the rich visual priors learned by large-scale DMs offer a powerful foundation for a more robust and data-efficient segmentation approach. In this paper, we introduce SD-FSMIS, a novel framework designed to effectively adapt the powerful pre-trained Stable Diffusion (SD) model for the FSMIS task. Our approach repurposes its conditional generative architecture by introducing two key components: a Support-Query Interaction (SQI) and a Visual-to-Textual Condition Translator (VTCT). Specifically, SQI provides a straightforward yet powerful means of adapting SD to the FSMIS paradigm. The VTCT module translates visual cues from the support set into an implicit textual embedding that guides the diffusion model, enabling precise conditioning of the generation process. Extensive experiments demonstrate that SD-FSMIS achieves competitive results compared to state-of-the-art methods in standard settings. Surprisingly, it also demonstrated excellent generalization ability in more challenging cross-domain scenarios. These findings highlight the immense potential of adapting large-scale generative models to advance data-efficient and robust medical image segmentation.

关键词: Stable Diffusion, Few-Shot Medical Image Segmentation, Diffusion Models, Domain Adaptation, Cross-domain Generalization, Visual Priors, Support-Query Interaction, Visual-to-Textual Condition Translator

142. ❌ SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

作者: Xiaoran Zhang, Yu Liu, Jinyu Liang, Kangqiushi Li, Zhiwei Huang, Huaxin Xiao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03120v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于无人机热成像地理定位的计算机视觉任务，提出了一种语义级联共识框架（SCC-Loc），并构建了Thermal-UAV数据集。论文的核心技术涉及跨模态匹配、语义引导对齐、几何一致性过滤和物理约束优化，属于计算机视觉和遥感应用领域。所有关键词均与大模型、深度学习技术原理或科学AI应用直接相关，但论文未涉及任何大模型（LLM）、深度学习架构（如MoE）、训练方法（如预训练、微调、对齐）、推理优化、代理系统或模型压缩等技术。仅最后一个关键词“AI for Science OR Bioinformatics OR Cheminformatics”有一定关联，因为论文将AI应用于无人机地理定位（可视为科学或工程应用），但并非核心生物信息学或化学信息学，因此给予5分（有一定关联）。其他关键词完全无关，均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种统一的语义级联共识框架（SCC-Loc），用于解决无人机在GNSS拒止环境中的跨模态热成像地理定位问题，通过语义引导对齐、级联过滤和共识优化，在自建数据集上实现了9.37米的平均定位误差，比基线提高了7.6倍精度。

摘要翻译

跨模态热感地理定位为全球导航卫星系统拒止环境下的无人机提供了全天候的鲁棒性解决方案。然而，显著的热感-可见光模态差异引入了严重的特征模糊性，系统性地破坏了传统的由粗到精配准流程。为突破此瓶颈，我们提出SCC-Loc——一个统一的语义级联共识定位框架。通过在全球检索与MINIMA$_{\text{RoMa}}$匹配中共享单一DINOv2主干网络，该框架最小化了内存占用，并实现了零样本、高精度的绝对位置估计。具体而言，我们通过引入三个紧密协同的组件应对模态模糊性问题：首先，设计语义引导视场对齐模块，自适应优化卫星图像裁剪区域，有效校正初始空间偏差；其次，开发级联空间自适应纹理-结构过滤机制，显式增强几何一致性，从而消除密集跨模态异常匹配；最后，提出共识驱动的可靠性感知位置选择策略，通过物理约束位姿优化的协同作用推导最优解。针对数据稀缺问题，我们构建了Thermal-UAV综合数据集，提供11,890个多样化热感查询图像，并配以大规模卫星正射影像及空间对齐的数字表面模型作为参考。大量实验表明，SCC-Loc确立了新的技术标杆，将平均定位误差降低至9.37米，在严格的5米误差阈值范围内，其精度较最强基线模型提升7.6倍。代码与数据集已开源：https://github.com/FloralHercules/SCC-Loc。

摘要 (Abstract)

Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.

关键词: Thermal Geo-localization, Cross-modal Matching, Semantic Cascade Consensus, UAV Localization, Satellite Imagery, Modality Gap, Zero-shot Estimation, Thermal-UAV Dataset

143. ❌ Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

作者: Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, Jun Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03118v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于视频生成模型的蒸馏技术（SC-DMD和Cache-Distribution-Aware训练），以降低推理成本（2-4 NFEs）并提高实时视频生成质量。所有关键词均与大语言模型（LLMs）相关，而本文研究的是视频生成模型（如Wan 2.1、Self Forcing），属于计算机视觉领域，与大语言模型无直接关联。因此，所有关键词得分为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种自洽分布匹配蒸馏方法（SC-DMD）和缓存感知训练，以在极低推理预算下（2-4 NFEs）提高视频生成质量，并在非自回归和自回归视频生成模型中验证了其有效性。

摘要翻译

将视频生成模型蒸馏至极低推理成本（例如2-4次噪声函数评估，NFEs）对于实时部署至关重要，但仍具挑战性。轨迹式一致性蒸馏在复杂视频动态下往往趋于保守，导致生成结果过度平滑且运动表现薄弱。分布匹配蒸馏（DMD）能够恢复清晰、模式明确的样本，但其局部训练信号未显式规范去噪更新在时间步之间的组合方式，使得组合展开过程易出现漂移。为克服这一难题，我们提出自一致分布匹配蒸馏（SC-DMD），该方法显式规范连续去噪更新的端点一致组合。针对实时自回归视频生成，我们进一步将键值缓存（KV cache）视为质量参数化条件，并提出缓存分布感知训练。该训练方案在多步展开上应用SC-DMD，并引入以缓存为条件的特征对齐目标，引导低质量输出向高质量参考对齐。在非自回归主干模型（如Wan~2.1）和自回归实时范式（如Self Forcing）上的大量实验表明，我们的方法（命名为\textbf{Salt}）能持续提升低NFE视频生成质量，同时保持与多种KV缓存内存机制的兼容性。源代码将在\href{https://github.com/XingtongGe/Salt}{https://github.com/XingtongGe/Salt}发布。

摘要 (Abstract)

Distilling video generation models to extremely low inference budgets (e.g., 2–4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan~2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed \textbf{Salt}, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at \href{https://github.com/XingtongGe/Salt}{https://github.com/XingtongGe/Salt}.

关键词: video generation, consistency distillation, distribution matching, KV cache, inference acceleration, autoregressive generation, low-NFE, real-time deployment

144. ❌ Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

作者: Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu, Jiahuan Long, Yiwei Wei, Tingsong Jiang, Wen Yao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03117v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究红外视觉语言模型（IR-VLMs）的对抗性攻击，属于计算机视觉和模型安全领域，与提供的关键词（主要关注大模型技术原理、训练方法、推理优化、对齐、应用等）无直接关联。论文未涉及任何关键词中的技术或概念，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一个名为UCGP的通用物理对抗补丁框架，用于揭示和攻击红外视觉语言模型在真实世界中的语义理解漏洞，实验表明该攻击能有效破坏跨模态语义对齐并保持跨模型、跨数据集的泛化能力。

摘要翻译

红外视觉语言模型（IR-VLMs）已成为低能见度环境下多模态感知的一种有前景的范式，但其对抗攻击的鲁棒性在很大程度上仍未得到探索。现有的对抗性补丁方法主要针对闭集设置下的RGB模型设计，难以直接适用于红外视觉语言模型开放式的语义理解与物理部署需求。为弥补这一差距，我们提出了通用曲网格补丁（Universal Curved-Grid Patch, UCGP），一种面向红外视觉语言模型的通用物理对抗补丁框架。UCGP集成了曲网格参数化（Curved-Grid Mesh, CGM）方法，以生成连续、低频且可部署的补丁，并结合了统一的表征驱动优化目标，该目标旨在促进表征子空间偏离、拓扑结构破坏与隐蔽性。为提升真实世界部署及域偏移下的鲁棒性，我们进一步引入了元差分进化算法与EOT增强的薄板样条变形建模。UCGP不直接操纵标签或提示词，而是通过干扰视觉表征空间来削弱跨模态语义对齐。大量实验表明，UCGP能持续破坏多种红外视觉语言模型架构的语义理解能力，同时保持跨模型可迁移性、跨数据集泛化性、真实世界物理有效性以及对防御措施的鲁棒性。这些发现揭示了当前红外多模态系统中一个先前被忽视的鲁棒性漏洞。

摘要 (Abstract)

Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.

关键词: Infrared Vision-Language Models, Adversarial Patches, Physical-World Attacks, Cross-modal Semantic Alignment, Robustness Vulnerability, Universal Curved-Grid Patch, Representation Space Disruption, Multimodal Perception

145. ❌ ARIQA-3DS: A Stereoscopic Image Quality Assessment Dataset for Realistic Augmented Reality

作者: Aymen Sekhri, Seyed Ali Amirshahi, Mohamed-Chaker Larabi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03112v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于增强现实（AR）中的立体图像质量评估数据集创建，研究内容涉及计算机视觉、图像处理和人机交互领域，与所有评分关键词（均围绕大模型、深度学习技术原理及其应用）完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对增强现实中真实与虚拟层视觉混淆问题，创建了首个大型立体AR图像质量评估数据集ARIQA-3DS，并通过主观研究发现感知质量主要受前景退化影响且透明度水平起调节作用。

摘要翻译

随着增强现实（AR）技术向沉浸式消费级应用迈进，严格的体验质量评估变得至关重要。然而，现有数据集常缺乏生态效度，依赖单目视图或简化背景，未能捕捉真实与虚拟层之间复杂的感知交互现象（即视觉混淆）。为填补这一空白，我们提出了首个大规模立体AR图像质量评估数据集ARIQA-3DS。该数据集包含1,200个AR视口，在可控透明度与退化条件下，将高分辨率立体全景实景捕捉与多样化的增强前景相融合。我们通过视频透视头戴式显示器对36名参与者开展了全面主观研究，同时收集了质量评分与模拟器眩晕指标。分析表明：感知质量主要受前景退化驱动，并受透明度水平调节；而眼球运动与定向障碍症状在观看过程中呈现渐进式但可控的增长。ARIQA-3DS将公开发布，为开发下一代AR质量评估模型提供综合基准。

摘要 (Abstract)

As Augmented Reality (AR) technologies advance towards immersive consumer adoption, the need for rigorous Quality of Experience (QoE) assessment becomes critical. However, existing datasets often lack ecological validity, relying on monocular viewing or simplified backgrounds that fail to capture the complex perceptual interplay, termed visual confusion, between real and virtual layers. To address this gap, we present ARIQA-3DS, the first large stereoscopic AR Image Quality Assessment dataset. Comprising 1,200 AR viewports, the dataset fuses high-resolution stereoscopic omnidirectional captures of real-world scenes with diverse augmented foregrounds under controlled transparency and degradation conditions. We conducted a comprehensive subjective study with 36 participants using a video see-through head-mounted display, collecting both quality ratings and simulator-sickness indicators. Our analysis reveals that perceived quality is primarily driven by foreground degradations and modulated by transparency levels, while oculomotor and disorientation symptoms show a progressive but manageable increase during viewing. ARIQA-3DS will be publicly released to serve as a comprehensive benchmark for developing next-generation AR quality assessment models.

关键词: Augmented Reality, Stereoscopic Image Quality Assessment, Quality of Experience, Visual Confusion, Subjective Study, Foreground Degradations, Transparency Levels, Simulator-sickness

146. ❌ MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

作者: Jiameng Li, Aleksei Tiulpin, Matthew B. Blaschko 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03072v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）的视觉token剪枝方法以提高推理效率，与’Large Language Models’高度相关（10分），因为MLLMs是LLMs的扩展；与’Quantization OR Model Compression’有一定关联（5分），因为token剪枝是一种模型压缩技术；与’Speculative Decoding OR Inference Acceleration’有一定关联（5分），因为剪枝旨在加速推理；其他关键词如MoE、SFT、RAG等未涉及，得0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于互信息的跨模态token剪枝方法（MI-Pruner），用于高效的多模态大语言模型推理，实验表明其优于基于注意力的剪枝方法且延迟最小。

摘要翻译

对于多模态大语言模型（MLLMs）而言，视觉信息相较于文本相对稀疏。因此，为提升推理效率，视觉剪枝研究应运而生。现有方法通常基于视觉编码器或大语言模型解码器中的注意力分数来衡量令牌重要性，进而选择注意力分数较高的视觉令牌并剪除其余部分。本文探索了一种不同且更具针对性的方法。我们不依赖特定机制信号，而是直接在视觉与文本特征交互之前，计算二者之间的互信息（Mutual Information, MI）。这使得我们能够在特征层面显式衡量跨模态依赖性。我们提出的MI-Pruner方法简单、高效且无需侵入模型，既不需要访问内部注意力图，也无需修改模型架构。实验结果表明，我们的方法在保持极低延迟的同时，性能优于以往基于注意力的剪枝方法。

摘要 (Abstract)

For multimodal large language models (MLLMs), visual information is relatively sparse compared with text. As a result, research on visual pruning emerges for efficient inference. Current approaches typically measure token importance based on the attention scores in the visual encoder or in the LLM decoder, then select visual tokens with high attention scores while pruning others. In this paper, we pursue a different and more surgical approach. Instead of relying on mechanism-specific signals, we directly compute Mutual Information (MI) between visual and textual features themselves, prior to their interaction. This allows us to explicitly measure crossmodal dependency at the feature levels. Our MI-Pruner is simple, efficient and non-intrusive, requiring no access to internal attention maps or architectural modifications. Experimental results demonstrate that our approach outperforms previous attention-based pruning methods with minimal latency.

关键词: Multimodal Large Language Models, Token Pruning, Mutual Information, Efficient Inference, Visual Tokens, Crossmodal Dependency, Attention-based Pruning, MLLMs

147. ❌ SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction

作者: Zicheng Zhang, Xiangting Meng, Ke Wu, Wenchao Ding 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03069v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉和3D重建领域，提出了一种名为SparseSplat的feed-forward 3D高斯泼溅（3DGS）模型，旨在通过自适应调整高斯密度来生成紧凑的3DGS地图，并设计了一个专用的点云网络来解决感受野不匹配问题。论文的核心内容涉及3D重建、渲染优化、点云处理和计算机视觉，与所有评分关键词（均围绕大语言模型、深度学习技术原理及其在科学领域的应用，如生物信息学）的主题完全无关。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对现有feed-forward 3D高斯泼溅方法生成的空间均匀且冗余的3DGS地图问题，提出了SparseSplat模型，通过熵基概率采样自适应调整高斯密度，并设计专用点云网络，实现了仅用22%的高斯数达到最先进渲染质量，仅用1.5%的高斯数保持合理渲染质量。

摘要翻译

前馈式三维高斯泼溅（3DGS）技术的最新进展显著提升了渲染质量。然而，以往前馈式3DGS方法生成的空间均匀且高度冗余的3DGS地图限制了其在下游重建任务中的应用。我们提出SparseSplat，这是首个能够根据场景结构和局部区域信息丰富度自适应调整高斯分布密度的前馈式3DGS模型，可生成高度紧凑的3DGS地图。为实现这一目标，我们提出了基于信息熵的概率采样方法，在纹理稀疏区域生成大而稀疏的高斯分布，而在信息丰富区域分配小而密集的高斯分布。此外，我们设计了一种专用的点云网络，能高效编码局部上下文信息并将其解码为3DGS属性，从而解决了通用3DGS优化流程与前馈模型之间的感受野不匹配问题。大量实验结果表明，SparseSplat仅需22%的高斯分布即可达到最先进的渲染质量，且仅使用1.5%的高斯分布仍能保持合理的渲染效果。项目页面：https://victkk.github.io/SparseSplat-page/。

摘要 (Abstract)

Recent progress in feed-forward 3D Gaussian Splatting (3DGS) has notably improved rendering quality. However, the spatially uniform and highly redundant 3DGS map generated by previous feed-forward 3DGS methods limits their integration into downstream reconstruction tasks. We propose SparseSplat, the first feed-forward 3DGS model that adaptively adjusts Gaussian density according to scene structure and information richness of local regions, yielding highly compact 3DGS maps. To achieve this, we propose entropy-based probabilistic sampling, generating large, sparse Gaussians in textureless areas and assigning small, dense Gaussians to regions with rich information. Additionally, we designed a specialized point cloud network that efficiently encodes local context and decodes it into 3DGS attributes, addressing the receptive field mismatch between the general 3DGS optimization pipeline and feed-forward models. Extensive experimental results demonstrate that SparseSplat can achieve state-of-the-art rendering quality with only 22% of the Gaussians and maintain reasonable rendering quality with only 1.5% of the Gaussians. Project page: https://victkk.github.io/SparseSplat-page/.

关键词: 3D Gaussian Splatting, feed-forward model, sparse Gaussians, entropy-based sampling, point cloud network, rendering quality, 3D reconstruction, adaptive density

148. ❌ Gram-MMD: A Texture-Aware Metric for Image Realism Assessment

作者: Joé Napolitano, Pascal Nguyen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03064v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的图像真实性评估，提出了一种基于Gram矩阵和MMD的纹理感知度量方法。论文内容涉及生成模型评估、特征提取、图像质量度量等，但完全不涉及大语言模型、深度学习技术原理创新、AI for Science等关键词所涵盖的领域。所有关键词均与大模型技术、科学AI应用、深度学习创新等主题相关，而本文是纯粹的计算机视觉图像评估研究，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为Gram-MMD的图像真实性评估度量方法，通过Gram矩阵捕捉预训练网络激活中的纹理特征，解决了现有语义级度量可能忽略细粒度纹理信息的问题，并在多个数据集上验证了其优于现有方法的性能。

摘要翻译

评估生成图像的真实性仍然是生成建模领域的核心挑战。现有分布度量指标如弗雷歇起始距离（FID）和CLIP最大均值差异（CMMD）在语义层面比较特征分布，但可能忽略对区分真实与生成图像至关重要的细粒度纹理信息。本文提出Gram矩阵最大均值差异（GMMD），这是一种通过预训练主干网络中间激活层计算的格拉姆矩阵来捕捉特征图间相关性的真实性度量方法。通过提取这些对称格拉姆矩阵的上三角部分，并测量真实图像锚点分布与评估分布之间的最大均值差异，GMMD生成的表征能以比全局嵌入更精细的粒度编码纹理与结构特征。为确定该度量的超参数，我们采用基于MS-COCO图像受控退化的元度量协议，通过斯皮尔曼等级相关系数与肯德尔τ系数测量单调性。我们在KADID-10k数据库和RAISE真实性评估数据集上使用多种主干架构进行实验，包括DINOv2、DC-AE、Stable Diffusion的VAE编码器、VGG19以及LPIPS中的AlexNet主干网络等。在跨域驾驶场景（KITTI/Virtual KITTI/Stanford Cars）的实验中，我们发现CMMD因其语义偏差可能错误地将真实图像排序在合成图像之后，而GMMD能保持正确的排序。结果表明，GMMD能够捕捉与现有语义层面度量指标互补的信息。

摘要 (Abstract)

Evaluating the realism of generated images remains a fundamental challenge in generative modeling. Existing distributional metrics such as the Frechet Inception Distance (FID) and CLIP-MMD (CMMD) compare feature distributions at a semantic level but may overlook fine-grained textural information that can be relevant for distinguishing real from generated images. We introduce Gram-MMD (GMMD), a realism metric that leverages Gram matrices computed from intermediate activations of pretrained backbone networks to capture correlations between feature maps. By extracting the upper-triangular part of these symmetric Gram matrices and measuring the Maximum Mean Discrepancy (MMD) between an anchor distribution of real images and an evaluation distribution, GMMD produces a representation that encodes textural and structural characteristics at a finer granularity than global embeddings. To select the hyperparameters of the metric, we employ a meta-metric protocol based on controlled degradations applied to MS-COCO images, measuring monotonicity via Spearman’s rank correlation and Kendall’s tau. We conduct experiments on both the KADID-10k database and the RAISE realness assessment dataset using various backbone architectures, including DINOv2, DC-AE, Stable Diffusion’s VAE encoder, VGG19, and the AlexNet backbone from LPIPS, among others. We also demonstrate on a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars) that CMMD can incorrectly rank real images as less realistic than synthetic ones due to its semantic bias, while GMMD preserves the correct ordering. Our results suggest that GMMD captures complementary information to existing semantic-level metrics.

关键词: image realism assessment, Gram matrices, Maximum Mean Discrepancy, texture-aware metric, generative modeling, feature distributions, pretrained backbone networks, semantic bias

149. ❌ Can Nano Banana 2 Replace Traditional Image Restoration Models? An Evaluation of Its Performance on Image Restoration Tasks

作者: Weixiong Sun, Xiang Yin, Chao Dong 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03061v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是图像生成模型Nano Banana 2在图像修复任务上的性能评估，属于计算机视觉和图像处理领域。所有给定的关键词均与大语言模型（LLM）及其相关技术、应用、优化方法相关，而论文完全不涉及文本、语言或大语言模型技术，因此所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文评估了通用图像生成模型Nano Banana 2在多种图像修复任务上的性能，发现通过精心设计的提示词，该模型在重建精度和感知质量上能达到与最先进修复模型相当甚至更优的效果，展现了通用生成模型作为统一图像修复工具的潜力。

摘要翻译

生成式人工智能的最新进展引发了一个问题：通用图像编辑模型能否作为图像复原的统一解决方案。本研究对Nano Banana 2在不同场景和退化类型下的图像复原能力进行了系统评估。结果表明，提示设计至关重要，其中包含明确保真度约束的简洁提示能在重建精度与感知质量间达到最佳平衡。与当前最先进的复原模型相比，Nano Banana 2在全参考指标上表现优异，同时在感知质量方面保持竞争力，用户研究进一步支持了这一结论。我们还观察到该模型在挑战性场景中（如小人脸、密集人群和严重退化）具有强大的泛化能力。然而，模型对提示表述仍较为敏感，可能需要迭代优化以获得最佳结果。总体而言，我们的研究表明通用生成模型作为统一图像复原工具具有巨大潜力，同时凸显了可控性与鲁棒性的重要性。所有测试结果已发布于https://github.com/yxyuanxiao/NanoBanana2TestOnIR。

摘要 (Abstract)

Recent advances in generative AI raise the question of whether general-purpose image editing models can serve as unified solutions for image restoration. In this work, we conduct a systematic evaluation of Nano Banana 2 for image restoration across diverse scenes and degradation types. Our results show that prompt design plays a critical role, where concise prompts with explicit fidelity constraints achieve the best trade-off between reconstruction accuracy and perceptual quality. Compared with state-of-the-art restoration models, Nano Banana 2 achieves superior performance in full-reference metrics while remaining competitive in perceptual quality, which is further supported by user studies. We also observe strong generalization in challenging scenarios, such as small faces, dense crowds, and severe degradations. However, the model remains sensitive to prompt formulation and may require iterative refinement for optimal results. Overall, our findings suggest that general-purpose generative models hold strong potential as unified image restoration solvers, while highlighting the importance of controllability and robustness. All test results are available on https://github.com/yxyuanxiao/NanoBanana2TestOnIR.

关键词: image restoration, generative AI, Nano Banana 2, prompt design, evaluation, perceptual quality, generalization, controllability

150. ❌ STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

作者: Linfeng Fan, Yuan Tian, Ziwei Li, Zhiwu Lu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03045v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	15.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文STEAR专注于视频大语言模型（Video-LLMs）中的幻觉缓解问题，核心贡献是提出了一种层感知的时空证据干预框架。因此，与’Large Language Models’高度相关（10分），因为Video-LLMs是LLMs在视频领域的应用；与’Hallucination Mitigation’高度相关（15分），这是论文的核心研究问题。与’Mechanistic Interpretability’有一定关联（5分），因为论文分析了不同解码层对视觉基础和语言组合的贡献，涉及模型内部工作机制的理解。其他关键词如MoE、SLMs、Scaling Laws、训练方法（Pre-training、SFT、RLHF等）、推理优化（KV Cache、Speculative Decoding）、智能体（Agents）、模型压缩（Quantization）以及科学AI应用等，在论文标题和摘要中均未提及或讨论，因此评分为0分。

!!! tip deepseek-chat TL;DR

论文STEAR针对视频大语言模型中的时空幻觉问题，提出了一种层感知的时空证据干预框架，通过识别高风险解码步骤并利用中间层的视觉证据进行干预，有效减少了幻觉并提高了模型的忠实性、时间一致性和鲁棒性。

摘要翻译

视频大语言模型（Video-LLMs）仍易产生时空幻觉，常生成缺乏视觉依据的细节或错误的时间关系。现有缓解方法通常将幻觉视为统一的解码失败，并应用全局共享的校正规则。我们则发现，解码器各层对视觉基础与后续语言组合的贡献存在差异，这表明干预必须具有层级感知性。基于这一观察，我们提出了STEAR——一种层级感知的时空证据干预框架。STEAR首先识别高风险解码步骤，并从对基础敏感的中间层中选取以标记为条件的视觉证据。该框架将这一共享证据用于两个耦合目标：在中间层恢复缺失的局部基础信息，并构建时间扰动的块级反事实样本，以在后期解码层中证伪不一致的推理过程。因此，STEAR能够在高效的单次编码推理框架内同时缓解空间与时间幻觉。在多个代表性Video-LLM主干模型及具有挑战性的基准测试上的实验表明，STEAR能持续减少幻觉，同时提升忠实度、时间一致性与鲁棒性。我们的结果证实，可靠的视频解码依赖于在正确的层级对精确证据进行干预，而非施加全局惩罚。代码已提供于补充材料中。

摘要 (Abstract)

Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.

关键词: Video Large Language Models, Hallucination Mitigation, Spatiotemporal Hallucinations, Layer-aware Intervention, Visual Grounding, Temporal Consistency, Decoder Layers, Single-encode Inference

151. ❌ QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection

作者: Lokman Bekit, Hamza Karim, Nghia T Nguyen, Yasin Yilmaz 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03040v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心是提出QVAD框架，使用LLM作为智能体与视觉语言模型(VLM)进行动态对话，实现免训练的视频异常检测。高度相关的关键词包括：LLMs（框架核心组件）、SLMs/On-device AI（强调轻量级模型部署在边缘设备）、LLM Agents（框架本质是智能体系统）。中等相关的关键词：Chain of Thought和System 2 Thinking（涉及迭代推理过程，但非主要技术）。其他关键词如MoE、Scaling Laws、各种训练方法、RAG、量化等均未涉及。

!!! tip deepseek-chat TL;DR

该论文提出QVAD框架，通过LLM智能体与视觉语言模型的动态对话机制，实现了免训练、高效率的视频异常检测，在多个数据集上达到先进性能并适用于资源受限的边缘设备。

摘要翻译

视频异常检测是计算机视觉领域的一项基础性挑战，其难点尤其在于异常本身的开集特性。尽管近期利用视觉语言模型的免训练方法展现出潜力，但它们通常依赖庞大且资源密集的基础模型，以弥补静态提示的模糊性。我们认为，视频异常检测的瓶颈未必在于模型容量，而在于查询方式的静态特性。为此，我们提出QVAD，一个以问题为中心的智能体框架，将视觉语言模型与大语言模型的交互视为动态对话。通过基于视觉上下文迭代优化查询，我们的大语言模型智能体能够引导较小的视觉语言模型生成高保真描述和精确的语义推理，而无需更新模型参数。这种“提示更新”机制有效释放了轻量级模型的潜在能力，使其在UCF-Crime、XD-Violence和UBNormal数据集上仅用竞争方法所需参数量的一小部分，即实现了最先进的性能。我们进一步在单场景ComplexVAD数据集上展示了卓越的泛化能力。至关重要的是，QVAD以极小的内存占用实现了高推理速度，使得先进的视频异常检测能力能够部署在资源受限的边缘设备上。

摘要 (Abstract)

Video Anomaly Detection (VAD) is a fundamental challenge in computer vision, particularly due to the open-set nature of anomalies. While recent training-free approaches utilizing Vision-Language Models (VLMs) have shown promise, they typically rely on massive, resource-intensive foundation models to compensate for the ambiguity of static prompts. We argue that the bottleneck in VAD is not necessarily model capacity, but rather the static nature of inquiry. We propose QVAD, a question-centric agentic framework that treats VLM-LLM interaction as a dynamic dialogue. By iteratively refining queries based on visual context, our LLM agent guides smaller VLMs to produce high-fidelity captions and precise semantic reasoning without parameter updates. This ``prompt-updating” mechanism effectively unlocks the latent capabilities of lightweight models, enabling state-of-the-art performance on UCF-Crime, XD-Violence, and UBNormal using a fraction of the parameters required by competing methods. We further demonstrate exceptional generalizability on the single-scene ComplexVAD dataset. Crucially, QVAD achieves high inference speeds with minimal memory footprints, making advanced VAD capabilities deployable on resource-constrained edge devices.

关键词: Video Anomaly Detection, Agentic Framework, Large Language Models, Vision-Language Models, Training-Free, Edge Devices, Dynamic Dialogue, Prompt-Updating

152. ❌ GenSmoke-GS: A Multi-Stage Method for Novel View Synthesis from Smoke-Degraded Images Using a Generative Model

作者: Qida Cao, Xinyuan Hu, Changyue Shi, Jiajun Ding, Zhou Yu, Jun Yu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03039v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究从烟雾退化图像进行新颖视图合成的计算机视觉任务，使用多阶段管道（包括图像恢复、去雾、MLLM增强、3DGS-MCMC优化等）。虽然摘要中提到使用’MLLM-based enhancement’（可能指多模态大语言模型），但论文核心是3D重建和图像处理，而非大模型技术原理或科学应用创新。所有关键词中，只有’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分），因为该任务属于计算机视觉在科学/工程应用（图像恢复和3D重建），但并非生物信息学或化学信息学等典型科学领域。其他关键词均与大模型技术、训练方法、推理优化、代理系统等无关，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种多阶段方法GenSmoke-GS，用于从烟雾退化图像中合成新颖视图，通过图像恢复、去雾、MLLM增强和3DGS-MCMC优化等步骤，在NTIRE 2026挑战赛中取得了排名第一的性能。

摘要翻译

本文阐述了我们在NTIRE 2026三维恢复与重建（3DRR）挑战赛赛道二（针对烟雾退化图像）中所采用的方法。在此任务中，烟雾会降低图像可见度，并削弱场景优化与渲染所需的跨视角一致性。我们通过一个多阶段流程来解决此问题，该流程包括图像恢复、去雾、基于多模态大语言模型（MLLM）的增强、3D高斯泼溅-马尔可夫链蒙特卡洛（3DGS-MCMC）优化以及对多次运行结果进行平均。该流程的主要目的是在渲染前提升可见性，同时限制输入视角间场景内容的变化。在挑战赛基准测试上的实验结果表明，与所提供基线方法相比，我们的方法在量化指标上有所提升，并获得了更好的视觉质量。代码发布于 https://github.com/plbbl/GenSmoke-GS。根据官方竞赛网站（https://www.codabench.org/competitions/13993/#/results-tab）公布的结果，我们的方法在NTIRE 3DRR挑战赛赛道二中，于14支参赛队伍中排名第一。

摘要 (Abstract)

This paper describes our method for Track 2 of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge on smoke-degraded images. In this task, smoke reduces image visibility and weakens the cross-view consistency required by scene optimization and rendering. We address this problem with a multi-stage pipeline consisting of image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, and averaging over repeated runs. The main purpose of the pipeline is to improve visibility before rendering while limiting scene-content changes across input views. Experimental results on the challenge benchmark show improved quantitative performance and better visual quality than the provided baselines. The code is available at https://github.com/plbbl/GenSmoke-GS. Our method achieved a ranking of 1 out of 14 participants in Track 2 of the NTIRE 3DRR Challenge, as reported on the official competition website: https://www.codabench.org/competitions/13993/#/results-tab.

关键词: novel view synthesis, smoke-degraded images, multi-stage pipeline, image restoration, dehazing, MLLM-based enhancement, 3DGS-MCMC optimization, 3D reconstruction

153. ❌ Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition

作者: Seoyeon Ko, Yeojin Song, Egene Chung, Luca Quagliato, Taeyong Lee, Junhyug Noh 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03002v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于骨架步态识别，提出了一种基于小波变换的时间-频率动态特征提取方法，与所有评分关键词（均围绕大模型、深度学习技术原理、AI科学应用等）完全无关。论文不涉及语言模型、训练技术、推理优化、对齐、代理系统、模型压缩、幻觉缓解、可解释性、世界模型、模型合并、上下文学习或AI科学应用等主题。

!!! tip deepseek-chat TL;DR

该论文针对骨架步态识别中运动动态特征利用不足的问题，提出了一种可插拔的小波特征流，通过连续小波变换提取关节速度的时间-频率动态特征，与主干网络融合后显著提升了识别性能，尤其在携带包和穿外套等协变量变化下表现突出。

摘要翻译

基于骨架的步态识别器擅长建模空间构型，但往往未能充分利用在表观变化下至关重要的显式运动动态。我们提出一种即插即用的小波特征流，通过关节速度的时频动态增强任意骨架主干网络。具体而言，每个关节的速度序列通过连续小波变换转换为多尺度尺度图，并由一个轻量级多尺度卷积神经网络从中学习判别性动态特征。生成的描述符与主干网络表征融合用于分类，无需改变主干架构或引入额外监督。在CASIA-B数据集上，该特征流在多个强骨架主干网络（如GaitMixer、GaitFormer、GaitGraph）上均带来稳定性能提升，当与GaitMixer结合时更建立了基于骨架方法的新最优结果。在携带包裹（BG）和穿着外套（CL）等协变量变化条件下改进尤为显著，这凸显了显式时频建模与标准时空编码器的互补性。

摘要 (Abstract)

Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.

关键词: skeleton-based gait recognition, time-frequency dynamics, continuous wavelet transform, joint velocities, multi-scale scalograms, plug-and-play stream, covariate shifts, CASIA-B dataset

154. ❌ Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

作者: Weiquan Wang, Jun Xiao, Feifei Shao, Yi Yang, Yueting Zhuang, Long Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02996v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉领域的3D场景重建技术，使用3D高斯泼溅方法解决多人物多物体动态场景渲染问题。论文内容完全不涉及大语言模型、深度学习技术原理、AI for Science等关键词领域，所有关键词均与论文研究内容无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于3D高斯泼溅的层次化框架MM-GS，解决了从稀疏视图输入中重建多人物多物体动态场景的挑战，实现了高保真度的渲染效果和合理的实例间交互。

摘要翻译

从稀疏视角输入中重建包含多个交互人体与物体的动态场景，是一项关键而富有挑战性的任务，对于为机器人及VR/AR创建高保真数字孪生至关重要。这一被我们称为“多人体-多物体”（Multi-Human Multi-Object, MHMO）渲染的问题，面临两大核心障碍：一是在严重相互遮挡条件下为每个独立实例实现视角一致的表征；二是对因交互而产生的复杂组合式依赖关系进行显式建模。为克服这些挑战，我们提出了MM-GS——一个基于3D高斯泼溅（3D Gaussian Splatting）构建的新型分层框架。我们的方法首先通过“单实例多视图融合模块”，聚合所有可用视角的视觉信息，为每个实例建立鲁棒且一致的表征。随后，“场景级实例交互模块”在全局场景图（scene graph）上运作，推理所有参与者之间的关系，并优化其属性以捕捉细微的交互效应。在多个高难度数据集上的大量实验表明，我们的方法显著优于现有强基线模型，能够生成具有高保真细节和合理实例间接触状态的先进结果。

摘要 (Abstract)

Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.

关键词: 3D Gaussian Splatting, Multi-Human Multi-Object rendering, dynamic scene reconstruction, sparse-view inputs, digital twins, view-consistent representation, instance interaction, scene graph

155. ❌ Not All Frames Deserve Full Computation: Accelerating Autoregressive Video Generation via Selective Computation and Predictive Extrapolation

作者: Hanshuai Cui, Zhiqing Tang, Zhi Yao, Fanshuai Meng, Weijia Jia, Wei Zhao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02979v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于视频生成模型的推理加速技术，与绝大多数关键词无关。仅与’Speculative Decoding OR Inference Acceleration’有一定关联（5分），因为论文提出了SCOPE框架来加速自回归视频扩散模型的推理过程，属于推理加速范畴，但并非专门针对大语言模型的推测解码技术。

!!! tip deepseek-chat TL;DR

该论文针对自回归视频扩散模型推理效率低的问题，提出了SCOPE训练免费框架，通过选择性计算和预测外推实现了高达4.73倍的加速，同时保持输出质量。

摘要翻译

自回归（AR）视频扩散模型能够生成长视频，但由于需要重复的多步去噪过程，其计算成本仍然高昂。现有的免训练加速方法依赖于二元的缓存或重计算决策，忽略了中间情况：直接复用过于粗略，而完全重计算又非必要。此外，异步AR调度机制为共同生成的帧分配了不同的噪声水平，而现有方法却对整个有效区间进行统一处理。针对这些AR模型特有的效率问题，我们提出了SCOPE，一个用于高效AR视频扩散的免训练框架。SCOPE引入了一种涵盖缓存、预测和重计算的三模态调度器，其中通过噪声水平泰勒外推进行的预测，在误差传播分析支持下的显式稳定性控制下，填补了复用与重计算之间的空白。该框架进一步引入了选择性计算机制，将执行范围限制在活跃帧区间内。在MAGI-1和SkyReels-V2数据集上，SCOPE在保持与原始输出相当质量的同时，实现了高达4.73倍的加速，性能优于所有免训练的基线方法。

摘要 (Abstract)

Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.

关键词: Autoregressive Video Generation, Video Diffusion Models, Inference Acceleration, Selective Computation, Predictive Extrapolation, Training-free Framework, SCOPE, Noise-level Taylor Extrapolation

156. ❌ Exploring Motion-Language Alignment for Text-driven Motion Generation

作者: Ruxi Gu, Zilei Wang, Wei Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02973v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究文本驱动的人体运动生成，聚焦于运动-语言对齐问题，提出MLA-Gen框架并解决注意力集中问题。虽然涉及对齐概念，但这是特定于运动生成任务的语义对齐，而非大模型的价值对齐或指令调优。论文未使用大模型、深度学习技术原理创新或科学AI应用，与绝大多数关键词无关。仅与’Alignment’有弱关联（5分），因为都涉及语义对齐，但具体领域和技术不同。

!!! tip deepseek-chat TL;DR

该论文解决了文本驱动人体运动生成中运动-语言对齐的挑战，提出了MLA-Gen框架和注意力控制策略，显著提升了运动质量和语义对齐效果。

摘要翻译

文本驱动的人体运动生成旨在合成符合文本描述的真实运动序列。尽管近期研究取得了进展，但如何精确地将运动动态与文本语义对齐仍然是一个根本性挑战。本文从运动-语言对齐的视角重新审视文本到运动的生成问题，提出了MLA-Gen框架，该框架将全局运动先验与细粒度局部条件相结合。这一设计使模型能够捕捉常见的运动模式，同时在文本与运动之间建立精细的对齐关系。此外，我们发现人体运动生成中存在一个先前被忽视的注意力沉没现象，即注意力过度集中于起始文本标记，限制了信息性文本线索的利用，导致语义基础弱化。为分析此问题，我们引入了SinkRatio这一用于衡量注意力集中程度的指标，并开发了对齐感知掩码与控制策略，以在生成过程中调节注意力分布。大量实验表明，相较于现有强基线模型，我们的方法在运动质量与运动-语言对齐方面均取得了持续提升。代码将在论文录用后公开。

摘要 (Abstract)

Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.

关键词: text-driven motion generation, motion-language alignment, attention sink, SinkRatio, semantic grounding, human motion synthesis, global motion priors, local conditioning

157. ❌ Effect of Input Resolution on Retinal Vessel Segmentation Performance: An Empirical Study Across Five Datasets

作者: Amarnath R 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02977v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究视网膜血管分割中图像分辨率对薄血管检测性能的影响，使用UNet在五个数据集上进行实验，并提出了宽度分层敏感性指标。论文属于医学图像分析领域，与大多数大模型和深度学习技术原理创新关键词完全无关。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为该研究涉及生物医学图像分析（视网膜图像），属于AI在科学领域的应用，但并非核心创新或大模型相关，因此给予5分（有一定关联）。其他关键词均未涉及大模型、MoE、缩放定律、训练技术、推理优化、智能体、量化等主题，全部评分为0分。

!!! tip deepseek-chat TL;DR

该论文研究了视网膜血管分割中图像下采样对薄血管检测的影响，发现标准Dice指标无法捕捉薄血管信息损失，并提出了宽度分层敏感性指标来评估不同宽度血管的检测性能。

摘要翻译

大多数用于视网膜血管分割的深度学习流程会调整眼底图像的尺寸，以满足GPU内存限制并实现统一的批量处理。然而，这种尺寸调整对细小血管检测的影响仍未得到充分探究。当高分辨率图像被下采样时，细小血管会缩减至亚像素结构，导致在数据进入网络之前就发生不可逆的信息丢失。标准的体积度量指标（如Dice分数）无法捕捉这种损失，因为评估过程主要由粗大血管像素主导。我们通过使用五个眼底数据集（DRIVE、STARE、CHASE_DB1、HRF和FIVES，其原始宽度范围从565到3504像素）在多种下采样比例下训练一个基线UNet来研究这一影响，同时保持所有其他设置不变。我们引入了一种宽度分层敏感性度量，该度量使用基于欧几里得距离变换（Euclidean distance transform）获得的原始分辨率宽度估计，分别评估细小（半宽<3像素）、中等（3至7像素）和粗大（>7像素）血管的检测性能。结果表明，对于高分辨率数据集（HRF、FIVES），当图像下采样至编码器的有效操作范围（处理宽度在256至876像素之间）时，细小血管敏感性单调提升并达到峰值。对于中低分辨率数据集（DRIVE、STARE、CHASE_DB1），细小血管敏感性在原始分辨率或接近原始分辨率时最高，任何下采样都会导致其下降。在所有五个数据集中，激进的下采样使细小血管敏感性降低了高达15.8个百分点（DRIVE），而Dice分数保持相对稳定，这证实了仅靠Dice分数不足以评估微血管分割性能。

摘要 (Abstract)

Most deep learning pipelines for retinal vessel segmentation resize fundus images to satisfy GPU memory constraints and enable uniform batch processing. However, the impact of this resizing on thin vessel detection remains underexplored. When high resolution images are downsampled, thin vessels are reduced to subpixel structures, causing irreversible information loss even before the data enters the network. Standard volumetric metrics such as the Dice score do not capture this loss because thick vessel pixels dominate the evaluation. We investigated this effect by training a baseline UNet at multiple downsampling ratios across five fundus datasets (DRIVE, STARE, CHASE_DB1, HRF, and FIVES) with native widths ranging from 565 to 3504 pixels, keeping all other settings fixed. We introduce a width-stratified sensitivity metric that evaluates thin (half-width <3 pixels), medium (3 to 7 pixels), and thick (>7 pixels) vessel detection separately, using native resolution width estimates derived from a Euclidean distance transform. Results show that for high-resolution datasets (HRF, FIVES), thin vessel sensitivity improves monotonically as images are downsampled toward the encoder’s effective operating range, peaking at processed widths between 256 and 876 pixels. For low-to-mid resolution datasets (DRIVE, STARE, CHASE_DB1), thin vessel sensitivity is highest at or near native resolution and degrades with any downsampling. Across all five datasets, aggressive downsampling reduced thin vessel sensitivity by up to 15.8 percentage points (DRIVE) while Dice remained relatively stable, confirming that Dice alone is insufficient for evaluating microvascular segmentation.

关键词: retinal vessel segmentation, image resolution, downsampling, thin vessel detection, UNet, width-stratified sensitivity, fundus images, medical image analysis

158. ❌ Visual Prototype Conditioned Focal Region Generation for UAV-Based Object Detection

作者: Wenhao Li, Zimeng Wu, Yu Wu, Zehua Fu, Jiaxin Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02966v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于无人机目标检测的扩散模型图像生成方法（UAVGen），涉及视觉原型条件扩散模型和焦点区域增强数据管道。所有关键词均与大语言模型（LLM）技术、训练方法、推理优化、代理系统等直接相关，而本文研究的是计算机视觉中的扩散模型应用，与LLM无关。唯一略有相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为无人机目标检测可视为AI在科学或工程领域的应用，但并非核心，故给5分。其他关键词完全无关，给0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为UAVGen的新型布局到图像生成框架，通过视觉原型条件扩散模型和焦点区域增强数据管道，解决了无人机目标检测中合成图像存在伪影的问题，显著提升了检测精度。

摘要翻译

基于无人机（UAV）的目标检测是一项关键但具有挑战性的任务，尤其是在标注训练数据有限且场景动态变化的实际应用中。基于扩散模型的布局到图像生成方法已被证明能通过合成带标签的图像有效提升检测精度。然而，这些方法常产生伪影，尤其是在微小目标的布局边界附近，从而严重限制了其性能。为解决这些问题，我们提出了UAVGen，一种专为无人机目标检测设计的新型布局到图像生成框架。具体而言，UAVGen设计了一种视觉原型条件扩散模型（Visual Prototype Conditioned Diffusion Model, VPC-DM），该模型为每个类别构建具有代表性的实例，并将其整合到潜在嵌入中以实现高保真度的目标生成。此外，我们引入了焦点区域增强数据管道（Focal Region Enhanced Data Pipeline, FRE-DP），以在合成过程中强调目标集中的前景区域，并结合标签细化来修正生成结果中缺失、多余或未对齐的部分。大量实验结果表明，我们的方法显著优于现有先进技术，并且在集成到不同检测器中时能持续提升检测精度。源代码发布于https://github.com/Sirius-Li/UAVGen。

摘要 (Abstract)

Unmanned aerial vehicle (UAV) based object detection is a critical but challenging task, when applied in dynamically changing scenarios with limited annotated training data. Layout-to-image generation approaches have proved effective in promoting detection accuracy by synthesizing labeled images based on diffusion models. However, they suffer from frequently producing artifacts, especially near layout boundaries of tiny objects, thus substantially limiting their performance. To address these issues, we propose UAVGen, a novel layout-to-image generation framework tailored for UAV-based object detection. Specifically, UAVGen designs a Visual Prototype Conditioned Diffusion Model (VPC-DM) that constructs representative instances for each class and integrates them into latent embeddings for high-fidelity object generation. Moreover, a Focal Region Enhanced Data Pipeline (FRE-DP) is introduced to emphasize object-concentrated foreground regions in synthesis, combined with a label refinement to correct missing, extra and misaligned generations. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art approaches, and consistently promotes accuracy when integrated with distinct detectors. The source code is available at https://github.com/Sirius-Li/UAVGen.

关键词: UAV-based object detection, layout-to-image generation, diffusion models, visual prototype conditioned, focal region enhancement, data synthesis, object detection, UAVGen

作者: Zelin Zhang, Kedi Li, Huiqi Liang, Tao Zhang, Chuanzhi Xu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02948v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation》专注于计算机视觉领域的多模态语义分割，提出了一种新的跨模态融合框架。虽然属于深度学习应用，但研究内容与所有评分关键词（均围绕大语言模型技术、训练方法、推理优化、对齐、代理系统等）完全无关。论文未涉及任何语言模型、大模型技术原理、科学AI应用或评分关键词中的具体技术。

!!! tip deepseek-chat TL;DR

该论文针对多模态语义分割中现有融合方法灵活性不足、跨模态协调效果有限的问题，提出了CrossWeaver框架，通过模态交互块和轻量级融合模块实现了最先进的性能，并具有良好的泛化能力。

摘要翻译

多模态语义分割在利用不同感知模态间的互补信息方面展现出巨大潜力。然而，现有方法通常依赖于精心设计的融合策略，这些策略要么采用模态特定的适配机制，要么依赖松散耦合的交互方式，从而限制了灵活性并导致跨模态协调效果欠佳。此外，这些方法往往难以在不同模态组合间平衡高效信息交换与保持各模态独特特性之间的关系。为应对这些挑战，我们提出CrossWeaver——一种简单而有效的任意模态语义分割多模态融合框架。其核心是模态交互块（Modality Interaction Block, MIB），该模块在编码器内实现选择性且基于可靠性的跨模态交互，同时轻量化的无缝对齐融合（Seam-Aligned Fusion, SAF）模块进一步聚合增强后的特征。在多个多模态语义分割基准数据集上的大量实验表明，我们的框架以最少的附加参数实现了最先进的性能，并对未见过的模态组合展现出强大的泛化能力。

摘要 (Abstract)

Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either use modality-specific adaptations or rely on loosely coupled interactions, thereby limiting flexibility and resulting in less effective cross-modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state-of-the-art performance with minimal additional parameters and strong generalization to unseen modality combinations.

关键词: multimodal semantic segmentation, cross-modal fusion, arbitrary-modality, modality interaction, feature aggregation, generalization, state-of-the-art performance, lightweight framework

160. ❌ Collaborative Multi-Mode Pruning for Vision-Language Models

作者: Zimeng Wu, Yunhong Wang, Donghao Wang, Jiaxin Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02956v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视觉语言模型（VLMs）的剪枝压缩技术，提出了一种协同多模式剪枝框架（CoMP）。论文核心是模型压缩技术，与"Quantization OR Model Compression OR Low-bit Weights"关键词高度相关（8分），因为剪枝是模型压缩的重要方法之一。然而，论文研究的是视觉语言模型而非大语言模型（LLMs），且未涉及其他关键词所描述的大模型技术原理（如MoE、Scaling Laws、预训练、对齐、推理加速等）或科学应用领域，因此其他关键词均得0分。

!!! tip deepseek-chat TL;DR

该论文针对视觉语言模型在资源受限设备上部署的挑战，提出了一种协同多模式剪枝框架（CoMP），通过联合参数和令牌剪枝以及自适应策略，在高剪枝率下有效提升了模型性能。

摘要翻译

视觉语言模型（Vision-Language Models, VLMs）在统一的Transformer架构下发展迅速，但由于其高计算复杂度，在资源受限设备上的部署仍面临挑战。剪枝作为一种有效的VLM压缩技术已得到应用，然而现有方法主要集中于单一模式，即仅对参数或令牌进行剪枝，未能充分挖掘各模式内部的冗余性，导致在高剪枝率下性能显著下降。为克服上述局限，本文提出协同多模式剪枝（Collaborative Multi-Mode Pruning, CoMP），这是一个专为VLM设计的、联合执行参数与令牌剪枝的新框架。具体而言，我们首先设计了一种协同重要性度量（Collaborative Importance Metric, CIM），用于探究参数与令牌之间的相互干扰。该度量将令牌的差异性重要性纳入参数重要性分数的计算，同时减轻已剪枝参数对令牌重要性评分的影响。此外，我们开发了一种多模式剪枝策略（Multi-Mode Pruning Strategy, MPS），将整体剪枝过程分解为一系列剪枝阶段，并在每个阶段根据各剪枝模式的剪枝成本估算其优先级，自适应地切换至最优模式。同时，MPS融合历史成本与随机探索机制，以实现稳定的剪枝过程并避免陷入局部最优。通过在多种视觉语言任务和模型上进行广泛实验，结果表明，与现有先进方法相比，本方法在高剪枝率下有效提升了模型性能。源代码发布于https://github.com/Wuzimeng/CoMP.git。

摘要 (Abstract)

Vision-Language Models (VLMs) have advanced rapidly within the unified Transformer architecture, yet their deployment on resource-constrained devices remains challenging due to high computational complexity. While pruning has emerged as an effective technique for compressing VLMs, existing approaches predominantly focus on a single mode by pruning either parameters or tokens, neglecting fully exploring the inherent redundancy in each mode, which leads to substantial performance degradation at high pruning ratios. To address the above limitations, we propose Collaborative Multi-Mode Pruning (CoMP), a novel framework tailored for VLMs by performing joint parameter and token pruning. Specifically, we first design a Collaborative Importance Metric (CIM) that investigates the mutual interference between the coupled parameters and tokens. It incorporates distinct significance of tokens into the computation of parameter importance scores, while simultaneously mitigating the affect of pruned parameters on token importance scores. Moreover, we develop a Multi-Mode Pruning Strategy (MPS) that decomposes the overall pruning process into a sequence of pruning stages, while in each stage we estimate the priory of different pruning modes based on their pruning cost and adaptively shift to the optimal one. Additionally, MPS integrates the historical cost and random exploration, in order to achieve a stable pruning process and avoid local optimum. Extensive experiments across various vision-language tasks and models demonstrate that our method effectively promotes the performance under high pruning ratios by comparing to the state-of-the-art approaches. The source code is available at https://github.com/Wuzimeng/CoMP.git.

关键词: Vision-Language Models, Model Pruning, Parameter Pruning, Token Pruning, Collaborative Multi-Mode Pruning, Model Compression, Resource-Constrained Devices, Transformer Architecture

161. ❌ MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion

作者: Bin Liu, Zhixiang Xiong, Zhifen He, Bo Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02941v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文MMTalker专注于3D音频驱动面部动画合成，属于计算机视觉和图形学领域，研究内容包括网格参数化、可微分采样、图卷积网络和多模态特征融合。所有评分关键词均与大模型、深度学习技术原理或AI在科学领域的应用相关，而该论文未涉及任何大模型、语言模型、模型训练技术、推理优化、AI代理或科学AI应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为MMTalker的新方法，通过多分辨率表示和多模态特征融合来解决语音驱动3D面部动画合成中唇部同步精度和面部表情真实性的挑战，实验表明其在唇部和眼部运动同步准确性方面显著优于现有方法。

摘要翻译

语音驱动的三维面部动画合成旨在构建从一维语音信号到时变三维面部运动信号的映射关系。现有方法在保持口型同步精度与生成逼真面部表情方面仍面临挑战，这主要源于该跨模态映射的高度不适定性。本文提出一种通过多分辨率表征与多模态特征融合实现三维音频驱动面部动画合成的新方法——MMTalker，该方法能够精确重建三维面部运动的丰富细节。
我们首先通过网格参数化与非均匀可微分采样实现细节丰富的三维面部连续表征。网格参数化技术建立了UV平面与三维面部网格的对应关系，并为连续学习提供真实值基准。可微分非均匀采样通过在每个三角面上设置可学习的采样概率，实现了精确的面部细节捕捉。随后，我们采用残差图卷积网络与双重交叉注意力机制，从多输入模态中提取具有判别力的面部运动特征。所提出的多模态融合策略充分利用了语音的层次化特征与面部网格的显式时空几何特征。最后，通过联合处理规范UV空间中的采样点与编码后的面部运动特征，轻量级回归网络预测出合成说话人脸的逐顶点几何位移。
综合实验表明，本方法在多项指标上显著优于现有先进技术，尤其在唇部与眼部运动的同步精度方面表现突出。

摘要 (Abstract)

Speech-driven three-dimensional (3D) facial animation synthesis aims to build a mapping from one-dimensional (1D) speech signals to time-varying 3D facial motion signals. Current methods still face challenges in maintaining lip-sync accuracy and producing realistic facial expressions, primarily due to the highly ill-posed nature of this cross-modal mapping. In this paper, we introduce a novel 3D audio-driven facial animation synthesis method through multi-resolution representation and multi-modal feature fusion, called MMTalker which can accurately reconstruct the rich details of 3D facial motion. We first achieve the continuous representation of 3D face with details by mesh parameterization and non-uniform differentiable sampling. The mesh parameterization technique establishes the correspondence between UV plane and 3D facial mesh and is used to offer ground truth for the continuous learning. Differentiable non-uniform sampling enables precise facial detail acquisition by setting learnable sampling probability in each triangular face. Next, we employ residual graph convolutional network and dual cross-attention mechanism to extract discriminative facial motion feature from multiple input modalities. This proposed multimodal fusion strategy takes full use of the hierarchical features of speech and the explicit spatiotemporal geometric features of facial mesh. Finally, a lightweight regression network predicts the vertex-wise geometric displacements of the synthesized talking face by jointly processing the sampled points in the canonical UV space and the encoded facial motion features. Comprehensive experiments demonstrate that significant improvements are achieved over state-of-the-art methods, especially in the synchronization accuracy of lip and eye movements.

关键词: 3D talking head synthesis, speech-driven facial animation, multimodal feature fusion, mesh parameterization, graph convolutional network, lip-sync accuracy, facial motion reconstruction, cross-modal mapping

162. ❌ Modality-Specific Hierarchical Enhancement for RGB-D Camouflaged Object Detection

作者: Yuzhen Niu, Yangqing Wang, Ri Cheng, Fusheng Li, Rongshen Wang, Zhichen Yang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02935v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于计算机视觉领域的RGB-D伪装目标检测，提出了一种基于模态特定层次增强和自适应融合的深度学习框架。所有评分关键词均与大语言模型、模型训练优化、推理加速、对齐技术、智能体系统等大模型核心技术相关，而本论文研究的是传统的视觉检测任务，未涉及任何大模型技术或科学AI应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对RGB-D伪装目标检测中模态特征利用不足的问题，提出了MHENet框架，通过纹理和几何层次增强模块以及自适应动态融合模块，在四个基准测试中超越了16种现有方法。

摘要翻译

伪装目标检测（COD）因目标与背景高度相似而极具挑战性，现有方法主要通过互补利用RGB-D的纹理与几何线索来解决此问题。然而，当前RGB-D COD方法仍未能充分利用模态特异性线索，这限制了融合质量。我们认为，这是由于RGB与深度特征在骨干网络提取后直接进行融合，缺乏模态特异性增强所致。为突破此局限，我们提出MHENet——一个通过模态特异性层级增强与自适应融合来处理RGB与深度特征的RGB-D COD框架。具体而言，我们设计了纹理层级增强模块（Texture Hierarchical Enhancement Module, THEM），通过提取高频信息来放大细微的纹理变化；同时构建几何层级增强模块（Geometry Hierarchical Enhancement Module, GHEM），借助可学习的梯度提取增强几何结构，并保持跨尺度语义一致性。最后，自适应动态融合模块（Adaptive Dynamic Fusion Module, ADFM）以空间动态权重自适应融合增强后的纹理与几何特征。在四个基准数据集上的实验表明，MHENet在定性与定量评估上均优于16种前沿方法。代码公开于https://github.com/afdsgh/MHENet。

摘要 (Abstract)

Camouflaged object detection (COD) is challenging due to high target-background similarity, and recent methods address this by complementarily using RGB-D texture and geometry cues. However, RGB-D COD methods still underutilize modality-specific cues, which limits fusion quality. We believe this is because RGB and depth features are fused directly after backbone extraction without modality-specific enhancement. To address this limitation, we propose MHENet, an RGB-D COD framework that performs modality-specific hierarchical enhancement and adaptive fusion of RGB and depth features. Specifically, we introduce a Texture Hierarchical Enhancement Module (THEM) to amplify subtle texture variations by extracting high-frequency information and a Geometry Hierarchical Enhancement Module (GHEM) to enhance geometric structures via learnable gradient extraction, while preserving cross-scale semantic consistency. Finally, an Adaptive Dynamic Fusion Module (ADFM) adaptively fuses the enhanced texture and geometry features with spatially varying weights. Experiments on four benchmarks demonstrate that MHENet surpasses 16 state-of-the-art methods qualitatively and quantitatively. Code is available at https://github.com/afdsgh/MHENet.

关键词: Camouflaged Object Detection, RGB-D, Modality-specific Enhancement, Hierarchical Enhancement, Adaptive Fusion, Texture Enhancement, Geometry Enhancement, Deep Learning

163. ❌ PolyReal: A Benchmark for Real-World Polymer Science Workflows

作者: Wanhao Liu, Weida Wang, Jiaqing Xie, Suorong Yang, Jue Wang, Benteng Chen, Guangtao Mei, Zonglin Yang, Shufei Zhang, Yuchun Mo, Lang Cheng, Jin Zeng, Houqiang Li, Wanli Ouyang, Yuqiang Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02934v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心研究多模态大语言模型（MLLMs）在聚合物科学领域的评估基准，因此与’Large Language Models’高度相关（10分）。论文属于AI for Science应用领域，与’AI for Science OR Bioinformatics OR Cheminformatics’高度相关（10分）。论文涉及实验机制推理等任务，与推理相关关键词’Chain of Thought’和’System 2 Thinking’有一定关联（各5分）。其他关键词如MoE、量化、对齐等未在摘要中提及，评0分。

!!! tip deepseek-chat TL;DR

该论文针对多模态大语言模型在真实世界聚合物科学工作流中评估不足的问题，提出了PolyReal基准，发现模型在知识推理任务上表现良好，但在实践性任务上存在显著差距。

摘要翻译

多模态大语言模型（MLLMs）在通用领域表现出色，但在处理复杂、真实的科学问题时仍面临挑战。我们认为，聚合物科学作为一个横跨化学、物理、生物和工程学的交叉学科领域，因其多样化的多模态数据，是理想的高风险测试平台。然而，现有与聚合物科学相关的基准测试大多忽视了真实世界的工作流程，限制了其实用性，且未能系统性地评估MLLMs在基于实践的完整实验生命周期中的表现。为此，我们提出了PolyReal——一个基于真实科学实践的新型多模态基准测试，用于评估MLLMs在聚合物实验全生命周期中的能力。它涵盖五项关键能力：（1）基础知识应用；（2）实验室安全分析；（3）实验机理推理；（4）原始数据提取；以及（5）性能与应用探索。我们在PolyReal上对主流MLLMs的评估揭示了能力失衡现象：模型在知识密集型推理任务（如实验机理推理）上表现良好，但在基于实践的任务（如实验室安全分析和原始数据提取）上表现大幅下降。这暴露了抽象科学知识与其实践性、情境依赖性应用之间的严重脱节，表明这些真实世界任务对MLLMs而言仍具挑战性。因此，PolyReal有助于弥补这一评估缺口，并为评估人工智能系统在真实科学工作流程中的表现提供了一个实用的基准。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.

关键词: Multimodal Large Language Models, Polymer Science, Benchmark, Real-world Workflows, Scientific Evaluation, Capability Imbalance, Practice-based Tasks, Knowledge Application

164. ❌ BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving

作者: Miguel Antunes-García, Santiago Montiel-Marín, Fabio Sánchez-García, Rodrigo Gutiérrez-Moreno, Rafael Barea, Luis M. Bergasa 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02930v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于自动驾驶中的BEV实例预测，使用基于注意力的时空处理和3D投影技术，属于计算机视觉和自动驾驶领域。所有评分关键词均与大语言模型、深度学习技术原理或AI科学应用直接相关，而本文不涉及任何大模型技术、语言模型、训练方法、推理技术或AI科学应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

论文提出BEVPredFormer，一种基于注意力机制的相机BEV实例预测架构，用于自动驾驶场景理解，在nuScenes数据集上达到或超越了最先进方法的性能。

摘要翻译

自动驾驶系统必须具备对动态场景演变的鲁棒感知能力，以准确检测、跟踪并预测周围障碍物的行为。依赖模块化架构的传统感知流程往往存在累积误差与延迟问题。实例预测模型提供了一种统一的解决方案，能够利用直接从不同传感器获取的信息，在鸟瞰图视角下对当前及未来帧进行分割与运动估计。然而，这些模型面临的一个关键挑战在于如何有效处理动态驾驶环境中固有的密集时空信息。这种复杂性要求架构能够在不影响实时性能的前提下，捕捉细粒度的运动模式与长程依赖关系。本文提出BEVPredFormer，一种新颖的纯摄像头鸟瞰图实例预测架构。该架构采用基于注意力的时序处理机制来提升对场景的时空理解能力，并依赖于基于注意力的摄像头信息三维投影。BEVPredFormer采用无循环设计，融合了门控变换器层、分离的时空注意力机制以及多尺度头部任务。此外，我们引入了差分引导特征提取模块以增强时序表征能力。大量消融实验验证了各架构组件的有效性。在nuScenes数据集上的评估表明，BEVPredFormer与现有最优方法性能相当或更优，凸显了其在实现鲁棒高效自动驾驶感知方面的潜力。

摘要 (Abstract)

A robust awareness of how dynamic scenes evolve is essential for Autonomous Driving systems, as they must accurately detect, track, and predict the behaviour of surrounding obstacles. Traditional perception pipelines that rely on modular architectures tend to suffer from cumulative errors and latency. Instance Prediction models provide a unified solution, performing Bird’s-Eye-View segmentation and motion estimation across current and future frames using information directly obtained from different sensors. However, a key challenge in these models lies in the effective processing of the dense spatial and temporal information inherent in dynamic driving environments. This level of complexity demands architectures capable of capturing fine-grained motion patterns and long-range dependencies without compromising real-time performance. We introduce BEVPredFormer, a novel camera-only architecture for BEV instance prediction that uses attention-based temporal processing to improve temporal and spatial comprehension of the scene and relies on an attention-based 3D projection of the camera information. BEVPredFormer employs a recurrent-free design that incorporates gated transformer layers, divided spatio-temporal attention mechanisms, and multi-scale head tasks. Additionally, we incorporate a difference-guided feature extraction module that enhances temporal representations. Extensive ablation studies validate the effectiveness of each architectural component. When evaluated on the nuScenes dataset, BEVPredFormer was on par or surpassed State-Of-The-Art methods, highlighting its potential for robust and efficient Autonomous Driving perception.

关键词: BEV instance prediction, autonomous driving, spatio-temporal attention, camera-only architecture, gated transformer layers, difference-guided feature extraction, nuScenes dataset, real-time performance

165. ❌ GP-4DGS: Probabilistic 4D Gaussian Splatting from Monocular Video via Variational Gaussian Processes

作者: Mijeong Kim, Jungtaek Kim, Bohyung Han 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02915v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉和神经图形学领域，提出了一种结合高斯过程和4D高斯泼溅的动态场景概率建模方法。论文内容涉及动态场景重建、不确定性量化、运动估计和时序外推，但完全不涉及大语言模型、深度学习技术原理或AI在科学领域的应用。所有关键词均与大语言模型、深度学习技术或AI科学应用相关，与该论文的计算机视觉/图形学主题完全无关。

!!! tip deepseek-chat TL;DR

该论文提出了GP-4DGS框架，通过将高斯过程集成到4D高斯泼溅中，解决了动态场景概率建模中运动模糊性捕获和预测可靠性评估的问题，实现了不确定性量化、未观测区域运动估计和时序外推能力。

摘要翻译

我们提出GP-4DGS，这是一个将高斯过程（Gaussian Processes，GPs）融入4D高斯泼溅（4D Gaussian Splatting，4DGS）的创新框架，旨在实现对动态场景的规范化概率建模。现有4DGS方法主要关注确定性重建，但其本质上难以捕捉运动模糊性，且缺乏评估预测可靠性的机制。通过利用高斯过程基于核函数的概率特性，我们的方法引入了三项关键能力：（i）运动预测的不确定性量化，（ii）对未观测或稀疏采样区域进行运动估计，以及（iii）在已观测训练帧之外进行时间外推。为将高斯过程扩展到4DGS中大量高斯基元，我们设计了捕捉形变场关联结构的时空核函数，并采用带诱导点的变分高斯过程以实现可处理的推理。实验表明，GP-4DGS在提升重建质量的同时，能提供可靠的不确定性估计，有效识别高运动模糊性区域。通过应对这些挑战，我们的工作朝着连接概率建模与神经图形学迈出了重要一步。

摘要 (Abstract)

We present GP-4DGS, a novel framework that integrates Gaussian Processes (GPs) into 4D Gaussian Splatting (4DGS) for principled probabilistic modeling of dynamic scenes. While existing 4DGS methods focus on deterministic reconstruction, they are inherently limited in capturing motion ambiguity and lack mechanisms to assess prediction reliability. By leveraging the kernel-based probabilistic nature of GPs, our approach introduces three key capabilities: (i) uncertainty quantification for motion predictions, (ii) motion estimation for unobserved or sparsely sampled regions, and (iii) temporal extrapolation beyond observed training frames. To scale GPs to the large number of Gaussian primitives in 4DGS, we design spatio-temporal kernels that capture the correlation structure of deformation fields and adopt variational Gaussian Processes with inducing points for tractable inference. Our experiments show that GP-4DGS enhances reconstruction quality while providing reliable uncertainty estimates that effectively identify regions of high motion ambiguity. By addressing these challenges, our work takes a meaningful step toward bridging probabilistic modeling and neural graphics.

关键词: Gaussian Processes, 4D Gaussian Splatting, probabilistic modeling, dynamic scenes, uncertainty quantification, motion estimation, temporal extrapolation, variational inference

166. ❌ SentiAvatar: Towards Expressive and Interactive Digital Humans

作者: Chuhao Jin, Rui Zhang, Qingzhe Gao, Haoyu Shi, Dayu Wu, Yichen Jiang, Yihan Wu, Ruihua Song 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02908v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文SentiAvatar专注于构建表达性交互式3D数字人，核心是运动生成、音频-运动同步和多模态数据集构建。虽然使用了’Motion Foundation Model’这一术语，但该模型专门针对人体运动序列，并非通用大语言模型或基础模型。论文未涉及任何大语言模型技术、训练方法（如预训练、微调、对齐）、推理优化、代理系统或科学AI应用。所有关键词均与大语言模型及相关技术直接相关，而本文研究领域完全不同，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了SentiAvatar框架，通过构建大规模多模态数据集、预训练运动基础模型和音频感知规划-填充架构，解决了实时生成语义恰当且与语音节奏同步的3D数字人动作的挑战，在多个基准上达到了最先进性能。

摘要翻译

本文提出SentiAvatar框架，用于构建富有表现力的交互式三维数字人，并基于此创建了能够实时说话、手势和表情变化的虚拟角色SuSu。实现此类系统仍面临挑战，因其需同时解决三个关键问题：缺乏大规模高质量多模态数据、鲁棒的语义到动作映射，以及细粒度的帧级动作-韵律同步。为解决这些问题，首先，我们构建了SuSuInterActs数据集（包含2.1万条片段，总时长37小时），该对话语料通过光学动作捕捉系统采集，围绕单一角色录制了同步的语音、全身动作及面部表情。其次，我们在超过20万条动作序列上预训练了一个动作基础模型（Motion Foundation Model），使其具备远超对话场景的丰富动作先验知识。随后，我们提出一种音频感知的“先规划后填充”架构，将句子级语义规划与帧级韵律驱动的插值过程解耦，从而确保生成的动作既语义恰当，又在节奏上与语音同步。实验表明，SentiAvatar在SuSuInterActs（R@1达到43.64%，接近最佳基线的2倍）和BEATv2（FGD为4.941，BC为8.078）数据集上均达到最优性能，可在0.3秒内生成6秒输出，并支持无限多轮流式生成。源代码、模型及数据集已公开于https://sentiavatar.github.io。

摘要 (Abstract)

We present SentiAvatar, a framework for building expressive interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale, high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization. To solve these problems, first, we build SuSuInterActs (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art on both SuSuInterActs (R@1 43.64%, nearly 2 times the best baseline) and BEATv2 (FGD 4.941, BC 8.078), producing 6s of output in 0.3s with unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.

关键词: SentiAvatar, expressive interactive digital humans, motion foundation model, audio-aware plan-then-infill, multimodal dialogue corpus, semantic-to-motion mapping, motion-prosody synchronization, real-time generation

167. ❌ UniSpector: Towards Universal Open-set Defect Recognition via Spectral-Contrastive Visual Prompting

作者: Geonuk Kim, Minhoi Kim, Kangil Lee, Minsu Kim, Hyeonseong Jeon, Jeonghoon Han, Hyoungjoon Lim, Junho Yim 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02905v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文UniSpector专注于计算机视觉领域的开放集缺陷识别，提出了一种基于视觉提示（visual prompting）的方法，涉及空间-光谱提示编码器、对比提示编码器和提示引导查询选择等技术。虽然研究背景提到关注大模型和深度学习在科学领域的应用，但该论文的具体内容与所有评分关键词（主要围绕大语言模型、训练技术、推理优化、对齐、代理系统等）均无直接关联。论文属于计算机视觉中的工业检测应用，而非大模型或深度学习技术原理的创新，也未涉及生物医药等科学领域的AI应用。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对工业检测中现有方法难以识别未知缺陷的问题，提出了一种名为UniSpector的通用开放集缺陷识别方法，通过设计语义结构化的提示拓扑和对比学习，在首个视觉提示基准上显著超越了基线方法。

摘要翻译

尽管工业检测系统本应具备识别未知缺陷的能力，但现有方法大多基于封闭集假设运行，导致其无法有效检测新型异常。视觉提示技术为工业检测提供了一种可扩展的替代方案，然而现有方法常因类内差异大、类间差异细微而面临提示嵌入坍缩的问题。为解决这一难题，我们提出UniSpector方法，其核心思想是将研究重点从简单的提示-区域匹配转向设计具有语义结构且可迁移的提示拓扑。UniSpector采用空间-频谱提示编码器提取方向不变性的细粒度表征，为对比式提示编码器提供坚实基础，从而显式地将提示空间规整为语义组织的角度流形。此外，提示引导的查询选择机制能生成与提示对齐的自适应目标查询。我们构建了首个基于视觉提示的开集缺陷定位基准测试集“Inspect Anything”，在该基准上UniSpector以显著优势超越基线方法，在AP50b和AP50m指标上分别提升至少19.7%和15.8%。实验结果表明，我们的方法为持续演进的工业环境提供了一种可扩展、免重训练的检测范式，同时为通用视觉提示设计提供了关键见解。

摘要 (Abstract)

Although industrial inspection systems should be capable of recognizing unprecedented defects, most existing approaches operate under a closed-set assumption, which prevents them from detecting novel anomalies. While visual prompting offers a scalable alternative for industrial inspection, existing methods often suffer from prompt embedding collapse due to high intra-class variance and subtle inter-class differences. To resolve this, we propose UniSpector, which shifts the focus from naive prompt-to-region matching to the principled design of a semantically structured and transferable prompt topology. UniSpector employs the Spatial-Spectral Prompt Encoder to extract orientation-invariant, fine-grained representations; these serve as a solid basis for the Contrastive Prompt Encoder to explicitly regularize the prompt space into a semantically organized angular manifold. Additionally, Prompt-guided Query Selection generates adaptive object queries aligned with the prompt. We introduce Inspect Anything, the first benchmark for visual-prompt-based open-set defect localization, where UniSpector significantly outperforms baselines by at least 19.7% and 15.8% in AP50b and AP50m, respectively. These results show that our method enable a scalable, retraining-free inspection paradigm for continuously evolving industrial environments, while offering critical insights into the design of generic visual prompting.

关键词: open-set defect recognition, visual prompting, spectral-contrastive, industrial inspection, Spatial-Spectral Prompt Encoder, Contrastive Prompt Encoder, Prompt-guided Query Selection, Inspect Anything benchmark

168. ❌ EvaNet: Towards More Efficient and Consistent Infrared and Visible Image Fusion Assessment

作者: Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Tao Zhou, Hui Li, Zhangyong Tang, Josef Kittler 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02896v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究红外与可见光图像融合评估，属于计算机视觉领域。与大多数大模型技术关键词（如MoE、SFT、RLHF等）完全无关。唯一相关的是：1）‘Large Language Models’：摘要中提到使用大语言模型提供感知场景评估，但非核心方法，给5分；2）‘AI for Science’：图像融合可视为AI在科学/工程领域的应用，给5分。其他关键词均未涉及。

!!! tip deepseek-chat TL;DR

该论文针对红外与可见光图像融合评估中传统指标不准确且计算成本高的问题，提出了一种轻量级网络框架，通过分解融合结果并使用大语言模型辅助训练，实现了比传统方法快1000倍且更符合人类视觉感知的评估性能。

摘要翻译

评估在图像融合研究中至关重要，然而现有的大多数评估指标均直接借用自其他视觉任务，未经适当调整。这些传统指标通常基于复杂的图像变换，不仅难以捕捉融合结果的真实质量，而且计算成本高昂。为解决这些问题，我们提出了一种专为图像融合定制的统一评估框架。其核心是一个轻量级网络，采用分而治之策略，旨在高效近似广泛使用的评估指标。与传统方法直接评估融合图像与源图像之间的相似性不同，我们首先将融合结果分解为红外与可见光分量。随后，评估模型用于度量这些分离分量中的信息保留程度，从而有效解耦融合评估过程。在训练阶段，我们引入对比学习策略，并借助大语言模型提供的感知场景评估来指导评估模型的学习。最后，我们提出了首个一致性评估框架，该框架以独立无参考分数与下游任务性能作为客观参照，衡量图像融合指标与人类视觉感知之间的对齐程度。大量实验表明，我们基于学习的评估范式在一系列标准图像融合基准测试中，既实现了卓越的效率（最高可提升1000倍），也展现出更高的一致性。我们的代码将在 https://github.com/AWCXV/EvaNet 公开。

摘要 (Abstract)

Evaluation is essential in image fusion research, yet most existing metrics are directly borrowed from other vision tasks without proper adaptation. These traditional metrics, often based on complex image transformations, not only fail to capture the true quality of the fusion results but also are computationally demanding. To address these issues, we propose a unified evaluation framework specifically tailored for image fusion. At its core is a lightweight network designed efficiently to approximate widely used metrics, following a divide-and-conquer strategy. Unlike conventional approaches that directly assess similarity between fused and source images, we first decompose the fusion result into infrared and visible components. The evaluation model is then used to measure the degree of information preservation in these separated components, effectively disentangling the fusion evaluation process. During training, we incorporate a contrastive learning strategy and inform our evaluation model by perceptual scene assessment provided by a large language model. Last, we propose the first consistency evaluation framework, which measures the alignment between image fusion metrics and human visual perception, using both independent no-reference scores and downstream tasks performance as objective references. Extensive experiments show that our learning-based evaluation paradigm delivers both superior efficiency (up to 1,000 times faster) and greater consistency across a range of standard image fusion benchmarks. Our code will be publicly available at https://github.com/AWCXV/EvaNet.

关键词: image fusion, evaluation framework, infrared and visible images, lightweight network, contrastive learning, large language model, consistency evaluation, human visual perception

169. ❌ Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

作者: Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang, Zhou Yu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02891v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出ProVCA，一种用于长视频理解的多模态大语言模型（MLLM）代理，通过渐进式视频浓缩迭代定位关键帧。该研究核心涉及MLLM（属于大语言模型/基础模型范畴）和代理工作流（LLM Agents），因此这两个关键词高度相关（10分）。其他关键词如MoE、量化、推理加速、幻觉缓解等，论文未涉及或仅间接提及，故评0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ProVCA的渐进式视频浓缩代理，通过多粒度迭代定位关键帧，以高效利用多模态大语言模型进行长视频理解，在多个基准测试中取得了最先进的零样本准确率，同时使用了更少的帧数。

摘要翻译

理解长视频需要在有限的计算资源下从长序列中提取与查询相关的信息。现有的“文本优先后接大语言模型”流水线会丢失细粒度视觉线索，而基于视频的多模态大语言模型虽能保留视觉细节，但对帧数需求过高且计算代价昂贵。本研究旨在利用多模态大语言模型实现高效视频理解。我们提出ProVCA——一种渐进式视频浓缩智能体，可在多粒度上迭代定位关键视频帧。ProVCA首先采用片段定位模块识别与查询相关的视频段落，继而通过片段精选模块依据相似性筛选重要片段，最后通过关键帧优化模块在片段中精确定位具体关键帧。通过从粗粒度段落逐步聚焦到细粒度帧，ProVCA为基于多模态大语言模型的推理筛选出少量关键帧。该方法在零样本设定下取得了最先进的性能：在EgoSchema上达到69.3%准确率，在NExT-QA上达到80.5%，在IntentQA上达到77.7%，同时使用的帧数少于以往无需训练的方法。

摘要 (Abstract)

Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can keep visual details but are too frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes in those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.

关键词: Progressive Video Condensation, MLLM Agent, Long-form Video Understanding, Keyframe Selection, Multimodal Large Language Models, Zero-shot Accuracy, Efficient Video Processing, State-of-the-art Performance

170. ❌ Information-Regularized Constrained Inversion for Stable Avatar Editing from Sparse Supervision

作者: Zhenxiao Liang, Qixing Huang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02883v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是可动画人体化身的编辑问题，提出了一种基于条件引导的编辑重建框架，通过约束反演在结构化化身潜在空间中进行编辑。论文的核心技术涉及计算机视觉、图形学和优化方法，包括潜在空间编辑、约束反演、Hessian-vector乘积等。所有评分关键词均与大语言模型、深度学习技术原理或AI科学应用相关，而本文专注于计算机视觉和图形学中的特定编辑问题，未涉及任何大模型技术、深度学习创新或AI科学应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文解决了在稀疏监督下编辑可动画人体化身时出现的身份泄漏和时序闪烁问题，提出了一种条件引导的编辑重建框架，通过约束反演在结构化潜在空间中实现稳定编辑。

摘要翻译

编辑可动画人体化身通常依赖于稀疏监督（通常仅为少量编辑后的关键帧），然而将重建的化身简单拟合至这些编辑结果常导致身份信息泄漏与姿态相关的时序闪烁。我们认为这些失败本质上可理解为不适定反演问题：现有的编辑约束不足以确定实现目标编辑所需的潜在空间方向。本文提出一种条件引导的编辑重建框架，该方法在结构化化身潜在空间中执行约束反演以实现编辑，将更新限制在低维度的部件特定编辑子空间内，从而防止非预期的身份变化。关键创新在于，我们通过优化从完整解码-渲染流程的局部线性化导出的条件目标来设计反演过程中的编辑约束，由此生成一个编辑子空间信息矩阵——其谱特征可预测稳定性并驱动帧重加权与关键帧激活机制。该方法仅需操作小型子空间矩阵，可通过高效方式实现（例如基于海森-向量乘积），并在有限编辑监督下显著提升稳定性。

摘要 (Abstract)

Editing animatable human avatars typically relies on sparse supervision, often a few edited keyframes, yet naively fitting a reconstructed avatar to these edits frequently causes identity leakage and pose-dependent temporal flicker. We argue that these failures are best understood as an ill-conditioned inversion: the available edited constraints do not sufficiently determine the latent directions responsible for the intended edit. We propose a conditioning-guided edited reconstruction framework that performs editing as a constrained inversion in a structured avatar latent space, restricting updates to a low-dimensional, part-specific edit subspace to prevent unintended identity changes. Crucially, we design the editing constraints during inversion by optimizing a conditioning objective derived from a local linearization of the full decoding-and-rendering pipeline, yielding an edit-subspace information matrix whose spectrum predicts stability and drives frame reweighting / keyframe activation. The resulting method operates on small subspace matrices and can be implemented efficiently (e.g., via Hessian-vector products), and improves stability under limited edited supervision.

关键词: avatar editing, sparse supervision, constrained inversion, latent space, edit subspace, stability, Hessian-vector products, information matrix

171. ❌ InstructTable: Improving Table Structure Recognition Through Instructions

作者: Boming Chen, Zining Wang, Zhentao Guo, Jianqiang Liu, Chen Duan, Yu Gu, Kai zhou, Pengfei Yan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02880v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于表格结构识别（TSR）任务，提出了一个指令引导的多阶段训练框架InstructTable。该研究主要涉及计算机视觉（CV）和自然语言处理（NLP）的交叉领域，特别是视觉语言模型（VLM）在特定结构化文档理解任务上的应用。其核心创新在于使用精心设计的表格指令进行预训练，以增强模型对复杂表格结构的理解，并提出了一个合成数据方法TME来构建基准数据集。然而，论文并未明确涉及或讨论评分列表中的绝大多数大模型（LLM）核心技术，如MoE、Scaling Laws、RLHF、PEFT、RAG、推理加速、智能体等。唯一相关的关键词是“Instruction Tuning OR Alignment OR Value Alignment”，因为论文的核心方法是“instruction-guided multi-stage training”和“tabular-data-specific instructions”，这属于指令调优在特定领域（表格理解）的应用，因此给予10分（高度相关，核心内容）。其他关键词与论文的研究焦点（表格图像的结构化解析）无直接关联，故均评为0分。

!!! tip deepseek-chat TL;DR

该论文针对复杂表格结构识别（TSR）的挑战，提出了一个名为InstructTable的指令引导多阶段训练框架，通过表格指令预训练和创新的合成数据方法，在多个基准数据集上实现了最先进的性能。

摘要翻译

表格结构识别（Table Structure Recognition, TSR）通过将表格图像解析为结构化表示而具有广泛的实际应用价值，但在处理涉及合并单元格或空单元格的复杂布局时面临显著挑战。传统以视觉为中心的模型完全依赖视觉信息而缺乏关键的语义支持，从而阻碍了在复杂场景下准确的结构识别。视觉语言模型利用上下文语义来增强理解；然而，这些方法对视觉结构信息的建模重视不足。为应对这些局限，本文提出了InstructTable，一个指令引导的多阶段训练TSR框架。精心设计的表格指令预训练将注意力导向细粒度的结构模式，增强了对复杂表格的理解。互补的TSR微调保留了强大的视觉信息建模能力，确保在不同场景下保持高精度的表格解析。此外，我们引入了Table Mix Expand（TME），一种创新的无模板方法，用于合成大规模的真实表格数据。利用TME，我们构建了平衡复杂密集合成表格（Balanced Complex Dense Synthetic Tables, BCDSTab）基准，包含通过我们方法合成的900张复杂表格图像，以作为一个严格的评估基准。在多个公共数据集（FinTabNet、PubTabNet、MUSTARD）和BCDSTab上进行的大量实验表明，InstructTable在TSR任务中实现了最先进的性能。消融研究进一步证实了所提出的表格数据专用指令和合成数据的积极影响。

摘要 (Abstract)

Table structure recognition (TSR) holds widespread practical importance by parsing tabular images into structured representations, yet encounters significant challenges when processing complex layouts involving merged or empty cells. Traditional visual-centric models rely exclusively on visual information while lacking crucial semantic support, thereby impeding accurate structural recognition in complex scenarios. Vision-language models leverage contextual semantics to enhance comprehension; however, these approaches underemphasize the modeling of visual structural information. To address these limitations, this paper introduces InstructTable, an instruction-guided multi-stage training TSR framework. Meticulously designed table instruction pre-training directs attention toward fine-grained structural patterns, enhancing comprehension of complex tables. Complementary TSR fine-tuning preserves robust visual information modeling, maintaining high-precision table parsing across diverse scenarios. Furthermore, we introduce Table Mix Expand (TME), an innovative template-free method for synthesizing large-scale authentic tabular data. Leveraging TME, we construct the Balanced Complex Dense Synthetic Tables (BCDSTab) benchmark, comprising 900 complex table images synthesized through our method to serve as a rigorous benchmark. Extensive experiments on multiple public datasets (FinTabNet, PubTabNet, MUSTARD) and BCDSTab demonstrate that InstructTable achieves state-of-the-art performance in TSR tasks. Ablation studies further confirm the positive impact of the proposed tabular-data-specific instructions and synthetic data.

关键词: Table Structure Recognition, Instruction-guided Training, Vision-Language Models, Synthetic Data Generation, Complex Table Parsing, Multi-stage Framework, Benchmark Dataset

172. ❌ Unlocking Positive Transfer in Incrementally Learning Surgical Instruments: A Self-reflection Hierarchical Prompt Framework

作者: Yu Zhu, Kang Li, Zheng Li, Pheng-Ann Heng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02877v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 论文提出了一种用于手术器械分割的自反思分层提示框架，属于AI在生物医学（手术）领域的应用，与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评8分）。框架核心机制包含’self-reflection’，与’Self-Correction OR Self-Improvement OR Self-Reflection’高度相关（评10分）。论文提到使用’pre-trained model’并进行增量学习，与’Pre-training OR Continual Pre-training OR Domain Adaptation’有一定关联（评5分）。论文未涉及大语言模型（LLMs）、MoE、推理加速、对齐、RAG等具体技术，因此其他关键词评0分。

!!! tip deepseek-chat TL;DR

该论文针对手术视频场景解析中模型增量学习新手术器械时存在的知识迁移不足和灾难性遗忘问题，提出了一种自反思分层提示框架，通过在两个公开基准上分别实现超过5%和11%的性能提升，有效促进了正向和反向知识转移并避免了遗忘。

摘要翻译

为持续提升手术视频场景解析中的模型适应能力，近期研究通过增量更新的方式，使模型逐步学习随时间增长的手术器械分割任务。然而，现有工作普遍忽视了正向知识迁移（即已有知识如何帮助学习新类别）与反向知识迁移（即学习新类别如何优化已有知识）的潜力。本文提出一种自反思层次化提示框架，旨在释放类别增量分割中正向与反向知识迁移的效能，从而高效学习新器械、提升常规器械的现有分割能力，并避免对旧器械的灾难性遗忘。该框架基于冻结的预训练模型构建，通过在训练过程中自适应地为新类别添加器械感知提示。为实现正向知识迁移，我们将器械提示组织为层次化提示解析树：以器械共享提示分区为根节点，n级共享提示分区为中间节点，器械独有提示分区为叶节点，从而向新类别暴露可复用的历史知识以简化其学习过程。反之，为促进反向知识迁移，我们通过有向加权图传播对已有知识进行自反思优化，利用树中记录的知识关联性提升其表征能力，同时避免灾难性遗忘。本框架可同时适用于基于CNN的模型与先进的基于Transformer的基础模型，在两个公开基准测试中分别取得了超过现有方法5%和11%的性能提升。

摘要 (Abstract)

To continuously enhance model adaptability in surgical video scene parsing, recent studies incrementally update it to progressively learn to segment an increasing number of surgical instruments over time. However, prior works constantly overlooked the potential of positive forward knowledge transfer, i.e., how past knowledge could help learn new classes, and positive backward knowledge transfer, i.e., how learning new classes could help refine past knowledge. In this paper, we propose a self-reflection hierarchical prompt framework that unlocks the power of positive forward and backward knowledge transfer in class incremental segmentation, aiming to proficiently learn new instruments, improve existing skills of regular instruments, and avoid catastrophic forgetting of old instruments. Our framework is built on a frozen, pre-trained model that adaptively appends instrument-aware prompts for new classes throughout training episodes. To enable positive forward knowledge transfer, we organize instrument prompts into a hierarchical prompt parsing tree with the instrument-shared prompt partition as the root node, n-part-shared prompt partitions as intermediate nodes and instrument-distinct prompt partitions as leaf nodes, to expose the reusable historical knowledge for new classes to simplify their learning. Conversely, to encourage positive backward knowledge transfer, we conduct self-reflection refining on existing knowledge by directed-weighted graph propagation, examining the knowledge associations recorded in the tree to improve its representativeness without causing catastrophic forgetting. Our framework is applicable to both CNN-based models and advanced transformer-based foundation models, yielding more than 5% and 11% improvements over the competing methods on two public benchmarks respectively.

关键词: surgical instrument segmentation, class incremental learning, knowledge transfer, self-reflection, hierarchical prompt framework, catastrophic forgetting, surgical video parsing, foundation models

173. ❌ SPG: Sparse-Projected Guides with Sparse Autoencoders for Zero-Shot Anomaly Detection

作者: Tomoyasu Nanaumi, Yukino Tsuzuki, Junichi Okubo, Junichiro Fujii, Takayoshi Yamashita 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02871v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文使用冻结的基础模型特征进行零样本异常检测，与’Large Language Models OR LLMs OR Foundation Models’高度相关（10分）。方法涉及稀疏自编码器（SAE），与’Mixture of Experts OR MoE OR Sparse Models’有一定关联（5分）。研究使用预训练模型（如DINOv3、OpenCLIP），与’Pre-training OR Continual Pre-training OR Domain Adaptation’相关（5分）。训练过程涉及在辅助数据集上优化参数，与’Post-training OR Supervised Fine-tuning OR SFT’相关（5分）。方法冻结主干和SAE，仅优化指南系数，类似于参数高效微调，与’PEFT OR LoRA OR Parameter-efficient Fine-tuning’相关（5分）。指南系数可追溯决策到字典原子，提供可解释性，与’Mechanistic Interpretability OR Explainable AI’相关（5分）。其他关键词与论文内容无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为SPG的零样本异常检测和分割框架，它利用稀疏自编码器在冻结基础模型特征上学习稀疏指南系数，无需目标域适应，在MVTec AD和VisA数据集上实现了竞争力的检测和分割性能。

摘要翻译

本研究探讨利用冻结基础模型特征进行零样本异常检测与分割的方法，其中所有可学习参数仅在带标签的辅助数据集上训练，并直接部署至未见过的目标类别，无需任何目标域适应。现有基于提示词的方法使用手工设计或学习得到的提示嵌入作为正常/异常状态的参考向量。我们提出稀疏投影引导向量框架，这是一种无需提示词的方案，通过在稀疏自编码器潜在空间中学习稀疏引导系数，并借助稀疏自编码器字典生成正常/异常引导向量。该框架在带标签的辅助数据集上采用两阶段学习策略：首先在图像块特征上训练稀疏自编码器；随后在冻结主干网络与稀疏自编码器的前提下，仅利用辅助数据的像素级掩码优化引导系数。在跨数据集零样本设定下的MVTec AD与VisA基准测试中，本方法在图像级检测任务中表现具有竞争力，在像素级分割任务中表现优异；结合DINOv3主干网络时，本方法在对比方法中取得了最高的像素级AUROC。我们还报告了基于OpenCLIP（ViT-L/14@336px）主干网络实例化的结果，以与基于CLIP的基线方法保持主干网络一致性。此外，学习得到的引导系数可将决策溯源至少量字典原子，揭示了跨类别通用与类别特定的异常因素。

摘要 (Abstract)

We study zero-shot anomaly detection and segmentation using frozen foundation model features, where all learnable parameters are trained only on a labeled auxiliary dataset and deployed to unseen target categories without any target-domain adaptation. Existing prompt-based approaches use handcrafted or learned prompt embeddings as reference vectors for normal/anomalous states. We propose Sparse-Projected Guides (SPG), a prompt-free framework that learns sparse guide coefficients in the Sparse Autoencoder (SAE) latent space, which generate normal/anomaly guide vectors via the SAE dictionary. SPG employs a two stage learning strategy on the labeled auxiliary dataset: (i) train an SAE on patch-token features, and (ii) optimize only guide coefficients using auxiliary pixel-level masks while freezing the backbone and SAE. On MVTec AD and VisA under cross-dataset zero-shot settings, SPG achieves competitive image-level detection and strong pixel-level segmentation; with DINOv3, SPG attains the highest pixellevel AUROC among the compared methods. We also report SPG instantiated with OpenCLIP (ViT-L/14@336px) to align the backbone with CLIP-based baselines. Moreover, the learned guide coefficients trace decisions back to a small set of dictionary atoms, revealing category-general and category-specific factors.

关键词: zero-shot anomaly detection, foundation model features, sparse autoencoder, Sparse-Projected Guides, frozen backbone, pixel-level segmentation, DINOv3, OpenCLIP

174. ❌ Token Warping Helps MLLMs Look from Nearby Viewpoints

作者: Phillip Y. Lee, Chanho Park, Mingue Park, Seungwoo Yoo, Juil Koo, Minhyuk Sung 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02870v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于多模态大语言模型（MLLMs）的视觉推理能力，特别是通过令牌扭曲（token warping）技术来提升模型在视角变化下的鲁棒性。论文的核心关键词是’Large Language Models OR LLMs OR Foundation Models’，因为MLLMs是LLMs的一种扩展，直接相关，评分为10分。其他关键词如MoE、SLMs、Scaling Laws、各种训练方法（预训练、后训练、对齐、RLHF、PEFT）、推理技术（RAG、CoT、System 2、MCTS）、模型优化（量化、加速）、可解释性、世界模型、模型合并、上下文学习等，论文均未涉及，评分为0分。‘AI for Science’关键词虽然涉及科学应用，但论文聚焦于计算机视觉和MLLMs技术，而非生物信息学或化学信息学等具体科学领域，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文研究了如何通过令牌扭曲（而非像素扭曲）来帮助多模态大语言模型（MLLMs）从附近视角可靠地推理场景，实验证明该方法在ViewBench基准上优于所有基线。

摘要翻译

扭曲标记而非像素能否帮助多模态大语言模型（MLLMs）理解场景在邻近视角下的呈现方式？尽管MLLMs在视觉推理任务中表现优异，它们对视角变化仍显脆弱，因为像素级扭曲对微小的深度误差高度敏感，且常引入几何畸变。借鉴心理意象理论——该理论将部件级结构表征视为人类视角转换的基础，我们探究基于视觉Transformer（ViT）的MLLMs中的图像标记是否能作为视角变换的有效载体。通过比较前向与后向扭曲方法，我们发现后向标记扭曲在目标视图上定义密集网格并为每个网格点检索对应源视图标记的策略，能在视角变化下实现更高稳定性并更好地保持语义连贯性。在我们提出的ViewBench基准测试上的实验表明，标记级扭曲使MLLMs能够从邻近视角进行可靠推理，其表现持续优于所有基线方法，包括像素级扭曲方案、空间微调的MLLMs以及生成式扭曲方法。

摘要 (Abstract)

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.

关键词: multimodal large language models, MLLMs, token warping, viewpoint changes, visual reasoning, ViT-based models, backward warping, ViewBench benchmark

175. ❌ Few-Shot Distribution-Aligned Flow Matching for Data Synthesis in Medical Image Segmentation

作者: Jie Yang, Ziqi Ye, Aihua Ke, Jian Luo, Bo Cai, Xiaosong Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02868v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

评分理由: 该论文专注于医学图像分割中的数据合成问题，提出了一种基于流匹配（flow matching）的生成模型AlignFlow，通过可微分奖励微调来对齐目标域图像分布。论文的核心技术是生成模型（flow matching）和数据增强，属于计算机视觉和医学图像分析领域。所有关键词（除了最后一个）都明确针对大语言模型（LLM）及其相关技术（如MoE、SFT、RLHF、RAG、推理、代理、量化等），而本文未涉及任何语言模型或文本处理，因此这些关键词评分为0。最后一个关键词“AI for Science OR Bioinformatics OR Cheminformatics”评分为8，因为论文属于AI在科学（具体是医学图像分析）中的应用，与生物信息学相关，但并非核心生物信息学或化学信息学，因此不是满分。

!!! tip deepseek-chat TL;DR

该论文针对医学图像分割中数据异构性问题，提出了一种基于流匹配的生成模型AlignFlow，通过分布对齐机制和可微分奖励微调，在少量参考图像下有效合成图像-掩码对，实验表明其在多个数据集上显著提升了分割性能（mDice提高3.5-4.0%，mIoU提高3.5-5.6%）。

摘要翻译

数据异构性阻碍了医学影像分析模型的临床部署，而生成式数据增强有助于缓解这一问题。然而，当前基于扩散模型合成图像-掩码对的方法往往忽略了生成图像与真实图像在不同场景间的分布偏移，这种不匹配会显著降低下游任务性能。为解决该问题，我们提出了AlignFlow，一种通过可微分奖励微调与目标参考图像分布对齐的流匹配模型，即使在仅提供少量参考图像时仍能保持有效性。具体而言，我们将流匹配模型的训练分为两个阶段：第一阶段，模型拟合训练数据以生成合理的图像；随后，我们引入分布对齐机制，采用可微分奖励引导生成图像朝向目标域给定样本的分布靠拢。此外，为提升生成掩码的多样性，我们还设计了基于流匹配的掩码生成方法，以增强感兴趣区域的多样性。大量实验证明了我们方法的有效性，即在多种数据集和场景下，mDice指标提升3.5-4.0%，mIoU指标提升3.5-5.6%。

摘要 (Abstract)

Data heterogeneity hinders clinical deployment of medical image analysis models, and generative data augmentation helps mitigate this issue. However, recent diffusion-based methods that synthesize image-mask pairs often ignore distribution shifts between generated and real images across scenarios, and such mismatches can markedly degrade downstream performance. To address this issue, we propose AlignFlow, a flow matching model that aligns with the target reference image distribution via differentiable reward fine-tuning, and remains effective even when only a small number of reference images are provided. Specifically, we divide the training of the flow matching model into two stages: in the first stage, the model fits the training data to generate plausible images; Then, we introduce a distribution alignment mechanism and employ differentiable reward to steer the generated images toward the distribution of the given samples from the target domain. In addition, to enhance the diversity of generated masks, we also design a flow matching based mask generation to complement the diversity in regions of interest. Extensive experiments demonstrate the effectiveness of our approach, i.e., performance improvement by 3.5-4.0% in mDice and 3.5-5.6% in mIoU across a variety of datasets and scenarios.

关键词: medical image segmentation, data synthesis, flow matching, distribution alignment, few-shot learning, generative data augmentation, differentiable reward fine-tuning, image-mask pairs

176. ❌ HairOrbit: Multi-view Aware 3D Hair Modeling from Single Portraits

作者: Leyang Jin, Yujian Zheng, Bingkui Tong, Yuda Qiu, Zhenyu Xie, Hao Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02867v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉和3D重建领域，研究从单视角肖像重建发丝级3D头发模型。虽然使用了视频生成模型的3D先验，但论文核心内容与所有评分关键词（均围绕大语言模型、深度学习技术原理及其在科学领域的应用）完全无关。论文未涉及任何语言模型、模型训练/微调技术、推理优化、对齐方法、代理系统或AI for Science的具体应用。

!!! tip deepseek-chat TL;DR

该论文提出了一种从单视角肖像重建发丝级3D头发的新框架，通过利用视频生成模型的3D先验将其转化为校准的多视角重建任务，并结合神经方向提取器和两阶段发丝生长算法，在可见和不可见区域均实现了最先进的性能。

摘要翻译

从单视角图像重建发丝级三维头发极具挑战性，尤其是在保持不可见区域一致且真实的头发属性方面。现有方法依赖于有限的正视角线索以及小规模或风格受限的合成数据，往往难以在不可见区域生成令人满意的结果。本文提出一种新颖框架，该框架利用视频生成模型的强大三维先验，将单视角头发重建转化为经过校准的多视角重建任务。为在重构后的多视角任务中平衡重建质量与效率，我们进一步引入一种基于稀疏真实图像标注训练的神经方向提取器，以实现更优的全视角方向估计。此外，我们设计了一种基于混合隐式场的两阶段发丝生长算法，能够以较快的速度合成具有精细细节的三维发丝曲线。大量实验表明，在多样化的头发肖像数据上，无论是可见区域还是不可见区域，我们的方法在单视角三维发丝重建任务中均达到了最先进的性能水平。

摘要 (Abstract)

Reconstructing strand-level 3D hair from a single-view image is highly challenging, especially when preserving consistent and realistic attributes in unseen regions. Existing methods rely on limited frontal-view cues and small-scale/style-restricted synthetic data, often failing to produce satisfactory results in invisible regions. In this work, we propose a novel framework that leverages the strong 3D priors of video generation models to transform single-view hair reconstruction into a calibrated multi-view reconstruction task. To balance reconstruction quality and efficiency for the reformulated multi-view task, we further introduce a neural orientation extractor trained on sparse real-image annotations for better full-view orientation estimation. In addition, we design a two-stage strand-growing algorithm based on a hybrid implicit field to synthesize the 3D strand curves with fine-grained details at a relatively fast speed. Extensive experiments demonstrate that our method achieves state-of-the-art performance on single-view 3D hair strand reconstruction on a diverse range of hair portraits in both visible and invisible regions.

关键词: 3D hair reconstruction, single-view image, strand-level modeling, multi-view reconstruction, video generation models, neural orientation extractor, strand-growing algorithm, hybrid implicit field

177. ❌ HiDiGen: Hierarchical Diffusion for B-Rep Generation with Explicit Topological Constraints

作者: Shurui Liu, Weide Chen, Ancong Wu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02847v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文HiDiGen专注于计算机辅助设计（CAD）中边界表示（B-rep）的生成，采用分层扩散模型和Transformer架构解决几何与拓扑约束的耦合问题。所有评分关键词均与大语言模型、深度学习技术原理或科学AI应用直接相关，而本文研究的是3D形状生成和CAD建模，属于计算机图形学/几何深度学习领域，与评分关键词列表中的大模型技术、训练方法、推理优化、AI代理等主题无直接关联，也未涉及生物信息学或化学信息学等科学AI应用。因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文提出HiDiGen分层生成框架，通过解耦拓扑约束和几何建模，解决了CAD中边界表示（B-rep）的有效生成问题，实现了新颖、多样且拓扑正确的3D模型生成。

摘要翻译

边界表示（B-rep）是CAD系统中的标准三维建模格式，同时编码了几何基元与拓扑连接关系。尽管其应用广泛，但由于离散拓扑与连续几何之间复杂的相互作用，对有效B-rep结构进行深度生成建模仍具挑战性。本文提出HiDiGen——一种分层生成框架，该框架将几何建模解耦为两个阶段，每个阶段均受显式建模的拓扑约束引导。具体而言，我们的方法首先建立面-边关联关系以定义一致的拓扑骨架，并在此基础上生成面代理与初始边曲线。随后，采用多个基于Transformer的扩散模块，通过生成精确的面曲面与顶点位置来细化几何，同时动态建立并强制实施边-顶点邻接关系以保持结构一致性。这种渐进式几何层次结构能够生成更具新颖性与多样性的形状，而两阶段拓扑建模则确保了高有效性。实验结果表明，HiDiGen实现了优异性能，能够生成新颖、多样且拓扑结构合理的CAD模型。

摘要 (Abstract)

Boundary representation (B-rep) is the standard 3D modeling format in CAD systems, encoding both geometric primitives and topological connectivity. Despite its prevalence, deep generative modeling of valid B-rep structures remains challenging due to the intricate interplay between discrete topology and continuous geometry. In this paper, we propose HiDiGen, a hierarchical generation framework that decouples geometry modeling into two stages, each guided by explicitly modeled topological constraints. Specifically, our approach first establishes face-edge incidence relations to define a coherent topological scaffold, upon which face proxies and initial edge curves are generated. Subsequently, multiple Transformer-based diffusion modules are employed to refine the geometry by generating precise face surfaces and vertex positions, with edge-vertex adjacencies dynamically established and enforced to preserve structural consistency. This progressive geometry hierarchy enables the generation of more novel and diverse shapes, while two-stage topological modeling ensures high validity. Experimental results show that HiDiGen achieves strong performance, generating novel, diverse, and topologically sound CAD models.

关键词: B-rep generation, hierarchical diffusion, topological constraints, CAD modeling, 3D shape generation, Transformer-based diffusion, geometry refinement, structural consistency

178. ❌ Adaptive Local Frequency Filtering for Fourier-Encoded Implicit Neural Representations

作者: Ligen Shi, Jun Qiu, Yuhang Zheng, Chang Liu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02846v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是隐式神经表示（INRs）中的傅里叶编码方法改进，属于计算机视觉和信号处理领域，专注于优化连续信号建模的局部频率适应性。所有评分关键词均与大模型、深度学习技术原理或科学AI应用直接相关，而本文的核心内容（傅里叶特征映射、神经正切核分析、图像/形状拟合）与这些关键词无直接关联，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对傅里叶编码隐式神经表示在处理空间变化局部频谱信号时收敛慢的问题，提出了一种自适应局部频率滤波方法，通过空间变化参数调制傅里叶分量，实验证明其在2D图像拟合、3D形状表示和稀疏数据重建中提高了重建质量并加速了优化。

摘要翻译

傅里叶编码隐式神经表示（INRs）在从离散样本中建模连续信号方面展现出强大能力。然而，传统的傅里叶特征映射在整个空间域中使用固定的频率集合，这使其难以适应具有空间变化局部频谱的信号，并常常导致高频细节收敛缓慢。为解决这一问题，我们提出了一种用于傅里叶编码INRs的自适应局部频率滤波方法。该方法引入了一个空间变化的参数$α(\mathbf{x})$来调制编码后的傅里叶分量，使得在不同空间位置能够平滑过渡到低通、带通或高通行为。我们进一步从神经正切核（Neural Tangent Kernel, NTK）的视角分析了所提出滤波器的效果，并基于NTK理论解释了其如何重塑有效核频谱。在二维图像拟合、三维形状表示和稀疏数据重建上的实验表明，与固定频率基线方法相比，所提方法能持续提升重建质量并实现更快的优化速度。此外，学习到的$α(\mathbf{x})$提供了对空间变化频率偏好的直观可视化，有助于解释模型在非平稳信号上的行为。这些结果表明，自适应局部频率调制是对傅里叶编码INRs的一种实用增强。

摘要 (Abstract)

Fourier-encoded implicit neural representations (INRs) have shown strong capability in modeling continuous signals from discrete samples. However, conventional Fourier feature mappings use a fixed set of frequencies over the entire spatial domain, making them poorly suited to signals with spatially varying local spectra and often leading to slow convergence of high-frequency details. To address this issue, we propose an adaptive local frequency filtering method for Fourier-encoded INRs. The proposed method introduces a spatially varying parameter $α(\mathbf{x})$ to modulate encoded Fourier components, enabling a smooth transition among low-pass, band-pass, and high-pass behaviors at different spatial locations. We further analyze the effect of the proposed filter from the neural tangent kernel (NTK) perspective and provide an NTK-inspired interpretation of how it reshapes the effective kernel spectrum. Experiments on 2D image fitting, 3D shape representation, and sparse data reconstruction demonstrate that the proposed method consistently improves reconstruction quality and leads to faster optimization compared with fixed-frequency baselines. In addition, the learned $α(\mathbf{x})$ provides an intuitive visualization of spatially varying frequency preferences, which helps explain the behavior of the model on non-stationary signals. These results indicate that adaptive local frequency modulation is a practical enhancement for Fourier-encoded INRs.

关键词: Implicit Neural Representations, Fourier encoding, adaptive local frequency filtering, neural tangent kernel, spatially varying spectra, signal reconstruction, optimization acceleration

179. ❌ Deformation-based In-Context Learning for Point Cloud Understanding

作者: Chengxing Lin, Jinhong Deng, Yinjie Lei, Wen Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02845v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于点云理解中的In-Context Learning（ICL），提出了DeformPIC框架，通过变形查询点云来实现几何推理。论文的核心关键词是’In-context Learning’，与评分关键词中的’In-context Learning OR Many-shot Learning’高度相关，因此该关键词得10分。其他关键词均涉及大语言模型（LLMs）或相关技术（如MoE、RLHF、RAG等），而本文研究的是点云数据的ICL，属于计算机视觉和几何处理领域，未涉及语言模型或文本处理，因此其他关键词均得0分。

!!! tip deepseek-chat TL;DR

该论文针对点云In-Context Learning中现有方法依赖掩码重建导致几何推理不足和目标不一致的问题，提出了DeformPIC变形框架，在重建、去噪和配准任务上显著提升了性能，并展示了在未见数据分布上的泛化能力。

摘要翻译

点云上下文学习的最新进展已展现出强大的多任务处理能力。现有方法通常采用基于掩码点建模的范式来实现点云上下文学习。然而，基于掩码点建模的方法直接从掩码标记预测目标点云，未能充分利用几何先验，迫使模型仅能通过变换器的标记级相关性来推断空间结构与几何细节。此外，这些方法存在训练与推断目标不一致的问题：模型在训练时依赖推断阶段无法获取的目标侧信息来学习预测目标点云。为应对这些挑战，我们提出DeformPIC——一种基于形变的点云上下文学习框架。与依赖掩码重建的现有方法不同，DeformPIC学习在任务特定提示的指导下对查询点云进行形变，从而实现显式的几何推理并保持目标一致性。大量实验表明，DeformPIC在各项任务中均持续优于现有先进方法，在重建、去噪和配准任务上的平均倒角距离分别降低了1.6、1.8和4.7个点值。此外，我们引入了一个新的跨域基准测试来评估模型在未见数据分布上的泛化能力，DeformPIC在该测试中取得了最先进的性能表现。

摘要 (Abstract)

Recent advances in point cloud In-Context Learning (ICL) have demonstrated strong multitask capabilities. Existing approaches typically adopt a Masked Point Modeling (MPM)-based paradigm for point cloud ICL. However, MPM-based methods directly predict the target point cloud from masked tokens without leveraging geometric priors, requiring the model to infer spatial structure and geometric details solely from token-level correlations via transformers. Additionally, these methods suffer from a training-inference objective mismatch, as the model learns to predict the target point cloud using target-side information that is unavailable at inference time. To address these challenges, we propose DeformPIC, a deformation-based framework for point cloud ICL. Unlike existing approaches that rely on masked reconstruction, DeformPIC learns to deform the query point cloud under task-specific guidance from prompts, enabling explicit geometric reasoning and consistent objectives. Extensive experiments demonstrate that DeformPIC consistently outperforms previous state-of-the-art methods, achieving reductions of 1.6, 1.8, and 4.7 points in average Chamfer Distance on reconstruction, denoising, and registration tasks, respectively. Furthermore, we introduce a new out-of-domain benchmark to evaluate generalization across unseen data distributions, where DeformPIC achieves state-of-the-art performance.

关键词: In-Context Learning, point cloud, deformation, geometric reasoning, Chamfer Distance, reconstruction, denoising, registration

180. ❌ Factorized Multi-Resolution HashGrid for Efficient Neural Radiance Fields: Execution on Edge-Devices

作者: Kim Jun-Seong, Mingyu Kim, GeonU Kim, Tae-Hyun Oh, Jin-Hwa Kim 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02836v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于神经辐射场（NeRF）的on-device训练，提出了一种结合张量分解和哈希编码的新参数编码方法Fact-Hash。与关键词的相关性分析：1）与"Small Language Models OR SLMs OR On-device AI"高度相关（8分），因为论文核心是边缘设备上的AI部署和训练；2）与"Quantization OR Model Compression OR Low-bit Weights"有一定关联（5分），因为论文涉及内存效率优化和模型压缩技术；3）其他关键词（如LLMs、MoE、Scaling Laws等）与论文的3D表示和NeRF技术无直接关联，得0分。

!!! tip deepseek-chat TL;DR

论文提出了一种名为Fact-Hash的新型参数编码方法，通过结合张量分解和哈希编码技术，有效解决了神经辐射场在边缘设备上训练时的内存和计算资源限制问题，在保持渲染质量的同时显著提升了内存效率和能耗表现。

摘要翻译

本文提出Fact-Hash，一种用于设备端神经辐射场训练的新型参数编码方法。神经辐射场（Neural Radiance Fields, NeRF）已被证明在三维表征中具有关键作用，但其应用因需要大量计算资源而受限。设备端训练能够开辟广阔的应用领域，在通信受限、隐私保护需求以及快速适应频繁变化的场景中展现出优势。然而，有限的资源（GPU内存、存储空间和功耗）等挑战阻碍了其实际部署。为此，我们提出Fact-Hash，这是一种融合张量分解（Tensor Factorization）与哈希编码（Hash-encoding）技术的新型参数编码方法。该整合带来两大优势：利用丰富的高分辨率特征以及具备少样本鲁棒性。在Fact-Hash中，我们将三维坐标投影为多种低维形式（二维或一维），再应用哈希函数，最后将其聚合为单一特征。与前沿方法的对比评估表明，Fact-Hash在保持质量和渲染速度的同时，具有更优异的内存效率。相较于先前的编码方法，Fact-Hash在维持峰值信噪比（PSNR）值不变的情况下，节省了超过三分之一的内存使用量。设备端实验验证了Fact-Hash在计算效率和能耗方面相较于其他位置编码方法的优越性。这些发现表明，Fact-Hash是一种有前景的解决方案，能够改进特征网格表示、解决内存限制问题，并在多种应用中提升质量。项目页面：https://facthash.github.io/

摘要 (Abstract)

We introduce Fact-Hash, a novel parameter-encoding method for training on-device neural radiance fields. Neural Radiance Fields (NeRF) have proven pivotal in 3D representations, but their applications are limited due to large computational resources. On-device training can open large application fields, providing strength in communication limitations, privacy concerns, and fast adaptation to a frequently changing scene. However, challenges such as limited resources (GPU memory, storage, and power) impede their deployment. To handle this, we introduce Fact-Hash, a novel parameter-encoding merging Tensor Factorization and Hash-encoding techniques. This integration offers two benefits: the use of rich high-resolution features and the few-shot robustness. In Fact-Hash, we project 3D coordinates into multiple lower-dimensional forms (2D or 1D) before applying the hash function and then aggregate them into a single feature. Comparative evaluations against state-of-the-art methods demonstrate Fact-Hash’s superior memory efficiency, preserving quality and rendering speed. Fact-Hash saves memory usage by over one-third while maintaining the PSNR values compared to previous encoding methods. The on-device experiment validates the superiority of Fact-Hash compared to alternative positional encoding methods in computational efficiency and energy consumption. These findings highlight Fact-Hash as a promising solution to improve feature grid representation, address memory constraints, and improve quality in various applications. Project page: https://facthash.github.io/

关键词: Neural Radiance Fields, On-device Training, Parameter Encoding, Tensor Factorization, Hash Encoding, Memory Efficiency, Edge Devices, 3D Representation

作者: Hao Ren, Zetong Bi, Yiming Zeng, Zhaoliang Wan, Lu Qi, Hui Cheng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02829v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文STRNet专注于机器人视觉导航，提出了一种基于动态图聚合的时空表示框架，用于增强视觉编码和动作预测。所有评分关键词均涉及大模型、深度学习技术原理或特定AI应用领域（如生物信息学），而该论文研究的是计算机视觉和机器人控制中的具体算法（时空融合、图推理、卷积模块），未涉及任何大模型技术、训练方法、推理优化或指定的科学AI应用。因此，所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对机器人视觉导航中因简单特征编码和时序池化导致细粒度时空结构丢失的问题，提出了一种统一的时空表示框架STRNet，通过空间图推理和混合时序移位模块增强视觉编码，实验证明其能持续提升导航性能并为目标条件控制提供通用视觉骨干。

摘要翻译

视觉导航要求机器人基于第一人称视角的视觉观测序列抵达指定目标（如图像）。尽管近期基于学习的方法已取得显著进展，但这些方法通常侧重于改进策略头或决策策略，而依赖于简单的特征编码器和时序池化来表示视觉输入。这导致细粒度的空间与时间结构信息丢失，最终限制了动作预测与进度估计的准确性。本文提出一种统一的时空表征框架，以增强机器人导航的视觉编码能力。我们的方法从图像序列与目标观测中提取特征，并通过设计的时空融合模块进行特征融合。该模块在每帧内部进行空间图推理，并利用混合时序移位模块结合多分辨率差分感知卷积来建模时序动态。实验结果表明，我们的方法能持续提升导航性能，并为目标条件控制提供了一个可泛化的视觉骨干网络。代码发布于 \href{https://github.com/hren20/STRNet}{https://github.com/hren20/STRNet}。

摘要 (Abstract)

Visual navigation requires the robot to reach a specified goal such as an image, based on a sequence of first-person visual observations. While recent learning-based approaches have made significant progress, they often focus on improving policy heads or decision strategies while relying on simplistic feature encoders and temporal pooling to represent visual input. This leads to the loss of fine-grained spatial and temporal structure, ultimately limiting accurate action prediction and progress estimation. In this paper, we propose a unified spatio-temporal representation framework that enhances visual encoding for robotic navigation. Our approach extracts features from both image sequences and goal observations, and fuses them using the designed spatio-temporal fusion module. This module performs spatial graph reasoning within each frame and models temporal dynamics using a hybrid temporal shift module combined with multi-resolution difference-aware convolution. Experimental results demonstrate that our approach consistently improves navigation performance and offers a generalizable visual backbone for goal-conditioned control. Code is available at \href{https://github.com/hren20/STRNet}{https://github.com/hren20/STRNet}.

关键词: Visual Navigation, Spatio-Temporal Representation, Dynamic Graph Aggregation, Feature Fusion, Temporal Dynamics, Goal-Conditioned Control, Robotic Navigation

182. ❌ MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling

作者: Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu, Jin Gao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02817v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于视频生成领域，提出MMPhysVideo框架通过联合多模态建模提升视频的物理合理性。虽然使用了视频扩散模型（VDMs）和视觉语言模型（VLM），但研究重点在于视频生成中的物理一致性、多模态表示和高效推理，而非大语言模型（LLMs）或深度学习技术原理的创新。所有关键词均与大语言模型、其训练方法、推理优化、对齐技术、代理系统或特定科学领域应用直接相关，而本论文未涉及这些主题，因此所有关键词相关度评分为0。

!!! tip deepseek-chat TL;DR

该论文解决了视频扩散模型生成物理不一致内容的问题，提出了MMPhysVideo框架，通过联合多模态建模和高效蒸馏方法，在多个基准测试中显著提升了视频生成的物理合理性和视觉质量，达到了最先进的性能。

摘要翻译

尽管在生成视觉震撼内容方面取得了进展，但仅基于像素重建的视频扩散模型（VDMs）常因物理不一致性而产生瑕疵。为解决此问题，我们提出了MMPhysVideo，这是首个通过联合多模态建模来规模化提升视频生成物理合理性的框架。我们将感知线索——特别是语义、几何和时空轨迹——重构为统一的伪RGB格式，使VDMs能够直接捕捉复杂的物理动态。为减轻跨模态干扰，我们提出了一种双向控制教师架构，该架构利用并行分支完全解耦RGB与感知处理，并采用两个零初始化控制链接逐步学习像素级一致性。为提升推理效率，教师的物理先验知识通过表示对齐被蒸馏至单流学生模型中。此外，我们开发了MMPhysPipe，一个可扩展的数据构建与标注流程，专门用于构建富含物理信息的多模态数据集。MMPhysPipe采用基于视觉证据链规则引导的视觉语言模型（VLM）来精确定位物理主体，使专家模型能够提取多粒度感知信息。在无需额外推理成本的情况下，MMPhysVideo在多个基准测试中持续提升了物理合理性与视觉质量，相较于现有方法实现了最先进的性能。

摘要 (Abstract)

Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher’s physical prior is distilled into a single-stream student model via representation alignment. Furthermore, we present MMPhysPipe, a scalable data curation and annotation pipeline tailored for constructing physics-rich multimodal datasets. MMPhysPipe employs a vision-language model (VLM) guided by a chain-of-visual-evidence rule to pinpoint physical subjects, enabling expert models to extract multi-granular perceptual information. Without additional inference costs, MMPhysVideo consistently improves physical plausibility and visual quality over advanced models across various benchmarks and achieves state-of-the-art performance compared to existing methods.

关键词: video generation, physical plausibility, multimodal modeling, video diffusion models, representation alignment, data curation pipeline, vision-language model, state-of-the-art performance

183. ❌ UNICA: A Unified Neural Framework for Controllable 3D Avatars

作者: Jiahe Zhu, Xinyao Wang, Yiyu Zhuang, Yanwen Wang, Jing Tian, Yao Yao, Hao Zhu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02799v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文UNICA专注于3D虚拟人控制，使用扩散模型和3D高斯泼溅技术，属于计算机图形学/计算机视觉领域。所有评分关键词均与大模型、深度学习技术原理或科学AI应用相关，而本文不涉及任何大模型技术（如LLM、MoE、SFT、RLHF等），也不属于生物信息学等科学AI应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为UNICA的统一神经框架，通过动作条件扩散模型和3D高斯泼溅技术，实现了无需骨架的3D虚拟人自动生成与控制，简化了传统复杂的虚拟人创建流程。

摘要翻译

可控三维人体化身在三维游戏、元宇宙及增强/虚拟现实场景中具有广泛应用。创建此类三维化身的传统方法需要冗长复杂的流程，涵盖外观建模、运动规划、骨骼绑定与物理模拟等多个环节。本文提出UNICA（统一神经可控化身），这是一种免骨骼绑定的生成模型，将全部化身控制组件整合至单一神经框架中。通过接收类似电子游戏控制的键盘输入，UNICA借助基于二维位置图运行的动作条件扩散模型，生成三维化身几何结构的下一帧。随后通过点变换器将生成的几何结构映射至三维高斯溅射模型，实现高保真自由视角渲染。该方法无需手动设计物理模拟即可自然捕捉头发与宽松衣物的动态效果，并支持超长序列自回归生成。据我们所知，UNICA是首个将“运动规划、骨骼绑定、物理模拟与渲染”工作流一体化的模型。代码发布于https://github.com/zjh21/UNICA。

摘要 (Abstract)

Controllable 3D human avatars have found widespread applications in 3D games, the metaverse, and AR/VR scenarios. The conventional approach to creating such a 3D avatar requires a lengthy, intricate pipeline encompassing appearance modeling, motion planning, rigging, and physical simulation. In this paper, we introduce UNICA (UNIfied neural Controllable Avatar), a skeleton-free generative model that unifies all avatar control components into a single neural framework. Given keyboard inputs akin to video game controls, UNICA generates the next frame of a 3D avatar’s geometry through an action-conditioned diffusion model operating on 2D position maps. A point transformer then maps the resulting geometry to 3D Gaussian Splatting for high-fidelity free-view rendering. Our approach naturally captures hair and loose clothing dynamics without manually designed physical simulation, and supports extra-long autoregressive generation. To the best of our knowledge, UNICA is the first model to unify the workflow of “motion planning, rigging, physical simulation, and rendering”. Code is released at https://github.com/zjh21/UNICA.

关键词: 3D avatars, controllable generation, diffusion model, 3D Gaussian Splatting, skeleton-free, autoregressive generation, motion planning, free-view rendering

184. ❌ CMCC-ReID: Cross-Modality Clothing-Change Person Re-Identification

作者: Haoxuan Xu, Hanzi Wang, Guanglin Niu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02808v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于计算机视觉中的行人重识别任务，提出新任务CMCC-ReID并构建数据集SYSU-CMCC，开发了PIA网络解决跨模态和换衣的双重挑战。所有评分关键词均涉及大模型、深度学习技术原理或AI在科学领域的应用，而本文是传统的计算机视觉研究，未涉及大模型、LLM相关技术，也未应用于生物信息学等科学领域，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了跨模态换衣行人重识别新任务CMCC-ReID，构建了SYSU-CMCC数据集，并开发了渐进式身份对齐网络PIA，在解决模态差异和衣物变化双重挑战上显著优于现有方法。

摘要翻译

行人重识别（Person Re-Identification，ReID）在长期监控场景下面临着模态差异与衣着变化的严峻挑战。尽管现有研究在可见光-红外行人重识别（Visible-Infrared ReID，VI-ReID）或衣着变化行人重识别（Clothing-Change ReID，CC-ReID）任一领域已取得显著进展，但现实监控系统往往同时面临这两类挑战。为应对这一被忽视却实际存在的问题，我们定义了一项新任务，即跨模态衣着变化行人重识别（Cross-Modality Clothing-Change Re-Identification，CMCC-ReID），其目标是在模态与衣着双重变化下进行行人匹配。为推进该方向的研究，我们构建了一个新基准数据集SYSU-CMCC，其中每个身份均在可见光与红外域下以不同着装被捕获，体现了长期监控的双重异质性。为解决CMCC-ReID问题，我们提出了一种渐进式身份对齐网络（Progressive Identity Alignment Network，PIA），逐步缓解衣着变化与模态差异带来的影响。具体而言，双分支解耦学习（Dual-Branch Disentangling Learning，DBDL）模块将身份相关线索与衣着相关因素分离，以实现衣着无关的特征表达；双向原型学习（Bi-Directional Prototype Learning，BPL）模块在嵌入空间中进行模态内与模态间对比，以弥合模态间隙并进一步抑制衣着干扰。在SYSU-CMCC数据集上的大量实验表明，PIA为此新任务建立了坚实的基线，并显著优于现有方法。

摘要 (Abstract)

Person Re-Identification (ReID) faces severe challenges from modality discrepancy and clothing variation in long-term surveillance scenario. While existing studies have made significant progress in either Visible-Infrared ReID (VI-ReID) or Clothing-Change ReID (CC-ReID), real-world surveillance system often face both challenges simultaneously. To address this overlooked yet realistic problem, we define a new task, termed Cross-Modality Clothing-Change Re-Identification (CMCC-ReID), which targets pedestrian matching across variations in both modality and clothing. To advance research in this direction, we construct a new benchmark SYSU-CMCC, where each identity is captured in both visible and infrared domains with distinct outfits, reflecting the dual heterogeneity of long-term surveillance. To tackle CMCC-ReID, we propose a Progressive Identity Alignment Network (PIA) that progressively mitigates the issues of clothing variation and modality discrepancy. Specifically, a Dual-Branch Disentangling Learning (DBDL) module separates identity-related cues from clothing-related factors to achieve clothing-agnostic representation, and a Bi-Directional Prototype Learning (BPL) module performs intra-modality and inter-modality contrast in the embedding space to bridge the modality gap while further suppressing clothing interference. Extensive experiments on the SYSU-CMCC dataset demonstrate that PIA establishes a strong baseline for this new task and significantly outperforms existing methods.

关键词: Person Re-Identification, Cross-Modality, Clothing-Change, Visible-Infrared, Progressive Identity Alignment, Dual-Branch Disentangling Learning, Bi-Directional Prototype Learning, SYSU-CMCC dataset

185. ❌ CANDLE: Illumination-Invariant Semantic Priors for Color Ambient Lighting Normalization

作者: Rong-Lin Jian, Ting-Yao Chen, Yu-Fan Lin, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02785v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文CANDLE专注于计算机视觉中的颜色环境光照归一化问题，利用DINOv3的自监督特征作为语义先验来解决多色光照下的颜色恢复挑战。所有评分关键词均与大语言模型、深度学习技术原理创新或AI在科学领域的应用直接相关，而本文研究的是计算机视觉中的特定图像处理任务，未涉及大模型技术、深度学习原理创新或AI在生物医药等科学领域的应用，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出CANDLE方法，利用DINOv3的自监督特征作为光照鲁棒的语义先验，解决了多色光照下颜色环境光照归一化的挑战，在CL3AN数据集上取得了比现有方法高1.22 dB PSNR的性能提升。

摘要翻译

在多色光照条件下进行色彩环境光照归一化极具挑战性，主要难点在于严重的色偏、高光饱和以及材料依赖的反射特性。当光照引起的色彩偏差占主导时，现有的几何先验和低层次先验不足以恢复物体本质颜色。我们观察到，DINOv3的自监督特征在有色光输入与环境光照真实值之间保持高度一致性，这启发了我们将其用作光照鲁棒的语义先验。我们提出CANDLE（基于DINO层增强的色彩环境光照归一化方法），该方法引入了DINO全层引导（D.O.G.）机制，将多层DINOv3特征自适应注入到连续的编码器阶段，并设计了色彩频率细化模块（BFACG + SFFB）以抑制解码器端的色彩崩溃与细节污染。在CL3AN数据集上的实验表明，本方法比现有最优方法提升了+1.22 dB的PSNR指标。CANDLE在NTIRE 2026 ALN色彩光照挑战赛中取得第三名，并在白光赛道保真度排名第二且获得最低FID分数，证实了其在色度主导与亮度主导的光照条件下均具备强大的泛化能力。代码已开源：https://github.com/ron941/CANDLE。

摘要 (Abstract)

Color ambient lighting normalization under multi-colored illumination is challenging due to severe chromatic shifts, highlight saturation, and material-dependent reflectance. Existing geometric and low-level priors are insufficient for recovering object-intrinsic color when illumination-induced chromatic bias dominates. We observe that DINOv3’s self-supervised features remain highly consistent between colored-light inputs and ambient-lit ground truth, motivating their use as illumination-robust semantic priors. We propose CANDLE (Color Ambient Normalization with DINO Layer Enhancement), which introduces DINO Omni-layer Guidance (D.O.G.) to adaptively inject multi-layer DINOv3 features into successive encoder stages, and a color-frequency refinement design (BFACG + SFFB) to suppress decoder-side chromatic collapse and detail contamination. Experiments on CL3AN show a +1.22 dB PSNR gain over the strongest prior method. CANDLE achieves 3rd place on the NTIRE 2026 ALN Color Lighting Challenge and 2nd place in fidelity on the White Lighting track with the lowest FID, confirming strong generalization across both chromatic and luminance-dominant illumination conditions. Code is available at https://github.com/ron941/CANDLE.

关键词: Color ambient lighting normalization, DINOv3, Semantic priors, Illumination-robust, Multi-colored illumination, Chromatic bias, Image enhancement, Computer vision

186. ❌ Generalized Small Object Detection:A Point-Prompted Paradigm and Benchmark

作者: Haoran Zhu, Wen Yang, Guangyou Yang, Chang Xu, Ruixiang Zhang, Fang Xu, Haijian Zhang, Gui-Song Xia 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02773v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于计算机视觉领域的小目标检测（SOD），提出新的数据集TinySet-9M和点提示检测范式P2SOD。所有评分关键词均与大语言模型、深度学习技术原理或AI for Science直接相关，而本文研究的是传统视觉检测任务，未涉及大模型、MoE、缩放定律、训练技术、推理优化、智能体、量化等关键词内容，也未应用于生物信息学等科学领域，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文针对小目标检测中数据稀缺和语义表示弱的问题，提出了大规模多领域数据集TinySet-9M和点提示检测范式P2SOD，并开发了DEAL框架，在严格定位指标上实现了31.4%的相对性能提升。

摘要翻译

小目标检测因像素信息极度有限且目标边界模糊而持续面临挑战。这些特性导致标注难度高、大规模高质量数据集稀缺，以及小目标固有的语义表征薄弱。本研究首先通过引入首个面向小目标检测的大规模多领域数据集TinySet-9M，以解决数据匮乏问题。在填补大规模数据集空白的基础上，我们建立了评估现有高效标注检测方法在小目标检测中有效性的基准。评估结果表明，微弱的视觉线索会进一步加剧高效标注方法在小目标检测中的性能衰减，这凸显了高效标注小目标检测领域的关键挑战。其次，为应对语义表征不足的局限，我们突破训练阶段特征增强的范式，提出了一种称为点提示小目标检测的新范式。该范式在推理阶段引入稀疏点提示作为类别级定位的高效信息桥梁，从而实现语义增强。基于P2SOD范式与大规模TinySet-9M数据集，我们进一步开发了DEAL框架——一个可扩展、可迁移的点提示检测框架，能够从大规模数据中学习鲁棒的提示条件表征。在推理阶段仅需单次点击，DEAL在TinySet-9M数据集上采用严格定位指标（如AP75）评估时，相较全监督基线实现了31.4%的相对性能提升，同时能有效泛化至未见过的类别与数据集。本项目可通过https://zhuhaoraneis.github.io/TinySet-9M/访问。

摘要 (Abstract)

Small object detection (SOD) remains challenging due to extremely limited pixels and ambiguous object boundaries. These characteristics lead to challenging annotation, limited availability of large-scale high-quality datasets, and inherently weak semantic representations for small objects. In this work, we first address the data limitation by introducing TinySet-9M, the first large-scale, multi-domain dataset for small object detection. Beyond filling the gap in large-scale datasets, we establish a benchmark to evaluate the effectiveness of existing label-efficient detection methods for small objects. Our evaluation reveals that weak visual cues further exacerbate the performance degradation of label-efficient methods in small object detection, highlighting a critical challenge in label-efficient SOD. Secondly, to tackle the limitation of insufficient semantic representation, we move beyond training-time feature enhancement and propose a new paradigm termed Point-Prompt Small Object Detection (P2SOD). This paradigm introduces sparse point prompts at inference time as an efficient information bridge for category-level localization, enabling semantic augmentation. Building upon the P2SOD paradigm and the large-scale TinySet-9M dataset, we further develop DEAL (DEtect Any smalL object), a scalable and transferable point-prompted detection framework that learns robust, prompt-conditioned representations from large-scale data. With only a single click at inference time, DEAL achieves a 31.4% relative improvement over fully supervised baselines under strict localization metrics (e.g., AP75) on TinySet-9M, while generalizing effectively to unseen categories and unseen datasets. Our project is available at https://zhuhaoraneis.github.io/TinySet-9M/.

关键词: small object detection, point-prompted paradigm, TinySet-9M dataset, label-efficient detection, semantic augmentation, DEAL framework, generalization, localization metrics

187. ❌ A Unified Perspective on Adversarial Membership Manipulation in Vision Models

作者: Ruize Gao, Kaiwen Zhou, Yongqiang Chen, Feng Liu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02780v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机视觉模型的对抗性成员推理攻击和防御，研究领域为计算机视觉安全与隐私。所有评分关键词均针对大语言模型（LLM）和深度学习技术原理的创新，包括模型架构、训练方法、推理优化、对齐技术、代理系统等特定方向。论文内容完全不涉及这些关键词：1）研究对象是视觉模型而非语言模型；2）不讨论MoE、SLMs、Scaling Laws等特定技术；3）不涉及预训练、微调、对齐、RLHF等训练方法；4）不涉及RAG、长上下文、注意力优化等推理技术；5）不涉及思维链、系统2思考、MCTS等推理方法；6）不涉及自我改进、代理、工具使用等多智能体系统；7）不涉及量化、推测解码等优化技术；8）不涉及幻觉缓解、可解释性等评估方向；9）不涉及世界模型、模型融合、上下文学习等高级能力；10）不涉及科学AI的具体应用。因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文首次系统研究了视觉模型中对抗性成员操纵现象，发现微小扰动可将非成员图像伪装成成员，并提出基于梯度几何特征的检测方法和鲁棒推理框架来有效防御此类攻击。

摘要翻译

成员推理攻击旨在判定特定数据点是否属于模型训练集，是评估视觉模型隐私泄露的有效工具。然而，现有成员推理攻击方法隐含假设查询输入是诚实的，其对抗鲁棒性尚未得到充分探索。我们发现，针对视觉模型的成员推理攻击暴露了一个先前被忽视的对抗面：对抗性成员操纵，即通过难以察觉的扰动可稳定地将非成员图像推入先进成员推理攻击的“成员”区域。本文首次通过分析其机制与影响，为这一现象提供了统一视角。我们首先证明对抗性成员伪造在不同架构与数据集上均持续有效。随后，我们揭示了一种独特的几何特征——梯度范数塌缩轨迹——尽管伪造成员与真实成员的语义表征几乎相同，该特征仍能可靠地区分二者。基于此发现，我们提出了一种基于梯度几何信号的理论检测策略，并开发了一个能显著减轻对抗操纵的鲁棒推理框架。大量实验表明，成员伪造具有广泛有效性，而我们的检测与鲁棒推理策略显著提升了防御能力。本研究首次建立了视觉模型中对抗性成员操纵的完整分析框架。

摘要 (Abstract)

Membership inference attacks (MIAs) aim to determine whether a specific data point was part of a model’s training set, serving as effective tools for evaluating privacy leakage of vision models. However, existing MIAs implicitly assume honest query inputs, and their adversarial robustness remains unexplored. We show that MIAs for vision models expose a previously overlooked adversarial surface: adversarial membership manipulation, where imperceptible perturbations can reliably push non-member images into the “member” region of state-of-the-art MIAs. In this paper, we provide the first unified perspective on this phenomenon by analyzing its mechanism and implications. We begin by demonstrating that adversarial membership fabrication is consistently effective across diverse architectures and datasets. We then reveal a distinctive geometric signature - a characteristic gradient-norm collapse trajectory - that reliably separates fabricated from true members despite their nearly identical semantic representations. Building on this insight, we introduce a principled detection strategy grounded in gradient-geometry signals and develop a robust inference framework that substantially mitigates adversarial manipulation. Extensive experiments show that fabrication is broadly effective, while our detection and robust inference strategies significantly enhance resilience. This work establishes the first comprehensive framework for adversarial membership manipulation in vision models.

关键词: Membership Inference Attacks, Adversarial Membership Manipulation, Vision Models, Privacy Leakage, Gradient Geometry, Robust Inference, Adversarial Robustness, Detection Strategy

188. ❌ InverseDraping: Recovering Sewing Patterns from 3D Garment Surfaces via BoxMesh Bridging

作者: Leyang Jin, Zirong Jin, Zisheng Ye, Haokai Pang, Xiaoguang Han, Yujian Zheng, Hao Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02764v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于计算机图形学中的逆向工程问题，研究如何从3D服装表面恢复缝纫图案。论文的核心技术是几何处理、结构化表示（BoxMesh）和自回归建模，完全不涉及大语言模型、深度学习技术原理创新或任何评分关键词中的技术。论文属于计算机视觉/图形学领域，而非大模型或深度学习技术研究，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为InverseDraping的两阶段框架，通过BoxMesh中间表示解决了从3D服装表面恢复参数化缝纫图案的逆问题，在GarmentCodeData基准测试中实现了最先进的性能。

摘要翻译

从悬垂的三维服装中重建缝纫版型是人体数字化研究中的一个挑战性问题。与使用成熟的物理模拟引擎对设计好的缝纫版型进行悬垂模拟这一已被深入研究的正向过程相比，从变形的服装几何体中恢复参数化二维版型的逆向过程，对于现有方法而言本质上仍是一个不适定问题。我们提出了一个两阶段框架，其核心是一种结构化的中间表示——BoxMesh，它是连接三维服装几何体与参数化缝纫版型的关键桥梁。BoxMesh在三维空间中同时编码了服装级别的几何信息与裁片级别的结构信息，并明确地将裁片固有的几何形状和缝合拓扑从悬垂引起的变形中解耦出来。这种表示为问题施加了基于物理的结构约束，显著降低了模糊性。在第一阶段，一个几何驱动的自回归模型从输入的三维服装中推断出BoxMesh。在第二阶段，一个语义感知的自回归模型将BoxMesh解析为参数化缝纫版型。我们采用自回归建模，以自然地处理裁片配置与缝合关系所具有的可变长度和结构化特性。这种分解将几何反演与结构化版型推断分离开来，从而实现了更准确、更稳健的重建。大量实验表明，我们的方法在GarmentCodeData基准测试上达到了最先进的性能，并能有效地泛化到真实世界扫描数据和单视图图像。

摘要 (Abstract)

Recovering sewing patterns from draped 3D garments is a challenging problem in human digitization research. In contrast to the well-studied forward process of draping designed sewing patterns using mature physical simulation engines, the inverse process of recovering parametric 2D patterns from deformed garment geometry remains fundamentally ill-posed for existing methods. We propose a two-stage framework that centers on a structured intermediate representation, BoxMesh, which serves as the key to bridging the gap between 3D garment geometry and parametric sewing patterns. BoxMesh encodes both garment-level geometry and panel-level structure in 3D, while explicitly disentangling intrinsic panel geometry and stitching topology from draping-induced deformations. This representation imposes a physically grounded structure on the problem, significantly reducing ambiguity. In Stage I, a geometry-driven autoregressive model infers BoxMesh from the input 3D garment. In Stage II, a semantics-aware autoregressive model parses BoxMesh into parametric sewing patterns. We adopt autoregressive modeling to naturally handle the variable-length and structured nature of panel configurations and stitching relationships. This decomposition separates geometric inversion from structured pattern inference, leading to more accurate and robust recovery. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the GarmentCodeData benchmark and generalizes effectively to real-world scans and single-view images.

关键词: sewing pattern recovery, 3D garment surfaces, BoxMesh representation, autoregressive modeling, inverse draping, parametric patterns, geometry-driven inference, semantics-aware parsing

189. ❌ Differentiable Stroke Planning with Dual Parameterization for Efficient and High-Fidelity Painting Creation

作者: Jinfan Liu, Wuze Zhang, Zhangli Hu, Zhehan Zhao, Ye Chen, Bingbing Ni 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02752v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是基于笔画的渲染技术，提出了一种结合离散折线和连续贝塞尔控制点的双重表示方法，用于优化笔画布局和减少优化时间。论文内容完全聚焦于计算机图形学中的笔画渲染和优化算法，不涉及任何大语言模型、深度学习技术原理、AI科学应用或相关技术关键词。所有评分关键词均与大语言模型、深度学习技术、AI科学应用相关，与该论文的图形学渲染主题完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对基于笔画的渲染中离散笔画放置易陷入局部最优和可微分优化器缺乏结构感知的问题，提出了一种结合离散折线和连续贝塞尔控制点的双重表示方法，实现了协同优化，减少了30-50%的笔画数量，提高了结构一致性，并将优化时间缩短了30-40%。

摘要翻译

在基于笔画的渲染中，由于离散笔画布局的离散性，搜索方法常陷入局部最优，而可微分优化器则缺乏结构感知能力，易产生非结构化的布局。为弥合这一差距，我们提出了一种双重表示方法，通过双向映射机制将离散折线与连续的贝塞尔控制点耦合。该表示支持协同优化：局部梯度可细化全局笔画结构，而内容感知的笔画提议则有助于跳出不良局部最优解。我们的表示进一步支持受高斯溅射启发的初始化方式，实现了图像中高度并行的笔画优化。实验表明，相较于现有可微分矢量化方法，我们的方法将笔画数量减少了30-50%，获得了更具结构一致性的布局，提升了重建质量，同时将优化时间缩短了30-40%。

摘要 (Abstract)

In stroke-based rendering, search methods often get trapped in local minima due to discrete stroke placement, while differentiable optimizers lack structural awareness and produce unstructured layouts. To bridge this gap, we propose a dual representation that couples discrete polylines with continuous Bézier control points via a bidirectional mapping mechanism. This enables collaborative optimization: local gradients refine global stroke structures, while content-aware stroke proposals help escape poor local optima. Our representation further supports Gaussian-splatting-inspired initialization, enabling highly parallel stroke optimization across the image. Experiments show that our approach reduces the number of strokes by 30-50%, achieves more structurally coherent layouts, and improves reconstruction quality, while cutting optimization time by 30-40% compared to existing differentiable vectorization methods.

关键词: stroke-based rendering, differentiable optimization, dual representation, Bézier control points, stroke planning, vectorization, Gaussian-splatting, parallel optimization

190. ❌ DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection

作者: Siheng Wang, Yanshu Li, Bohan Hu, Zhengdao Li, Haibo Zhan, Linshan Li, Weiming Liu, Ruizhi Qian, Guangxin Wu, Hao Zhang, Jifeng Shen, Piotr Koniusz, Zhengtao Yao, Junhao Dong, Qiang Sun 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02753v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文DeCo-DETR专注于开放词汇目标检测（OVOD），属于计算机视觉领域，而非大语言模型（LLMs）的直接研究。然而，它利用预训练的大型视觉语言模型（LVLMs）和CLIP来构建语义原型空间，这涉及基础模型（如CLIP）的应用，因此与’Large Language Models OR LLMs OR Foundation Models’有一定关联（5分）。同时，论文提到使用预训练的LVLMs，这涉及’Pre-training OR Continual Pre-training OR Domain Adaptation’（5分），并通过对齐策略优化模型，与’Instruction Tuning OR Alignment OR Value Alignment’相关（5分）。其他关键词如MoE、SLMs、Scaling Laws、RLHF等均未在论文中提及或相关，因此评分为0分。论文的核心是视觉检测框架，而非大模型技术原理或科学AI应用，因此相关性有限。

!!! tip deepseek-chat TL;DR

该论文提出DeCo-DETR，一个解耦认知的视觉中心框架，通过构建层次化语义原型空间和解耦训练策略，解决了开放词汇目标检测中计算效率低和泛化能力受限的问题，在标准基准上实现了竞争性的零样本检测性能和显著提升的推理效率。

摘要翻译

开放词汇目标检测（Open-vocabulary Object Detection, OVOD）使模型能够识别超出预定义类别的物体，但现有方法在实际部署中仍存在局限。一方面，多模态设计由于在推理时依赖文本编码器，往往带来显著的计算开销；另一方面，紧密耦合的训练目标在封闭集检测精度与开放世界泛化能力之间引入了权衡。为此，我们提出解耦认知DETR（Decoupled Cognition DETR, DeCo-DETR），这是一个以视觉为中心的框架，通过统一的解耦范式应对上述挑战。DeCo-DETR不依赖在线文本编码，而是基于预训练的大型视觉语言模型（LVLMs）生成的区域级描述，并借助CLIP进行对齐，构建了一个层次化的语义原型空间，从而实现高效且可复用的语义表示。基于此表示，该框架进一步通过解耦训练策略将语义推理与定位分离，将对齐任务和检测任务分解为并行的优化流。在标准OVOD基准上的大量实验表明，DeCo-DETR在实现竞争力的零样本检测性能的同时，显著提升了推理效率。这些结果凸显了将语义认知与检测解耦的有效性，为可扩展的OVOD系统提供了实用方向。

摘要 (Abstract)

Open-vocabulary Object Detection (OVOD) enables models to recognize objects beyond predefined categories, but existing approaches remain limited in practical deployment. On the one hand, multimodal designs often incur substantial computational overhead due to their reliance on text encoders at inference time. On the other hand, tightly coupled training objectives introduce a trade-off between closed-set detection accuracy and open-world generalization. Thus, we propose Decoupled Cognition DETR (DeCo-DETR), a vision-centric framework that addresses these challenges through a unified decoupling paradigm. Instead of depending on online text encoding, DeCo-DETR constructs a hierarchical semantic prototype space from region-level descriptions generated by pre-trained LVLMs and aligned via CLIP, enabling efficient and reusable semantic representation. Building upon this representation, the framework further disentangles semantic reasoning from localization through a decoupled training strategy, which separates alignment and detection into parallel optimization streams. Extensive experiments on standard OVOD benchmarks demonstrate that DeCo-DETR achieves competitive zero-shot detection performance while significantly improving inference efficiency. These results highlight the effectiveness of decoupling semantic cognition from detection, offering a practical direction for scalable OVOD systems.

关键词: Open-vocabulary Object Detection, Decoupled Cognition, DETR, Semantic Prototype Space, Vision-centric Framework, Zero-shot Detection, Inference Efficiency, LVLMs

191. ❌ Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks

作者: Jonghun Kim, Sinyoung Ra, Hyunjin Park 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02748v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文核心是LLM在医学影像（脑MRI）领域的应用，涉及视觉指令微调（与SFT和Instruction Tuning高度相关），属于AI for Science范畴。其他关键词如MoE、量化、推理加速等未涉及。

!!! tip deepseek-chat TL;DR

该论文提出LLaBIT模型，通过视觉指令微调将大语言模型扩展应用于脑MRI的多种临床任务（报告生成、视觉问答、图像分割和图像翻译），并在多项任务上超越专用模型。

摘要翻译

大型语言模型（LLMs）在语言推理方面展现出卓越能力，并日益擅长视觉-语言任务。将图像标记（image tokens）整合到Transformer架构中，实现了直接的视觉输入与输出，推动了从图像到文本描述至文本到图像生成的研究进展。然而，简单的文本到图像生成在临床应用中价值有限。在医学影像领域，诸如用于定位病变的图像分割或用于重建缺失序列的图像转换等任务具有更高的临床重要性。尽管如此，将这些多样化且临床相关的任务整合到一个统一、多功能的语言模型中，仍属未探索领域。我们的方法LLaBIT（用于脑部图像转换的大型语言模型）将LLMs的视觉推理能力扩展至脑部MRI领域的这些具有临床意义的任务中。为减轻图像标记化过程中固有的空间信息损失，我们引入了一种重用图像编码器特征图的机制，以最小化数据退化。同时，我们利用LLMs在严格预定义指令下生成文本数据，以增强脑部MRI中有限的图像-文本配对数据。我们在五个脑部MRI数据集上，针对四项不同任务——报告生成、视觉问答、图像分割和图像转换——对我们的方法进行了全面评估。我们的模型不仅在所有任务中均表现出优越性能，而且在直接比较中超越了针对特定任务设计的专用模型，凸显了其高效性与多功能性。

摘要 (Abstract)

LLMs have demonstrated remarkable capabilities in linguistic reasoning and are increasingly adept at vision-language tasks. The integration of image tokens into transformers has enabled direct visual input and output, advancing research from image-to-text descriptions to text-to-image generation. However, simple text-to-image generation holds limited clinical utility. In medical imaging, tasks such as image segmentation for localizing pathologies or image translation for reconstructing missing sequences have much greater clinical importance. Despite this, integrating these diverse, clinically relevant tasks within a single, versatile language model remains unexplored. Our method, LLaBIT (Large Language Model for Brain Image Translation), extends the visual reasoning of LLMs to these clinically meaningful tasks in the brain MRI domain. To mitigate the spatial information loss inherent in image tokenization, we incorporate a mechanism to reuse feature maps from the image encoder, minimizing data degradation. We also generate text data using LLMs with strict predefined instructions to augment limited image-text paired data in brain MRI. We comprehensively evaluated our method on five brain MRI datasets across four distinct tasks: report generation, visual question answering, image segmentation, and image translation. Our model not only demonstrated superior performance across all tasks but also outperformed specialized, task-specific models in direct comparisons, highlighting its efficacy and versatility

关键词: Large Language Models, Visual Instruction Finetuning, Brain MRI, Medical Imaging, Image Segmentation, Image Translation, Multimodal AI, Clinical Applications

192. ❌ Task-Guided Prompting for Unified Remote Sensing Image Restoration

作者: Wenli Huang, Yang Wu, Xiaomeng Xin, Zhihong Liu, Jinjun Wang, Ye Deng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02742v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于遥感图像修复（RSIR）领域，提出了一种名为TGPNet的统一框架来处理多种退化类型（去噪、去云、去阴影、去模糊、SAR去斑）。论文的核心创新是任务引导提示（TGP）策略，这是一种基于深度学习的计算机视觉方法，用于多任务学习。所有关键词（除了最后一个）都直接与大语言模型（LLMs）或相关的深度学习技术（如MoE、缩放定律、对齐、推理、代理等）相关，而论文完全不涉及LLMs或这些特定技术。最后一个关键词“AI for Science OR Bioinformatics OR Cheminformatics”得5分，因为遥感图像修复可以被视为“AI for Science”的一个应用领域（科学数据分析），但它并非核心焦点，且不涉及生物信息学或化学信息学。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为TGPNet的统一深度学习框架，通过任务引导提示策略解决了遥感图像中多种退化类型（如噪声、云层、阴影）的修复问题，并在构建的多模态基准上实现了最先进的性能。

摘要翻译

遥感图像复原（RSIR）对于从退化观测中恢复高保真影像、实现精准的下游分析至关重要。然而，现有方法大多专注于同质数据内的单一退化类型，这限制了其在真实场景中的实用性，因为实际场景中常存在跨越不同光谱波段或传感器模态的多种退化，形成了显著的操作瓶颈。为弥补这一根本性不足，我们提出了TGPNet，一个能够在一个统一架构内处理去噪、云去除、阴影去除、去模糊和SAR去斑的统一框架。我们框架的核心是一种新颖的任务引导提示（Task-Guided Prompting, TGP）策略。TGP利用可学习的、任务特定的嵌入来生成感知退化的提示，随后这些提示在解码器中分层调制特征。这种任务自适应机制使网络能够针对不同的退化模式精确调整其复原过程，同时仅维护一组共享权重。为验证我们的框架，我们构建了一个统一的RSIR基准测试集，涵盖RGB、多光谱、SAR和热红外模态，用于上述五种复原任务。实验结果表明，TGPNet在统一的多任务场景和未见过的复合退化情况下均达到了最先进的性能，甚至在云去除等单一领域超越了专用模型。通过成功地将异构退化去除统一在一个自适应框架内，本工作为多任务RSIR领域带来了重要进展，为业务化处理流程提供了一个实用且可扩展的解决方案。代码与基准测试集将在 https://github.com/huangwenwenlili/TGPNet 发布。

摘要 (Abstract)

Remote sensing image restoration (RSIR) is essential for recovering high-fidelity imagery from degraded observations, enabling accurate downstream analysis. However, most existing methods focus on single degradation types within homogeneous data, restricting their practicality in real-world scenarios where multiple degradations often across diverse spectral bands or sensor modalities, creating a significant operational bottleneck. To address this fundamental gap, we propose TGPNet, a unified framework capable of handling denoising, cloud removal, shadow removal, deblurring, and SAR despeckling within a single, unified architecture. The core of our framework is a novel Task-Guided Prompting (TGP) strategy. TGP leverages learnable, task-specific embeddings to generate degradation-aware cues, which then hierarchically modulate features throughout the decoder. This task-adaptive mechanism allows the network to precisely tailor its restoration process for distinct degradation patterns while maintaining a single set of shared weights. To validate our framework, we construct a unified RSIR benchmark covering RGB, multispectral, SAR, and thermal infrared modalities for five aforementioned restoration tasks. Experimental results demonstrate that TGPNet achieves state-of-the-art performance on both unified multi-task scenarios and unseen composite degradations, surpassing even specialized models in individual domains such as cloud removal. By successfully unifying heterogeneous degradation removal within a single adaptive framework, this work presents a significant advancement for multi-task RSIR, offering a practical and scalable solution for operational pipelines. The code and benchmark will be released at https://github.com/huangwenwenlili/TGPNet.

关键词: Remote Sensing Image Restoration, Unified Framework, Task-Guided Prompting, Multi-task Learning, Degradation Removal, Multi-modal Benchmark, Deep Learning, Computer Vision

193. ❌ ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

作者: Zihao Sheng, Xin Ye, Jingru Luo, Sikai Chen, Liu Ren 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02714v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于端到端自动驾驶，提出了一种结合世界建模和强化学习的VLA（Vision-Language-Action）框架。虽然涉及AI应用，但核心内容与提供的关键词高度不匹配：仅与’World Models AND General World Models’直接相关（评分10分），因为论文明确提出了密集世界建模（dense world modeling）方法，用于预测未来RGB和深度图像以支持探索和规划。其他关键词均未在标题或摘要中提及，且论文未涉及大语言模型（LLMs）、模型训练技术（如MoE、SFT、RLHF）、推理优化（如RAG、Attention机制）、代理系统或科学AI应用，因此评分为0分。

!!! tip deepseek-chat TL;DR

该论文针对端到端自动驾驶中模仿学习泛化能力不足的问题，提出了一种结合世界建模和强化学习的框架，通过生成未来图像作为密集监督和内在奖励，在NAVSIM和nuScenes基准上实现了最先进的性能。

摘要翻译

基于视觉-语言-动作（Vision-Language-Action, VLA）架构的端到端自动驾驶模型，通过在专家示范数据上进行行为克隆学习驾驶策略，已展现出有前景的结果。然而，模仿学习本质上将模型限制于复制已观察到的行为，无法探索多样化的驾驶策略，导致其在新颖或分布外场景中表现脆弱。强化学习（Reinforcement Learning, RL）提供了一种自然的补救方案，它允许策略在专家分布之外进行探索。但通常基于离线数据集训练的VLA模型缺乏可直接观测的状态转移，因此需要一个学习得到的世界模型来预测动作的后果。在本工作中，我们提出了一个统一的理解与生成框架，该框架利用世界建模来同时实现有意义的探索并提供密集的监督。具体而言，我们通过未来RGB图像和深度图像生成作为密集的世界建模目标来增强轨迹预测，这要求模型学习细粒度的视觉与几何表征，从而极大地丰富了规划主干网络。除了作为监督信号外，世界模型还进一步充当了策略探索的内在奖励来源：其图像预测的不确定性自然地衡量了轨迹相对于训练分布的新颖性，其中高不确定性指示了分布外场景，若该场景安全，则代表了宝贵的学习机会。我们将此探索信号整合到一个安全门控的奖励中，并通过组相对策略优化（Group Relative Policy Optimization, GRPO）来优化策略。在NAVSIM和nuScenes基准测试上的实验证明了我们方法的有效性，在NAVSIM上实现了93.7的PDMS分数和88.8的EPDMS分数，达到了最先进水平。代码与演示将在 https://zihaosheng.github.io/ExploreVLA/ 公开。

摘要 (Abstract)

End-to-end autonomous driving models based on Vision-Language-Action (VLA) architectures have shown promising results by learning driving policies through behavior cloning on expert demonstrations. However, imitation learning inherently limits the model to replicating observed behaviors without exploring diverse driving strategies, leaving it brittle in novel or out-of-distribution scenarios. Reinforcement learning (RL) offers a natural remedy by enabling policy exploration beyond the expert distribution. Yet VLA models, typically trained on offline datasets, lack directly observable state transitions, necessitating a learned world model to anticipate action consequences. In this work, we propose a unified understanding-and-generation framework that leverages world modeling to simultaneously enable meaningful exploration and provide dense supervision. Specifically, we augment trajectory prediction with future RGB and depth image generation as dense world modeling objectives, requiring the model to learn fine-grained visual and geometric representations that substantially enrich the planning backbone. Beyond serving as a supervisory signal, the world model further acts as a source of intrinsic reward for policy exploration: its image prediction uncertainty naturally measures a trajectory’s novelty relative to the training distribution, where high uncertainty indicates out-of-distribution scenarios that, if safe, represent valuable learning opportunities. We incorporate this exploration signal into a safety-gated reward and optimize the policy via Group Relative Policy Optimization (GRPO). Experiments on the NAVSIM and nuScenes benchmarks demonstrate the effectiveness of our approach, achieving a state-of-the-art PDMS score of 93.7 and an EPDMS of 88.8 on NAVSIM. The code and demo will be publicly available at https://zihaosheng.github.io/ExploreVLA/.

关键词: autonomous driving, world modeling, reinforcement learning, Vision-Language-Action, exploration, trajectory prediction, intrinsic reward, safety-gated reward

194. ❌ A Rapid Instrument Exchange System for Humanoid Robots in Minimally Invasive Surgery

作者: Bingcong Zhang, Yihang Lyv, Lianbo Ma, Yushi He, Pengfei Wei, Xingchi Liu, Jinhua Li, Jianchang Zhao, Lizhi Pan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02707v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于人形机器人手术中的器械快速交换系统，涉及机器人控制、机械设计、遥操作和实时感知等技术，但完全不涉及大模型、深度学习、AI算法或相关技术原理，所有关键词均与大模型/深度学习技术相关，与论文内容无任何关联。

!!! tip deepseek-chat TL;DR

本文提出了一种用于人形机器人微创手术的沉浸式遥操作快速器械交换系统，通过低延迟对接机制和实时第一人称视角感知，显著降低了操作复杂性并验证了在受限临床环境中稳定执行器械交换的技术可行性。

摘要翻译

人形机器人技术在微创手术领域展现出巨大潜力。与专用的多臂手术平台不同，人形机器人固有的双臂构型需要高效的器械交换能力以执行复杂手术步骤，从而模拟外科医生手动切换器械的自然工作流程。为此，本文提出一种沉浸式遥操作快速器械交换系统。该系统采用基于单轴顺应性对接与环境约束释放的低延迟机制，结合通过头戴式显示器实现的实时第一人称视角感知，显著降低了对接过程中的操作复杂性与认知负荷。专家与新手的对比评估表明，该系统具有较高的操作鲁棒性及快速收敛的学习曲线；新手经过简短训练后，在器械安装与拆卸任务中的表现得到显著提升。尽管远距离空间对准在时间成本与协作稳定性方面仍存在挑战，本研究成功验证了人形机器人在受限临床环境中执行稳定器械交换的技术可行性。

摘要 (Abstract)

Humanoid robot technologies have demonstrated immense potential for minimally invasive surgery (MIS). Unlike dedicated multi-arm surgical platforms, the inherent dual-arm configuration of humanoid robots necessitates an efficient instrument exchange capability to perform complex procedures, mimicking the natural workflow where surgeons manually switch instruments. To address this, this paper proposes an immersive teleoperated rapid instrument exchange system. The system utilizes a low-latency mechanism based on single-axis compliant docking and environmental constraint release. Integrated with real-time first-person view (FPV) perception via a head-mounted display (HMD), this framework significantly reduces operational complexity and cognitive load during the docking process. Comparative evaluations between experts and novices demonstrate high operational robustness and a rapidly converging learning curve; novice performance in instrument attachment and detachment improved substantially after brief training. While long-distance spatial alignment still presents challenges in time cost and collaborative stability, this study successfully validates the technical feasibility of humanoid robots executing stable instrument exchanges within constrained clinical environments.

关键词: humanoid robots, minimally invasive surgery, instrument exchange, teleoperated system, low-latency docking, first-person view perception, operational robustness, clinical environments

195. ❌ VBGS-SLAM: Variational Bayesian Gaussian Splatting Simultaneous Localization and Mapping

作者: Yuhan Zhu, Yanyu Zhang, Jie Xu, Wei Ren 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02696v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是3D高斯泼溅（3DGS）在SLAM中的应用，提出了一种变分贝叶斯框架来改进相机姿态跟踪和场景建模。所有评分关键词都涉及大语言模型（LLMs）及其相关技术（如训练、对齐、推理优化、应用等），而本文专注于计算机视觉中的SLAM和3D重建，完全不涉及语言模型、深度学习模型训练或AI在科学领域的应用。因此，所有关键词的相关度均为0。

!!! tip deepseek-chat TL;DR

本文提出了一种变分贝叶斯高斯泼溅SLAM框架（VBGS-SLAM），通过概率建模和变分推理来改进相机姿态跟踪和3D场景重建的鲁棒性，在长序列预测中表现出优越的跟踪性能和高质量的新视角合成。

摘要翻译

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）利用高斯混合模型在三维场景建模中展现出良好效果，然而其现有的同步定位与建图（SLAM）变体通常依赖于对泼溅地图的直接确定性位姿优化，导致其对初始化敏感，且随着地图更新易发生灾难性遗忘。我们提出变分贝叶斯高斯泼溅SLAM（Variational Bayesian Gaussian Splatting SLAM，简称VBGS-SLAM），这是一个新颖的框架，以生成式概率形式耦合了泼溅地图优化与相机位姿跟踪。通过利用多元高斯的共轭特性与变分推断，我们的方法实现了高效的闭式更新，并显式维护位姿与场景参数的后验不确定性。这种不确定性感知机制有效缓解了漂移问题，增强了在挑战性环境下的鲁棒性，同时保持了现有3DGS方法的效率与渲染质量。实验表明，我们的方法在长序列预测中具有更优的跟踪性能与鲁棒性，并在多样化的合成与真实场景中实现了高效且高质量的新视角合成。

摘要 (Abstract)

3D Gaussian Splatting (3DGS) has shown promising results for 3D scene modeling using mixtures of Gaussians, yet its existing simultaneous localization and mapping (SLAM) variants typically rely on direct, deterministic pose optimization against the splat map, making them sensitive to initialization and susceptible to catastrophic forgetting as map evolves. We propose Variational Bayesian Gaussian Splatting SLAM (VBGS-SLAM), a novel framework that couples the splat map refinement and camera pose tracking in a generative probabilistic form. By leveraging conjugate properties of multivariate Gaussians and variational inference, our method admits efficient closed-form updates and explicitly maintains posterior uncertainty over both poses and scene parameters. This uncertainty-aware method mitigates drift and enhances robustness in challenging conditions, while preserving the efficiency and rendering quality of existing 3DGS. Our experiments demonstrate superior tracking performance and robustness in long sequence prediction, alongside efficient, high-quality novel view synthesis across diverse synthetic and real-world scenes.

关键词: 3D Gaussian Splatting, SLAM, Variational Bayesian, Camera Pose Tracking, Scene Reconstruction, Uncertainty Estimation, Novel View Synthesis, Robustness

196. ❌ XrayClaw: Cooperative-Competitive Multi-Agent Alignment for Trustworthy Chest X-ray Diagnosis

作者: Shawn Young, Lijian Xu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02695v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文XrayClaw专注于医学影像诊断中的多智能体系统，核心贡献在于提出了一种合作-竞争的多智能体对齐框架，以提升胸部X光诊断的可信度。因此，与’AI for Science’（医学影像分析属于科学AI应用）、‘Multi-agent Systems’（多智能体系统）、‘LLM Agents’（智能体框架）、‘Hallucination Mitigation’（缓解诊断幻觉）和’Alignment’（智能体对齐）高度相关（10分）。其他关键词如大模型技术原理、训练方法、推理优化等，论文未涉及，故评0分。

!!! tip deepseek-chat TL;DR

该研究针对传统单体模型在胸部X光诊断中存在的逻辑不一致和诊断幻觉问题，提出了一种合作-竞争的多智能体对齐框架XrayClaw，通过引入竞争性偏好优化，在多个基准测试中实现了最先进的诊断准确性、临床推理保真度和零样本领域泛化性能。

摘要翻译

胸部X光片（Chest X-ray, CXR）解读是一项基础而复杂的临床任务，其自动化日益依赖人工智能技术。然而，传统的单体模型往往缺乏可信诊断所需的细致推理能力，常导致逻辑不一致和诊断幻觉问题。尽管多智能体系统通过模拟协作会诊提供了潜在的解决方案，但现有框架在由单一底层模型实例化时，仍易受基于共识的错误影响。本文提出XrayClaw，一种通过精巧的协作-竞争架构实现多智能体对齐的新型框架。XrayClaw整合了四个专业化的协作智能体以模拟系统性临床工作流程，同时引入一个竞争智能体作为独立审计者。为协调这些不同的诊断路径，我们提出竞争性偏好优化学习目标，通过强制分析性解读与整体性解读之间的相互验证，对不合逻辑的推理进行惩罚。在MS-CXR-T、MIMIC-CXR和CheXbench基准测试上的广泛实验评估表明，XrayClaw在诊断准确性、临床推理保真度和零样本领域泛化能力方面均达到最先进水平。我们的研究结果表明，XrayClaw能有效缓解累积性幻觉问题，提升自动化CXR诊断的整体可靠性，为可信赖的医学影像分析建立了新范式。

摘要 (Abstract)

Chest X-ray (CXR) interpretation is a fundamental yet complex clinical task that increasingly relies on artificial intelligence for automation. However, traditional monolithic models often lack the nuanced reasoning required for trustworthy diagnosis, frequently leading to logical inconsistencies and diagnostic hallucinations. While multi-agent systems offer a potential solution by simulating collaborative consultations, existing frameworks remain susceptible to consensus-based errors when instantiated by a single underlying model. This paper introduces XrayClaw, a novel framework that operationalizes multi-agent alignment through a sophisticated cooperative-competitive architecture. XrayClaw integrates four specialized cooperative agents to simulate a systematic clinical workflow, alongside a competitive agent that serves as an independent auditor. To reconcile these distinct diagnostic pathways, we propose Competitive Preference Optimization, a learning objective that penalizes illogical reasoning by enforcing mutual verification between analytical and holistic interpretations. Extensive empirical evaluations on the MS-CXR-T, MIMIC-CXR, and CheXbench benchmarks demonstrate that XrayClaw achieves state-of-the-art performance in diagnostic accuracy, clinical reasoning fidelity, and zero-shot domain generalization. Our results indicate that XrayClaw effectively mitigates cumulative hallucinations and enhances the overall reliability of automated CXR diagnosis, establishing a new paradigm for trustworthy medical imaging analysis.

关键词: multi-agent systems, medical imaging, chest X-ray diagnosis, hallucination mitigation, cooperative-competitive architecture, trustworthy AI, clinical reasoning, domain generalization

作者: Fuyuan Liu, Dianyu Yu, He Ren, Nayu Liu, Xiaomian Kang, Delai Qiu, Fa Zhang, Genpeng Zhen, Shengping Liu, Jiaen Liang, Wei Huang, Yining Wang, Junnan Zhu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02692v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于文档解析中的布局分析技术，提出了一种结构细化模块来稳定解析器接口。论文内容涉及计算机视觉、文档理解、DETR检测器、布局实例保留和排序等具体技术，但完全不涉及大语言模型、深度学习技术原理创新、AI for Science等关键词领域。所有关键词均与大模型技术、深度学习创新、科学AI应用无关，因此全部评分为0分。

!!! tip deepseek-chat TL;DR

该论文解决了文档解析中因布局假设不稳定导致的解析器接口不一致问题，通过引入结构细化模块来联合确定实例保留、优化框定位和预测解析器输入顺序，显著提高了页面级布局质量并减少了序列不匹配。

摘要翻译

精确的文档解析需要稳健的内容识别与稳定的解析器接口。在显式文档布局分析（DLA）流程中，下游解析器并不直接使用检测器的全部输出，而是基于一组经过保留与序列化的布局实例进行操作。然而，在区域重叠、边界模糊的密集页面上，不稳定的布局假设可能导致保留的实例集与其解析器输入顺序不一致，进而引发严重的下游解析错误。为解决此问题，我们在DETR风格检测器与解析器之间引入了一个轻量级结构优化阶段，以稳定解析器接口。该方法将原始检测器输出视为一个紧凑的假设池，通过查询特征、语义线索、框几何信息及视觉证据进行集合层面的推理。基于共享的优化结构状态，该模块在交接前联合确定实例保留、优化框定位并预测解析器输入顺序。我们进一步引入了面向保留的监督机制与难度感知的排序目标，以更好地将保留实例集及其顺序与最终解析器输入对齐，尤其在结构复杂的页面上。在公开基准测试上的大量实验表明，我们的方法持续提升了页面级布局质量。当集成至标准端到端解析流程时，稳定的解析器接口也显著减少了序列不匹配问题，在OmniDocBench上实现了0.024的阅读顺序编辑距离。

摘要 (Abstract)

Accurate document parsing requires both robust content recognition and a stable parser interface. In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output. Instead, they operate on a retained and serialized set of layout instances. However, on dense pages with overlapping regions and ambiguous boundaries, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, leading to severe downstream parsing errors. To address this issue, we introduce a lightweight structural refinement stage between a DETR-style detector and the parser to stabilize the parser interface. Treating raw detector outputs as a compact hypothesis pool, the proposed module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence. From a shared refined structural state, it jointly determines instance retention, refines box localization, and predicts parser input order before handoff. We further introduce retention-oriented supervision and a difficulty-aware ordering objective to better align the retained instance set and its order with the final parser input, especially on structurally complex pages. Extensive experiments on public benchmarks show that our method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface also substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.

关键词: document parsing, layout analysis, structural refinement, parser interface, DETR detector, instance retention, reading order, OmniDocBench

198. ❌ Drift-Resilient Temporal Priors for Visual Tracking

作者: Yuqing Huang, Liting Lin, Weijun Zhuang, Zhenyu He, Xin Li 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02654v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文《Drift-Resilient Temporal Priors for Visual Tracking》专注于计算机视觉中的视觉跟踪任务，提出了一种名为DTPTrack的模块来抑制模型漂移。论文内容涉及视觉跟踪、时序信息处理、模型集成等计算机视觉技术，但完全不涉及大语言模型（LLM）、深度学习技术原理创新或AI在科学领域的应用。所有评分关键词均与大语言模型、深度学习技术原理或AI for Science相关，而本文是纯粹的计算机视觉研究，因此所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文针对视觉跟踪中多帧跟踪器因历史预测噪声聚合导致的模型漂移问题，提出了一种轻量级通用模块DTPTrack，通过时序可靠性校准和时序引导合成来抑制漂移，在多个基准测试中显著提升了现有跟踪器的性能并达到了新的最先进水平。

摘要翻译

时间信息对于视觉跟踪至关重要，但现有的多帧跟踪器容易因简单聚合带噪声的历史预测而产生模型漂移。本文提出DTPTrack，这是一种轻量级且可泛化的模块，旨在无缝集成到现有跟踪器中以抑制漂移。我们的框架包含两个核心组件：(1) 时间可靠性校准器（Temporal Reliability Calibrator, TRC）机制，该机制学习为历史状态分配逐帧可靠性分数，在锚定真实模板的同时滤除噪声；(2) 时间引导合成器（Temporal Guidance Synthesizer, TGS）模块，将校准后的历史信息合成为一组紧凑的动态时间先验，以提供预测性引导。为验证其通用性，我们将DTPTrack集成到三种不同的跟踪架构——OSTrack、ODTrack和LoRAT中，并在所有基线模型上均取得了一致且显著的性能提升。基于扩展的LoRATv2主干网络构建的最佳性能模型，在多个基准测试中创造了新的最优结果，在LaSOT数据集上取得了77.5%的成功率，在GOT-10k数据集上取得了80.3%的平均重叠率（AO）。

摘要 (Abstract)

Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures–OSTrack, ODTrack, and LoRAT-and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state-of-the-art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k.

关键词: visual tracking, temporal priors, model drift, DTPTrack, temporal reliability calibration, temporal guidance synthesis, LoRAT, state-of-the-art

199. ❌ Wavelength-multiplexed massively parallel diffractive optical information storage and image projection

作者: Che-Yung Shen, Yuhang Li, Cagatay Isil, Jingxi Li, Leon Lenk, Tianyi Gan, Guangdong Ma, Fazil Onuralp Ardic, Mona Jarrahi, Aydogan Ozcan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02624v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究的是基于深度学习的波长复用衍射光学信息存储和图像投影系统，属于光学工程和计算成像领域。所有关键词都聚焦于大语言模型（LLM）及其相关技术（如训练方法、推理优化、对齐、应用等），而本文的核心是光学物理系统设计，仅使用深度学习进行结构优化，与LLM技术完全无关。唯一略有相关的是’AI for Science’，因为论文将深度学习应用于科学（光学）问题，但并非LLM在科学领域的应用，因此给5分（有一定关联）。其他关键词均得0分（完全无关）。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于深度学习优化的波长复用衍射光学平台，用于大规模并行光学信息存储和图像投影，通过数值模拟和实验验证了系统可存储和投影数千个独立图像模式，具有高图像质量和低串扰。

摘要翻译

我们提出了一种波长复用的大规模并行衍射信息存储平台，该平台由介电表面构成，通过深度学习在波长尺度上对其结构进行优化，以存储并投射数千个不同的图像模式，每种模式均分配有唯一的波长。通过在可见光谱范围内的数值模拟，我们证明了这种波长复用衍射系统能够在其输出视场内存储并投射超过4000个独立的目标图像/模式，且图像质量高，光谱通道间的串扰极小。此外，在一个概念验证实验中，我们展示了一个双层衍射设计，该设计存储了六种不同的模式，并在六个不同波长（500、548、596、644、692和740纳米）下将它们投射到同一输出视场中。这种衍射架构具有良好的可扩展性，能够在电磁频谱的不同波段工作，而无需进行材料色散工程或重新设计其优化的衍射层。该衍射平台所展示的存储容量、重建图像保真度以及波长编码的大规模并行读取功能，为大规模光学信息存储和图像投影应用提供了一种紧凑且快速访问的解决方案。

摘要 (Abstract)

We introduce a wavelength-multiplexed massively parallel diffractive information storage platform composed of dielectric surfaces that are structurally optimized at the wavelength scale using deep learning to store and project thousands of distinct image patterns, each assigned to a unique wavelength. Through numerical simulations in the visible spectrum, we demonstrated that our wavelength-multiplexed diffractive system can store and project over 4,000 independent desired images/patterns within its output field-of-view, with high image quality and minimal crosstalk between spectral channels. Furthermore, in a proof-of-concept experiment, we demonstrated a two-layer diffractive design that stored six distinct patterns and projected them onto the same output field of view at six different wavelengths (500, 548, 596, 644, 692, and 740 nm). This diffractive architecture is scalable and can operate at various parts of the electromagnetic spectrum without the need for material dispersion engineering or redesigning its optimized diffractive layers. The demonstrated storage capacity, reconstruction image fidelity, and wavelength-encoded massively parallel read-out of our diffractive platform offer a compact and fast-access solution for large-scale optical information storage, image projection applications.

关键词: wavelength-multiplexed, diffractive optics, optical information storage, image projection, deep learning optimization, massively parallel, dielectric surfaces, spectral channels

200. ❌ Unlocking Multi-Site Clinical Data: A Federated Approach to Privacy-First Child Autism Behavior Analysis

作者: Guangyu Sun, Wenhan Wu, Zhishuai Guo, Ziteng Wang, Pegah Khosravi, Chen Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02616v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于使用联邦学习进行儿童自闭症行为识别，属于AI在医疗领域的应用（AI for Science），但未涉及大模型、深度学习技术原理创新或其他关键词。所有其他关键词（0-26）均与大模型技术、训练方法、推理优化、对齐、代理等直接相关，而本文研究的是基于姿态数据的传统计算机视觉联邦学习应用，与这些技术无关。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于联邦学习的隐私保护框架，用于多站点儿童自闭症行为识别，通过骨骼抽象和联邦学习解决了临床数据隐私和稀缺问题，在MMASD基准上实现了高精度识别。

摘要翻译

儿童自闭症行为的自动识别对于早期干预和客观临床评估至关重要。然而，严格的隐私法规（如HIPAA）及儿科数据的敏感性阻碍了临床数据集的集中整合，这严重制约了鲁棒模型的开发。此外，单个临床站点常面临数据稀缺问题，难以学习泛化的行为模式或使模型适应站点特定的患者分布。为应对这些挑战，我们观察到联邦学习（FL）能够将模型训练与原始数据访问解耦，在维持严格数据驻留的同时实现多站点协作。本文首次探索了基于姿态的儿童自闭症行为识别的联邦学习研究。我们提出的框架采用双层隐私保护机制：利用人体骨骼抽象从原始RGB视频中移除可识别视觉信息，并通过联邦学习确保敏感姿态数据始终保留在诊所内部。该方法通过分布式临床数据学习泛化表征，同时为站点特异性个性化提供灵活性。在MMASD基准测试上的实验结果表明，我们的框架实现了高识别准确率，优于传统联邦基线方法，为多站点临床分析提供了一个鲁棒的、隐私优先的解决方案。

摘要 (Abstract)

Automated recognition of autistic behaviors in children is essential for early intervention and objective clinical assessment. However, the development of robust models is severely hindered by strict privacy regulations (e.g., HIPAA) and the sensitive nature of pediatric data, which prevents the centralized aggregation of clinical datasets. Furthermore, individual clinical sites often suffer from data scarcity, making it difficult to learn generalized behavior patterns or tailor models to site-specific patient distributions. To address these challenges, we observe that Federated Learning (FL) can decouple model training from raw data access, enabling multi-site collaboration while maintaining strict data residency. In this paper, we present the first study exploring Federated Learning for pose-based child autism behavior recognition. Our framework employs a two-layer privacy protection mechanism: utilizing human skeletal abstraction to remove identifiable visual information from the raw RGB videos and FL to ensure sensitive pose data remains within the clinic. This approach leverages distributed clinical data to learn generalized representations while providing the flexibility for site-specific personalization. Experimental results on the MMASD benchmark demonstrate that our framework achieves high recognition accuracy, outperforming traditional federated baselines and providing a robust, privacy-first solution for multi-site clinical analysis.

关键词: Federated Learning, Autism Behavior Recognition, Privacy Protection, Clinical Data Analysis, Pose-based Recognition, Multi-site Collaboration, Data Scarcity, Pediatric Healthcare

201. ❌ Rascene: High-Fidelity 3D Scene Imaging with mmWave Communication Signals

作者: Kunzhe Song, Geo Jie Zhou, Xiaoming Liu, Huacheng Zeng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02603v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文Rascene专注于利用毫米波通信信号进行3D场景成像，属于集成感知与通信（ISAC）领域。所有评分关键词均与大模型、深度学习技术原理或AI在科学领域的应用相关，而本文研究的是信号处理、无线通信和计算机视觉中的3D重建问题，未涉及任何大模型、深度学习或AI for Science相关内容。因此，所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

本文提出了一种名为Rascene的集成感知与通信框架，利用毫米波OFDM通信信号实现高精度3D场景成像，以解决恶劣环境下传统光学传感器失效的问题。

摘要翻译

鲁棒的三维环境感知对于自动驾驶与机器人导航等应用至关重要。然而，相机与激光雷达等光学传感器在烟雾、雾霾及非理想光照等恶劣条件下常会失效。尽管专用雷达系统能在这些环境中工作，但其对定制硬件与授权频谱的依赖限制了可扩展性与成本效益。本文提出Rascene，一种集成感知与通信（ISAC）框架，该框架利用无处不在的毫米波正交频分复用通信信号进行三维场景成像。为克服单个无线电帧固有的稀疏性与多径模糊问题，Rascene通过置信度加权前向投影实现多帧、空间自适应的融合，从而能够在任意位姿间恢复几何一致性。实验结果表明，我们的方法能够高精度重建三维场景，为低成本、可扩展且鲁棒的三维感知提供了新路径。

摘要 (Abstract)

Robust 3D environmental perception is critical for applications such as autonomous driving and robot navigation. However, optical sensors such as cameras and LiDAR often fail under adverse conditions, including smoke, fog, and non-ideal lighting. Although specialized radar systems can operate in these environments, their reliance on bespoke hardware and licensed spectrum limits scalability and cost-effectiveness. This paper introduces Rascene, an integrated sensing and communication (ISAC) framework that leverages ubiquitous mmWave OFDM communication signals for 3D scene imaging. To overcome the sparse and multipath-ambiguous nature of individual radio frames, Rascene performs multi-frame, spatially adaptive fusion with confidence-weighted forward projection, enabling the recovery of geometric consensus across arbitrary poses. Experimental results demonstrate that our method reconstructs 3D scenes with high precision, offering a new pathway toward low-cost, scalable, and robust 3D perception.

关键词: 3D scene imaging, mmWave communication, Integrated Sensing and Communication (ISAC), autonomous driving, robot navigation, multi-frame fusion, forward projection, environmental perception

202. ❌ Hierarchical Planning with Latent World Models

作者: Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03208v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于机器人控制中的分层规划与潜在世界模型，仅与’World Models AND General World Models’高度相关（10分），因为其核心是学习多时间尺度的潜在世界模型进行分层规划。其他关键词均未涉及，因为论文不讨论大语言模型、训练技术、推理方法、对齐、压缩、幻觉缓解、科学AI应用等主题。

!!! tip deepseek-chat TL;DR

该论文解决了学习世界模型在长时域控制中因预测误差累积和搜索空间指数增长而面临的挑战，通过提出分层规划与多尺度潜在世界模型的方法，在真实机器人任务和模拟环境中实现了更高的成功率和更低的规划计算成本。

摘要翻译

基于学习世界模型的模型预测控制已成为具身控制领域一种前景广阔的研究范式，其在新环境中部署时表现出的零样本泛化能力尤为突出。然而，由于预测误差的累积和搜索空间的指数级增长，学习得到的世界模型通常在长时域控制任务中面临困难。本研究通过在多时间尺度上学习潜在世界模型，并在这些尺度上进行分层规划，以应对上述挑战；该方法在显著降低推理阶段规划复杂度的同时，实现了长时域推理。我们的框架作为一种模块化的规划抽象层，可适用于多种潜在世界模型架构与任务领域。实验证明，这种分层方法能够在真实世界的非贪婪型机器人任务中实现零样本控制：在仅给定最终目标描述的情况下，分拣放置任务的成功率达到70%，而单层世界模型的成功率仅为0%。此外，在包括推动操作与迷宫导航在内的多个基于物理的仿真环境中，分层规划在取得更高成功率的同时，所需的规划时间计算量最多可减少至四分之一。

摘要 (Abstract)

Model predictive control (MPC) with learned world models has emerged as a promising paradigm for embodied control, particularly for its ability to generalize zero-shot when deployed in new environments. However, learned world models often struggle with long-horizon control due to the accumulation of prediction errors and the exponentially growing search space. In this work, we address these challenges by learning latent world models at multiple temporal scales and performing hierarchical planning across these scales, enabling long-horizon reasoning while substantially reducing inference-time planning complexity. Our approach serves as a modular planning abstraction that applies across diverse latent world-model architectures and domains. We demonstrate that this hierarchical approach enables zero-shot control on real-world non-greedy robotic tasks, achieving a 70% success rate on pick-&-place using only a final goal specification, compared to 0% for a single-level world model. In addition, across physics-based simulated environments including push manipulation and maze navigation, hierarchical planning achieves higher success while requiring up to 4x less planning-time compute.

关键词: hierarchical planning, latent world models, model predictive control, long-horizon control, zero-shot generalization, robotic tasks, planning complexity, multi-temporal scales

203. ❌ A Tsetlin Machine-driven Intrusion Detection System for Next-Generation IoMT Security

作者: Rahul Jaiswal, Per-Arne Andersen, Linga Reddy Cenkeramaddi, Lei Jiao, Ole-Christoffer Granmo 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03205v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文提出了一种基于Tsetlin Machine（TM）的入侵检测系统（IDS），用于检测针对IoMT网络的网络攻击。TM是一种基于规则、可解释的机器学习方法，使用命题逻辑建模攻击模式。论文与大多数关键词（涉及大模型、深度学习技术原理、训练方法、推理优化、智能体等）完全无关，因为这些关键词主要针对大语言模型和深度学习技术，而TM是一种与传统深度学习不同的机器学习方法。仅与’Mechanistic Interpretability OR Explainable AI’有一定关联（5分），因为论文强调了TM的可解释性，提供了类别投票分数和子句激活热图来增强模型信任和可解释性。与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分），因为论文应用于医疗物联网（IoMT）安全，属于AI在生物医学/医疗领域的应用，但并非核心的生物信息学或化学信息学。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于Tsetlin Machine的可解释入侵检测系统，用于保护医疗物联网（IoMT）网络，在CICIoMT-2024数据集上实现了99.5%的二元分类准确率和90.7%的多类分类准确率，优于现有方法。

摘要翻译

医疗物联网的快速普及通过实现医疗设备、系统与服务间的无缝连接，正在重塑医疗健康领域。然而，随着攻击者日益利用新方法和新兴漏洞渗透医疗物联网网络，这也引发了严重的网络安全与患者安全问题。本文提出了一种基于新型特斯林机的入侵检测系统，用于检测针对医疗物联网网络的各种网络攻击。特斯林机是一种基于规则且可解释的机器学习方法，它使用命题逻辑对攻击模式进行建模。在包含多种医疗物联网协议和网络攻击类型的CICIoMT-2024数据集上进行的大量实验表明，所提出的基于特斯林机的入侵检测系统性能优于传统机器学习分类器。该模型在二分类任务中达到了99.5%的准确率，在多分类任务中达到了90.7%的准确率，超越了现有的先进方法。此外，为增强模型可信度与可解释性，所提出的基于特斯林机的模型提供了按类别划分的投票分数和子句激活热力图，清晰揭示了最具影响力的子句以及对最终模型决策起主导作用的类别，从而为模型决策机制提供了直观洞察。

摘要 (Abstract)

The rapid adoption of the Internet of Medical Things (IoMT) is transforming healthcare by enabling seamless connectivity among medical devices, systems, and services. However, it also introduces serious cybersecurity and patient safety concerns as attackers increasingly exploit new methods and emerging vulnerabilities to infiltrate IoMT networks. This paper proposes a novel Tsetlin Machine (TM)-based Intrusion Detection System (IDS) for detecting a wide range of cyberattacks targeting IoMT networks. The TM is a rule-based and interpretable machine learning (ML) approach that models attack patterns using propositional logic. Extensive experiments conducted on the CICIoMT-2024 dataset, which includes multiple IoMT protocols and cyberattack types, demonstrate that the proposed TM-based IDS outperforms traditional ML classifiers. The proposed model achieves an accuracy of 99.5% in binary classification and 90.7% in multi-class classification, surpassing existing state-of-the-art approaches. Moreover, to enhance model trust and interpretability, the proposed TM-based model presents class-wise vote scores and clause activation heatmaps, providing clear insights into the most influential clauses and the dominant class contributing to the final model decision.

关键词: Tsetlin Machine, Intrusion Detection System, IoMT Security, Interpretable Machine Learning, CICIoMT-2024 Dataset, Cyberattack Detection, Propositional Logic, Healthcare Cybersecurity

204. ❌ Real-Time Surrogate Modeling for Personalized Blood Flow Prediction and Hemodynamic Analysis

作者: Sokratis J. Anagnostopoulos, George Rovas, Vasiliki Bikia, Theodore G. Papaioannou, Athanase D. Protogerou, Nikolaos Stergiopulos 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03197v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于心血管建模的机器学习应用，使用深度神经网络作为替代模型进行血流动力学预测和参数估计。虽然属于AI在科学领域的应用（生物医学/心血管），但论文未涉及任何大语言模型（LLM）、深度学习技术原理创新或评分关键词中的具体技术（如MoE、Scaling Laws、微调方法等）。仅与最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联，因为论文应用机器学习解决生物医学问题，但未强调大模型或深度学习技术本身的创新。其他关键词均与论文内容无关。

!!! tip deepseek-chat TL;DR

该研究开发了一个基于深度神经网络的替代模型框架，用于实时预测个性化血流动力学参数和筛选生理学参数，以降低心血管疾病建模中合成数据生成的成本并解决逆问题估计。

摘要翻译

心血管建模在过去几十年中因健康监测与心血管疾病早期诊断的需求增长而迅速发展。一维动脉模型在计算效率与求解保真度之间提供了理想的折衷方案，但其在大型人群中的应用或生成大规模\emph{in silico}（计算机模拟）队列仍面临挑战。某些血流动力学参数（如终端阻力/顺应性）难以通过临床手段准确估计，且若采用简单随机采样常会产生非生理性血流动力学结果，导致大量模拟数据被废弃。本研究提出一个系统性框架，用于训练能够实现瞬时血流动力学预测与参数估计的机器学习模型。我们首先基于大规模阿斯克勒庇俄斯临床数据集中观察到的多变量相关性，生成参数化的虚拟患者队列，确保其符合生理性参数分布。随后训练一个深度神经代理模型，该模型能够预测患者特异性动脉血压与心输出量，从而实现对输入参数的快速先验筛选。这能即时排除非生理性参数组合，并显著降低目标合成数据集（如高血压人群）的生成成本。该模型还提供了终端阻力的规范化采样方法，以最小化不可测量参数的不确定性。此外，通过评估模型的预测性能，我们确定了足以解决心输出量估计这一逆问题的理论信息量。最后，我们将该代理模型应用于临床数据集，以估计中心主动脉血流动力学参数，即心输出量与主动脉收缩压。

摘要 (Abstract)

Cardiovascular modeling has rapidly advanced over the past few decades due to the rising needs for health tracking and early detection of cardiovascular diseases. While 1-D arterial models offer an attractive compromise between computational efficiency and solution fidelity, their application on large populations or for generating large \emph{in silico} cohorts remains challenging. Certain hemodynamic parameters like the terminal resistance/compliance, are difficult to clinically estimate and often yield non-physiological hemodynamics when sampled naively, resulting in large portions of simulated datasets to be discarded. In this work, we present a systematic framework for training machine learning (ML) models, capable of instantaneous hemodynamic prediction and parameter estimation. We initially start with generating a parametric virtual cohort of patients which is based on the multivariate correlations observed in the large Asklepios clinical dataset, ensuring that physiological parameter distributions are respected. We then train a deep neural surrogate model, able to predict patient-specific arterial pressure and cardiac output (CO), enabling rapid a~priori screening of input parameters. This allows for immediate rejection of non-physiological combinations and drastically reduces the cost of targeted synthetic dataset generation (e.g. hypertensive groups). The model also provides a principled means of sampling the terminal resistance to minimize the uncertainties of unmeasurable parameters. Moreover, by assessing the model’s predictive performance we determine the theoretical information which suffices for solving the inverse problem of estimating the CO. Finally, we apply the surrogate on a clinical dataset for the estimation of central aortic hemodynamics i.e. the CO and aortic systolic blood pressure (cSBP).

关键词: cardiovascular modeling, hemodynamic prediction, deep neural surrogate model, parameter estimation, synthetic dataset generation, inverse problem, clinical dataset, aortic hemodynamics

205. ❌ DSBD: Dual-Aligned Structural Basis Distillation for Graph Domain Adaptation

作者: Yingxu Wang, Kunyu Zhang, Jiaxin Huang, Mengzhu Wang, Mingyan Xiao, Siyang Gao, Nan Yin 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03154v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文专注于图神经网络（GNN）的领域自适应（Domain Adaptation），特别是图结构对齐问题，与大多数大模型（LLM）相关关键词无关。唯一相关的是’Pre-training OR Continual Pre-training OR Domain Adaptation’，因为论文核心是图领域自适应（GDA），属于领域自适应（Domain Adaptation）的子领域，但论文未涉及预训练或持续预训练，因此给8分（有一定关联，非核心）。其他关键词均涉及大模型技术、推理、对齐、压缩等，与论文的图神经网络和结构对齐研究无直接关系。

!!! tip deepseek-chat TL;DR

该论文针对图领域自适应中结构差异导致知识迁移不可靠的问题，提出了双对齐结构基蒸馏（DSBD）框架，通过几何和谱一致性对齐跨域结构，实验表明其优于现有方法。

摘要翻译

图域适应（Graph Domain Adaptation, GDA）旨在解决分布偏移下将知识从带标签的源图迁移至无标签目标图的问题。然而，现有方法大多以特征为中心，忽视了结构差异，这在显著的拓扑偏移下尤为不利。此类差异会同时改变几何关系与谱特性，导致图神经网络（Graph Neural Networks, GNNs）的迁移不可靠。为应对这一局限，我们提出一种新颖的GDA框架——双对齐结构基蒸馏（Dual-Aligned Structural Basis Distillation, DSBD），该框架显式建模并适应跨域结构变化。DSBD通过合成连续的概率原型图构建可微分的结构基，从而实现对图拓扑的基于梯度的优化。该结构基在源域监督下学习以保持语义判别性，同时通过双对齐目标显式对齐至目标域。具体而言，我们通过置换不变的拓扑矩匹配强制几何一致性，并通过狄利克雷能量校准实现谱一致性，从而共同捕捉跨域的结构特性。此外，我们引入一种解耦推理范式，通过在蒸馏所得的结构基上训练新的GNN来减轻源域特有的结构偏差。在图与图像基准上的大量实验表明，DSBD在性能上持续优于现有最先进方法。

摘要 (Abstract)

Graph domain adaptation (GDA) aims to transfer knowledge from a labeled source graph to an unlabeled target graph under distribution shifts. However, existing methods are largely feature-centric and overlook structural discrepancies, which become particularly detrimental under significant topology shifts. Such discrepancies alter both geometric relationships and spectral properties, leading to unreliable transfer of graph neural networks (GNNs). To address this limitation, we propose Dual-Aligned Structural Basis Distillation (DSBD) for GDA, a novel framework that explicitly models and adapts cross-domain structural variation. DSBD constructs a differentiable structural basis by synthesizing continuous probabilistic prototype graphs, enabling gradient-based optimization over graph topology. The basis is learned under source-domain supervision to preserve semantic discriminability, while being explicitly aligned to the target domain through a dual-alignment objective. Specifically, geometric consistency is enforced via permutation-invariant topological moment matching, and spectral consistency is achieved through Dirichlet energy calibration, jointly capturing structural characteristics across domains. Furthermore, we introduce a decoupled inference paradigm that mitigates source-specific structural bias by training a new GNN on the distilled structural basis. Extensive experiments on graph and image benchmarks demonstrate that DSBD consistently outperforms state-of-the-art methods.

关键词: Graph Domain Adaptation, Structural Discrepancies, Dual-Aligned Structural Basis Distillation, Geometric Consistency, Spectral Consistency, Graph Neural Networks, Topology Shifts, Probabilistic Prototype Graphs

206. ❌ HyperFitS – Hypernetwork Fitting Spectra for metabolic quantification of ${}^1$H MR spectroscopic imaging

作者: Paul J. Weiser, Gulnur Ungan, Amirmohammad Shamaei, Georg Langs, Wolfgang Bogner, Malte Hoffmann, Antoine Klauser, Ovidiu C. Andronesi 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03150v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文专注于医学影像分析中的代谢物量化问题，提出了一种基于超网络的深度学习模型HyperFitS，用于快速光谱拟合。所有关键词均与大模型、深度学习技术原理或通用AI方法相关，而本文是特定领域的应用研究，未涉及大模型、MoE、缩放定律、训练对齐、推理优化、智能体、模型压缩等通用技术。唯一相关的是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于AI在生物医学（磁共振波谱成像）领域的应用，但并非核心创新点，只是应用场景，因此给5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该研究解决了质子磁共振波谱成像中代谢物量化耗时的问题，提出了一种可配置的超网络模型HyperFitS，能在几秒内完成全脑代谢图谱的量化，结果与金标准方法高度一致，且无需重新训练即可适应多种采集协议。

摘要翻译

目的：质子磁共振波谱成像（$^1$H MRSI）能够实现全脑代谢物浓度的活体图谱绘制。然而，其临床应用面临的一个长期问题是代谢物定量分析，这通常需要耗费大量时间进行谱线拟合。近年来，深度学习方法已能在数秒内提供全脑代谢物定量结果。然而，神经网络实现通常缺乏可配置性，且需要重新训练才能更改预定义的参数设置。方法：我们提出HyperFitS，一种用于全脑$^1$H MRSI代谢物定量谱线拟合的超网络，它能灵活适应广泛的基线校正和水抑制因子设置。利用HyperFitS对通过水抑制与非水抑制MRSI在3T和7T场强下以10毫米、3.4毫米和2毫米各向同性分辨率采集的人类受试者代谢物图谱进行定量分析，并与传统LCModel拟合结果进行比较。结果：代谢物图谱显示新方法与金标准方法具有高度一致性，且HyperFitS的拟合时间显著缩短。定量结果进一步凸显了基线参数化对代谢物定量的影响，该因素可使结果差异高达30%。结论：HyperFitS与当前最先进的传统方法表现出高度一致性，同时将处理时间从数小时缩短至数秒。与先前基于深度学习的谱线拟合方法相比，HyperFitS具备广泛的可配置性，无需重新训练即可适应多种采集协议和场强下获取的数据质量。

摘要 (Abstract)

Purpose: Proton magnetic resonance spectroscopic imaging ($^1$H MRSI) enables the mapping of whole-brain metabolites concentrations in-vivo. However, a long-standing problem for its clinical applicability is the metabolic quantification, which can require extensive time for spectral fitting. Recently, deep learning methods have been able to provide whole-brain metabolic quantification in only a few seconds. However, neural network implementations often lack configurability and require retraining to change predefined parameter settings. Methods: We introduce HyperFitS, a hypernetwork for spectral fitting for metabolite quantification in whole-brain $^1$H MRSI that flexibly adapts to a broad range of baseline corrections and water suppression factors. Metabolite maps of human subjects acquired at 3T and 7T with isotropic resolutions of 10 mm, 3.4 mm and 2 mm by water-suppressed and water-unsuppressed MRSI were quantified with HyperFitS and compared to conventional LCModel fitting. Results: Metabolic maps show a substantial agreement between the new and gold-standard methods, with significantly faster fitting times by HyperFitS. Quantitative results further highlight the impact of baseline parametrization on metabolic quantification, which can alter results by up to 30%. Conclusion: HyperFitS shows strong agreement with state-of-the-art conventional methods, while reducing processing times from hours to a few seconds. Compared to prior deep learning based spectral fitting methods, HyperFitS enables a wide range of configurability and can adapt to data quality acquired with multiple protocols and field strengths without retraining.

关键词: Hypernetwork, Spectral fitting, Metabolic quantification, Magnetic resonance spectroscopic imaging, Deep learning, Whole-brain metabolites, Fast processing, Configurable model

207. ❌ Characterization of Gaussian Universality Breakdown in High-Dimensional Empirical Risk Minimization

作者: Chiheb Yaakoubi, Cosme Louart, Malik Tiomoko, Zhenyu Liao 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03146v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究高维凸经验风险最小化（ERM）在非高斯数据设计下的统计特性，属于统计学习理论和数学统计领域。论文内容涉及凸优化、高斯极小极大定理扩展、渐近分析等理论方法，与所有评分关键词（均围绕大模型、深度学习技术及其应用）完全无关。论文未涉及任何大模型架构、训练方法、推理优化、对齐技术、代理系统或科学AI应用。

!!! tip deepseek-chat TL;DR

该论文研究了高维凸经验风险最小化在非高斯数据下的渐近统计特性，推导了估计量的均值和协方差的近似表达式，并揭示了高斯普适性在ERM中的适用范围和局限性。

摘要翻译

我们研究一般非高斯数据设计下的高维凸经验风险最小化（ERM）问题。通过启发式地将凸高斯最小最大定理（CGMT）推广至非高斯设定，我们推导出关键统计量的渐近最小最大刻画，从而能够近似ERM估计量$\hatθ$的均值$μ_{\hatθ}$与协方差$C_{\hatθ}$。具体而言，在数据矩阵满足集中性假设且损失函数与正则项满足标准正则性条件下，我们证明：对于一个独立于训练数据的测试协变量$x$，其投影$\hatθ^\top x$近似服从$μ_{\hatθ}^\top x$（通常为非高斯分布）与一个独立、方差为$\text{Tr}(C_{\hatθ}\mathbb{E}[xx^\top])$的零均值高斯变量的卷积分布。这一结果阐明了ERM问题中高斯普适性的适用范围与局限性。此外，我们证明任何$\mathcal{C}^2$类正则项在渐近意义上均等价于一个二次型，该二次型完全由其在零点处的海森矩阵及其在$μ_{\hatθ}$处的梯度决定。文中提供了多种损失函数与模型的数值模拟，以验证我们的理论预测与定性结论。

摘要 (Abstract)

We study high-dimensional convex empirical risk minimization (ERM) under general non-Gaussian data designs. By heuristically extending the Convex Gaussian Min-Max Theorem (CGMT) to non-Gaussian settings, we derive an asymptotic min-max characterization of key statistics, enabling approximation of the mean $μ_{\hatθ}$ and covariance $C_{\hatθ}$ of the ERM estimator $\hatθ$. Specifically, under a concentration assumption on the data matrix and standard regularity conditions on the loss and regularizer, we show that for a test covariate $x$ independent of the training data, the projection $\hatθ^\top x$ approximately follows the convolution of the (generally non-Gaussian) distribution of $μ_{\hatθ}^\top x$ with an independent centered Gaussian variable of variance $\text{Tr}(C_{\hatθ}\mathbb{E}[xx^\top])$. This result clarifies the scope and limits of Gaussian universality for ERMs. Additionally, we prove that any $\mathcal{C}^2$ regularizer is asymptotically equivalent to a quadratic form determined solely by its Hessian at zero and gradient at $μ_{\hatθ}$. Numerical simulations across diverse losses and models are provided to validate our theoretical predictions and qualitative insights.

关键词: Empirical Risk Minimization, High-dimensional Statistics, Convex Optimization, Gaussian Universality, Non-Gaussian Data, Asymptotic Analysis, Convex Gaussian Min-Max Theorem

208. ❌ SkillRT: Compiling Skills for Efficient Execution Everywhere

作者: Le Chen, Erhu Feng, Yubin Xia, Haibo Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03088v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文SkillRT专注于LLM Agents的技能执行优化，与’LLM Agents’高度相关（10分），因为核心研究就是提升agent技能的可移植性和执行效率。与’Tool Use’有一定关联（5分），因为技能可以视为工具的一种形式，但论文更侧重于技能作为可组合单元的系统级优化，而非具体的工具调用机制。与’Inference Acceleration’有一定关联（5分），因为SkillRT通过编译优化、JIT代码固化和并行化实现了高达3.2倍的加速和19-50倍的延迟降低，这属于推理加速范畴。与’Large Language Models’高度相关（10分），因为论文明确以LLM为处理器，并在八个不同规模的LLM上进行了评估。其他关键词如MoE、SFT、RAG、CoT等均未在论文中涉及，故得0分。

!!! tip deepseek-chat TL;DR

该论文针对LLM Agents中技能可移植性差和执行效率低的问题，提出了SkillRT编译和运行时系统，通过能力分析、环境绑定和并发优化，显著提高了任务完成率并降低了token消耗和延迟。

摘要翻译

大型语言模型智能体正日益将技能作为可复用的组合单元。尽管技能在不同智能体平台间共享，现有系统仍将其视为原始上下文处理，导致同一技能在不同智能体间表现不一致。这种脆弱性损害了技能的可移植性与执行效率。
为应对这一挑战，我们分析了118,000项技能，并从传统编译器设计中汲取灵感。我们将技能视为代码，将大型语言模型视作异构处理器。为实现可移植性，我们将技能需求分解为一组基础能力，并量化评估每个模型-框架组合对这些能力的支持程度。基于此能力画像，我们提出SkillRT——一个为可移植、高效技能执行设计的编译与运行时系统。在编译时，SkillRT执行基于能力的编译、环境绑定与并发性提取；在运行时，则通过即时代码固化与自适应重编译进行性能优化。
我们在八种不同规模的LLM与三种智能体框架上评估SkillRT，覆盖SkillsBench基准测试及典型技能任务。实验结果表明，SkillRT显著提升了不同模型与环境下的任务完成率，同时将令牌消耗降低最高达40%。性能方面，SkillRT通过增强并行性实现最高3.2倍加速，并借助代码固化将延迟降低19-50倍。

摘要 (Abstract)

LLM agents increasingly adopt skills as a reusable unit of composition. While skills are shared across diverse agent platforms, current systems treat them as raw context, causing the same skill to behave inconsistently for different agents. This fragility undermines skill portability and execution efficiency. To address this challenge, we analyze 118,000 skills and draw inspiration from traditional compiler design. We treat skills as code and LLMs as heterogeneous processors. To make portability actionable, we decompose a skill’s requirements into a set of primitive capabilities, and measure how well each model-harness pair supports them. Based on these capability profiles, we propose SkillRT, a compilation and runtime system designed for portable and efficient skill execution. At compile time, SkillRT performs capability-based compilation, environment binding, and concurrency extraction. At runtime, SkillRT applies JIT code solidification and adaptive recompilation for performance optimization. We evaluate SkillRT across eight LLMs of varying scales and three agent harnesses, covering SkillsBench and representative skill tasks. Results demonstrate that SkillRT significantly improves task completion rates across different models and environments while reducing token consumption by up to 40%. In terms of performance, SkillRT achieves up to 3.2x speedup with enhanced parallelism, and 19-50x latency reduction through code solidification.

关键词: LLM agents, skill compilation, portability, execution efficiency, capability profiling, runtime optimization, JIT solidification, adaptive recompilation

209. ❌ On Data-Driven Koopman Representations of Nonlinear Delay Differential Equations

作者: Santosh Mohan Rajkumar, Dibyasri Barman, Kumar Vikram Singh, Debdipta Goswami 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03086v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究非线性延迟微分方程的Koopman表示和数据驱动学习，属于AI for Science（科学AI）领域，与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分），但论文未涉及大模型、深度学习技术原理或任何其他关键词，因此其他关键词均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于历史离散化和核扩展动态模态分解的有限维Koopman近似框架，用于非线性延迟微分方程，并推导了确定性误差界限，实现了延迟系统的可靠预测和控制。

摘要翻译

本研究在无限维时滞动力学与有限维库普曼学习之间建立了严谨的理论桥梁，并提供了明确且可解释的误差保证。尽管库普曼分析在常微分方程（ODEs）领域已较为成熟，并在偏微分方程（PDEs）中有所应用，但由于时滞微分方程（DDEs）具有无限维相空间，其向DDEs的扩展仍十分有限。我们提出了一种基于历史离散化与合适重构算子的有限维库普曼近似框架，使得通过基于核的扩展动态模态分解（kEDMD）能够对库普曼算子进行可处理的表示。我们为学习得到的预测器推导了确定性误差界，将总误差分解为来自历史离散化、核插值以及数据驱动回归的贡献。此外，我们开发了一种基于核的重构方法，可从提升后的库普曼坐标中恢复离散化状态，并提供了可证明的保证。数值结果表明，学习得到的预测器在离散化精度和训练数据方面均表现出收敛性，从而支持对时滞系统进行可靠的预测与控制。

摘要 (Abstract)

This work establishes a rigorous bridge between infinite-dimensional delay dynamics and finite-dimensional Koopman learning, with explicit and interpretable error guarantees. While Koopman analysis is well-developed for ordinary differential equations (ODEs) and partially for partial differential equations (PDEs), its extension to delay differential equations (DDEs) remains limited due to the infinite-dimensional phase space of DDEs. We propose a finite-dimensional Koopman approximation framework based on history discretization and a suitable reconstruction operator, enabling a tractable representation of the Koopman operator via kernel-based extended dynamic mode decomposition (kEDMD). Deterministic error bounds are derived for the learned predictor, decomposing the total error into contributions from history discretization, kernel interpolation, and data-driven regression. Additionally, we develop a kernel-based reconstruction method to recover discretized states from lifted Koopman coordinates, with provable guarantees. Numerical results demonstrate convergence of the learned predictor with respect to both discretization resolution and training data, supporting reliable prediction and control of delay systems.

关键词: Koopman operator, delay differential equations, data-driven learning, dynamic mode decomposition, error bounds, history discretization, nonlinear systems, control prediction

210. ❌ Learning Contractive Integral Operators with Fredholm Integral Neural Operators

作者: Kyriakos C. Georgiou, Constantinos Siettos, Athanasios N. Yannacopoulos 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03034v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文提出了一种名为FREDINO的神经网络算子框架，用于学习Fredholm积分方程中的非扩张积分算子，并应用于求解非线性椭圆偏微分方程。该研究属于科学机器学习/数值分析领域，与大多数关键词（如LLM、MoE、RLHF、RAG等）完全无关，因为这些关键词主要涉及大语言模型及其相关技术。唯一相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为论文属于科学机器学习（Scientific Machine Learning）范畴，但并非生物信息学或化学信息学，因此给予5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文提出了Fredholm积分神经算子（FREDINOs）框架，用于学习和逼近Fredholm积分方程中的非扩张积分算子，并证明了其通用逼近性和收缩性，同时展示了其在求解非线性椭圆偏微分方程中的应用。

摘要翻译

我们将弗雷德霍姆神经网络框架推广至任意维度，用于学习第二类弗雷德霍姆积分方程中出现的非扩张积分算子。首先，我们针对弗雷德霍姆积分方程提出了弗雷德霍姆积分神经算子，并证明其能够通用逼近线性和非线性积分算子及相应的解算子。进一步，我们证明了所学算子具有严格的压缩性，从而完全满足不动点迭代收敛所需的数学性质。最后，我们还展示了如何通过边界积分方程形式，利用弗雷德霍姆积分神经算子学习非线性椭圆型偏微分方程的解算子。我们通过多个基准问题对所提方法进行了数值评估：包括任意维度的线性和非线性弗雷德霍姆积分方程，以及二维非线性椭圆型偏微分方程。基于定制化的数学与数值分析理论，弗雷德霍姆积分神经算子能够提供高精度逼近和可解释的求解方案，使其非常适用于科学机器学习与数值分析计算。

摘要 (Abstract)

We generalize the framework of Fredholm Neural Networks, to learn non-expansive integral operators arising in Fredholm Integral Equations (FIEs) of the second kind in arbitrary dimensions. We first present the proposed Fredholm Integral Neural Operators (FREDINOs), for FIEs and prove that they are universal approximators of linear and non-linear integral operators and corresponding solution operators. We furthermore prove that the learned operators are guaranteed to be contractive, thereby strictly satisfying the mathematical property required for the convergence of the fixed point scheme. Finally, we also demonstrate how FREDINOs can be used to learn the solution operator of non-linear elliptic PDEs, via a Boundary Integral Equation (BIE) formulation. We assess the proposed methodology numerically, via several benchmark problems: linear and non-linear FIEs in arbitrary dimensions, as well as a non-linear elliptic PDE in 2D. Built on tailored mathematical/numerical analysis theory, FREDINOs offer high-accuracy approximations and interpretable schemes, making them well suited for scientific machine learning/numerical analysis computations.

关键词: Fredholm Integral Neural Operators, Fredholm Integral Equations, contractive operators, universal approximators, non-linear elliptic PDEs, Boundary Integral Equation, scientific machine learning, numerical analysis

211. ❌ Generating DDPM-based Samples from Tilted Distributions

作者: Himadri Mandal, Dhruman Gupta, Rushil Gupta, Sarvesh Ravichandran Iyer, Agniv Bandyopadhyay, Achal Bassamboo, Varun Gupta, Sandeep Juneja 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03015v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究扩散模型（DDPM）在倾斜分布中的样本生成问题，属于扩散模型的理论和应用研究，但所有关键词均针对大语言模型（LLM）及其相关技术（如训练方法、推理优化、应用框架等），而论文未涉及任何LLM内容，也未提及深度学习在科学领域的创新应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文研究了如何从倾斜分布中生成扩散模型样本，提出了一个最小化最优的插件估计器，并证明了其理论性能。

摘要翻译

给定从 $d$ 维概率分布中独立抽取的 $n$ 个样本，我们的目标是通过对原始分布进行倾斜（倾斜程度由参数 $θ\in \mathbb{R}^d$ 控制）后，生成基于扩散模型的样本。我们定义了一种插件估计量，并证明其具有极小极大最优性。我们建立了插件估计量分布与真实分布之间的 Wasserstein 距离界，该界是 $n$ 和 $θ$ 的函数，从而阐明了输出分布与期望真实分布接近的若干机制。此外，在某些假设下，我们证明了在这些倾斜样本上运行扩散模型可获得总变差（TV）精度。我们的理论结果得到了大量模拟实验的支持。本工作的应用领域包括金融、天气与气候建模以及其他许多领域，其目标可能是从满足实际驱动的矩约束的倾斜分布中生成样本。

摘要 (Abstract)

Given $n$ independent samples from a $d$-dimensional probability distribution, our aim is to generate diffusion-based samples from a distribution obtained by tilting the original, where the degree of tilt is parametrized by $θ\in \mathbb{R}^d$. We define a plug-in estimator and show that it is minimax-optimal. We develop Wasserstein bounds between the distribution of the plug-in estimator and the true distribution as a function of $n$ and $θ$, illustrating regimes where the output and the desired true distribution are close. Further, under some assumptions, we prove the TV-accuracy of running Diffusion on these tilted samples. Our theoretical results are supported by extensive simulations. Applications of our work include finance, weather and climate modelling, and many other domains, where the aim may be to generate samples from a tilted distribution that satisfies practically motivated moment constraints.

关键词: diffusion models, DDPM, tilted distributions, sample generation, Wasserstein bounds, minimax-optimal, plug-in estimator, TV-accuracy

212. ❌ Inversion-Free Natural Gradient Descent on Riemannian Manifolds

作者: Dario Draca, Takuo Matsubara, Minh-Ngoc Tran 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02969v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于黎曼流形上的自然梯度下降优化方法，属于数学优化和统计学习领域。虽然论文在变分贝叶斯和归一化流等应用场景中进行了演示，但其核心贡献是数学优化算法（inversion-free stochastic natural gradient method on Riemannian manifolds），而非大模型、深度学习技术或AI for Science的具体应用。所有关键词均涉及大模型技术、训练方法、推理优化、AI应用等特定领域，与该论文的数学优化主题完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种在黎曼流形上的无逆随机自然梯度方法，通过在线近似逆Fisher信息矩阵并利用传输操作处理不同切空间中的得分向量，证明了算法的收敛性，并在变分贝叶斯和归一化流应用中展示了优于欧几里得对应方法的性能。

摘要翻译

自然梯度方法在统计优化中应用广泛，但其标准形式假设参数空间为欧几里得空间。本文针对参数位于黎曼流形上的概率分布，提出了一种无需显式求逆的随机自然梯度方法。流形框架具有多重优势：可以隐式地强制参数约束（如正定性与正交性），确保参数可识别，或保证目标函数的正则性质（如测地凸性）。基于流形上费希尔信息矩阵（FIM）的内蕴表述，本方法在线维护逆FIM的近似，该近似通过连续迭代点处采样的得分向量以二次计算代价高效更新。在黎曼流形框架下，这些得分向量属于不同的切空间，必须借助传输操作进行组合。我们证明了当步长指数α>2/3时，算法到最小化点的平方距离具有$O(\log{s}/s^α)$的几乎必然收敛速率。同时，我们也建立了近似FIM的几乎必然收敛速率，此时近似过程会累积基于传输的误差。本文进一步提出了具有次二次存储复杂度的算法有限内存变体。最后，我们在高斯近似变分贝叶斯与标准化流模型上验证了本方法相较于欧几里得对应方法的有效性。

摘要 (Abstract)

The natural gradient method is widely used in statistical optimization, but its standard formulation assumes a Euclidean parameter space. This paper proposes an inversion-free stochastic natural gradient method for probability distributions whose parameters lie on a Riemannian manifold. The manifold setting offers several advantages: one can implicitly enforce parameter constraints such as positive definiteness and orthogonality, ensure parameters are identifiable, or guarantee regularity properties of the objective like geodesic convexity. Building on an intrinsic formulation of the Fisher information matrix (FIM) on a manifold, our method maintains an online approximation of the inverse FIM, which is efficiently updated at quadratic cost using score vectors sampled at successive iterates. In the Riemannian setting, these score vectors belong to different tangent spaces and must be combined using transport operations. We prove almost-sure convergence rates of $O(\log{s}/s^α)$ for the squared distance to the minimizer when the step size exponent $α>2/3$. We also establish almost-sure rates for the approximate FIM, which now accumulates transport-based errors. A limited-memory variant of the algorithm with sub-quadratic storage complexity is proposed. Finally, we demonstrate the effectiveness of our method relative to its Euclidean counterparts on variational Bayes with Gaussian approximations and normalizing flows.

关键词: natural gradient method, Riemannian manifold, Fisher information matrix, stochastic optimization, variational Bayes, normalizing flows, convergence analysis, parameter constraints

213. ❌ Explainable Machine Learning Reveals 12-Fold Ucp1 Upregulation and Thermogenic Reprogramming in Female Mouse White Adipose Tissue After 37 Days of Microgravity: First AI/ML Analysis of NASA OSD-970

作者: Md. Rashadul Islam 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02942v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

评分理由: 论文主要应用传统机器学习方法（如随机森林、PCA）和可解释AI技术（SHAP）分析微重力对小鼠白色脂肪组织基因表达的影响，属于AI在生物信息学/科学领域的应用，与’Explainable AI’和’AI for Science’高度相关（10分）。但论文未涉及大语言模型、MoE、模型训练/微调、推理优化、智能体等大模型相关技术，其他关键词均得0分。

!!! tip deepseek-chat TL;DR

本研究首次应用可解释机器学习分析NASA微重力实验数据，发现女性小鼠白色脂肪组织在37天微重力暴露后Ucp1基因表达上调12.21倍，并激活产热通路，揭示了微重力诱导的快速产热重编程机制。

摘要翻译

微重力环境会引发哺乳动物生理的深刻代谢适应，然而调控雌性白色脂肪组织（White Adose Tissue, WAT）产热的分子机制仍不甚明确。本文首次对源自啮齿动物研究-1（Rodent Research-1, RR-1）任务的NASA开放科学数据仓库（Open Science Data Repository, OSDR）数据集OSD-970进行了机器学习（Machine Learning, ML）分析。利用在国际空间站（International Space Station, ISS）停留37天后16只雌性C57BL/6J小鼠（8只飞行组，8只地面对照组）性腺WAT中89个脂肪生成与产热通路基因的RT-qPCR数据，我们应用了差异表达分析、采用留一法交叉验证（Leave-One-Out Cross-Validation, LOO-CV）的多种ML分类器，以及通过SHapley可加性解释（SHapley Additive exPlanations, SHAP）的可解释人工智能方法。最显著的发现是，暴露于微重力的WAT中Ucp1基因表达出现惊人的12.21倍上调（ΔΔCt = -3.61, p = 0.0167），同时产热通路被显著激活（平均通路倍数变化 = 3.24）。性能最佳的模型（基于前20个特征的随机森林）通过LOO-CV实现了AUC = 0.922、准确率 = 0.812和F1分数 = 0.824。SHAP分析一致将Ucp1列为最重要的预测特征之一，而Angpt2、Irs2、Jun以及Klf家族转录因子则成为主导性的共识分类特征。主成分分析（Principal Component Analysis, PCA）显示飞行组与地面对照组样本明显分离，第一主成分（PC1）解释了69.1%的方差。这些结果表明，雌性WAT中存在快速的产热重编程，这是对微重力环境的一种代偿性反应。本研究证明了可解释人工智能在重新分析新发布的NASA空间生物学数据集方面的强大能力，其结果对长期任务中女性宇航员的健康以及地球上的肥胖与代谢疾病研究具有直接意义。

摘要 (Abstract)

Microgravity induces profound metabolic adaptations in mammalian physiology, yet the molecular mechanisms governing thermogenesis in female white adipose tissue (WAT) remain poorly characterized. This paper presents the first machine learning (ML) analysis of NASA Open Science Data Repository (OSDR) dataset OSD-970, derived from the Rodent Research-1 (RR-1) mission. Using RT-qPCR data from 89 adipogenesis and thermogenesis pathway genes in gonadal WAT of 16 female C57BL/6J mice (8 flight, 8 ground control) following 37 days aboard the International Space Station (ISS), we applied differential expression analysis, multiple ML classifiers with Leave-One-Out Cross-Validation (LOO-CV), and Explainable AI via SHapley Additive exPlanations (SHAP). The most striking finding is a dramatic 12.21-fold upregulation of Ucp1 (Delta-Delta-Ct = -3.61, p = 0.0167) in microgravity-exposed WAT, accompanied by significant activation of the thermogenesis pathway (mean pathway fold-change = 3.24). The best-performing model (Random Forest with top-20 features) achieved AUC = 0.922, Accuracy = 0.812, and F1 = 0.824 via LOO-CV. SHAP analysis consistently ranked Ucp1 among the top predictive features, while Angpt2, Irs2, Jun, and Klf-family transcription factors emerged as dominant consensus classifiers. Principal component analysis (PCA) revealed clear separation between flight and ground samples, with PC1 explaining 69.1% of variance. These results suggest rapid thermogenic reprogramming in female WAT as a compensatory response to microgravity. This study demonstrates the power of explainable AI for re-analysis of newly released NASA space biology datasets, with direct implications for female astronaut health on long-duration missions and for Earth-based obesity and metabolic disease research.

关键词: Explainable AI, Machine Learning, Microgravity, White Adipose Tissue, Ucp1, Thermogenesis, NASA OSDR, SHAP

214. ❌ Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms

作者: Andreas Boltres, Niklas Freymuth, Benjamin Schichtholz, Michael König, Gerhard Neumann 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02927v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是计算机网络中的路由算法优化问题，具体关注利用神经网络和强化学习进行实时遥测感知路由。虽然论文使用了神经网络和强化学习技术，但所有关键词都明确针对大语言模型（LLMs）及其相关技术（如MoE、SFT、RLHF、RAG等）、特定推理方法（如CoT、System 2）、模型优化技术（如量化、注意力机制）或科学AI应用。论文内容完全不涉及语言模型、文本生成、对话系统或任何关键词中提到的具体大模型技术，而是专注于网络路由这一特定工程问题，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一个考虑通信和推理延迟的神经路由算法框架LOGGIA，它通过数据驱动的预训练和策略强化学习来优化网络路由，实验表明在真实网络拓扑和混合流量下，LOGGIA优于最短路径基线，且完全本地部署效果最佳。

摘要翻译

路由算法对于计算机网络的高效运行至关重要，在许多场景中，它们必须能够在毫秒级时间内对流量突发做出反应。实时遥测数据可为路由算法提供信息信号，近期研究已尝试训练神经网络以利用此类信号实现流量感知路由。然而，全网范围的信息聚合会受通信延迟影响，现有神经路由方法要么假设不切实际的无延迟全局状态，要么将路由器限制为纯局部遥测，这导致其在实际环境中的可部署性尚不明确。本文将遥测感知路由建模为延迟感知的闭环控制问题，并提出一个在明确建模通信与推理延迟的同时训练和评估神经路由算法的框架。基于此框架，我们提出LOGGIA——一种可扩展的图神经路由算法，它能够从带属性的拓扑-遥测图中预测对数空间链路权重。该算法采用数据驱动的预训练阶段，继而进行策略强化学习。在合成与真实网络拓扑以及未见过的混合TCP/UDP流量序列测试中，LOGGIA始终优于最短路径基线算法，而现有神经基线方法在引入真实延迟后均失效。我们的实验进一步表明，类似LOGGIA的神经路由算法在完全本地化部署时表现最佳，即在每个路由器上独立观测网络状态并推断行动，而非采用集中式决策。

摘要 (Abstract)

Routing algorithms are crucial for efficient computer network operations, and in many settings they must be able to react to traffic bursts within milliseconds. Live telemetry data can provide informative signals to routing algorithms, and recent work has trained neural networks to exploit such signals for traffic-aware routing. Yet, aggregating network-wide information is subject to communication delays, and existing neural approaches either assume unrealistic delay-free global states, or restrict routers to purely local telemetry. This leaves their deployability in real-world environments unclear. We cast telemetry-aware routing as a delay-aware closed-loop control problem and introduce a framework that trains and evaluates neural routing algorithms, while explicitly modeling communication and inference delays. On top of this framework, we propose LOGGIA, a scalable graph neural routing algorithm that predicts log-space link weights from attributed topology-and-telemetry graphs. It utilizes a data-driven pre-training stage, followed by on-policy Reinforcement Learning. Across synthetic and real network topologies, and unseen mixed TCP/UDP traffic sequences, LOGGIA consistently outperforms shortest-path baselines, whereas neural baselines fail once realistic delays are enforced. Our experiments further suggest that neural routing algorithms like LOGGIA perform best when deployed fully locally, i.e., observing network states and inferring actions at every router individually, as opposed to centralized decision making.

关键词: telemetry-aware routing, neural routing algorithms, graph neural networks, reinforcement learning, delay-aware control, network optimization, LOGGIA, TCP/UDP traffic

215. ❌ Efficient Logistic Regression with Mixture of Sigmoids

作者: Federico Di Gennaro, Saptarshi Chakraborty, Nikita Zhivotovskiy 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02920v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

评分理由: 论文研究在线逻辑回归中的指数权重算法，属于经典机器学习优化理论范畴。所有评分关键词均围绕大模型、深度学习技术及其应用，而本文完全不涉及这些主题。标题中的’Mixture of Sigmoids’是逻辑回归的数学表述，与’Mixture of Experts’无关。摘要未提及任何大模型、深度学习、AI for Science等相关内容，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文研究了在线逻辑回归中指数权重算法的计算效率和几何性质，显著降低了实现最优遗憾界的最坏情况计算复杂度，并分析了在线性可分条件下算法随参数B增大的收敛行为。

摘要翻译

本文研究了在线逻辑回归中采用各向同性高斯先验的指数权重（Exponential Weights, EW）算法。我们证明，针对范数至多为 $B$ 的最佳线性预测器，Kakade 和 Ng（2005）所建立的 EW 近乎最优的最坏情况遗憾界 $O(d\log(Bn))$，可以在总最坏情况计算复杂度 $O(B^3 n^5)$ 内实现。这显著改进了先前达到相同保证（Foster 等人，2018）的 $O(B^{18}n^{37})$ 复杂度。除了效率之外，我们分析了线性可分性下的大 $B$ 机制：经过 $B$ 的重新缩放后，EW 后验分布随着 $B\to\infty$ 收敛到一个截断至版本锥（version cone）的标准高斯分布。相应地，预测器收敛为对分离方向的立体角投票，并且在该锥体的每个固定间隔切片上，相应截断高斯分布的众数与硬间隔支持向量机（hard-margin SVM）方向对齐。利用此几何特性，我们推导了非渐近遗憾界，表明一旦 $B$ 超过一个依赖于间隔的阈值，遗憾将不再依赖于 $B$，并且仅随逆间隔的对数增长。总体而言，我们的结果表明，EW 算法在线性分类中既能实现计算上的可处理性，又能具备几何自适应性。

摘要 (Abstract)

This paper studies the Exponential Weights (EW) algorithm with an isotropic Gaussian prior for online logistic regression. We show that the near-optimal worst-case regret bound $O(d\log(Bn))$ for EW, established by Kakade and Ng (2005) against the best linear predictor of norm at most $B$, can be achieved with total worst-case computational complexity $O(B^3 n^5)$. This substantially improves on the $O(B^{18}n^{37})$ complexity of prior work achieving the same guarantee (Foster et al., 2018). Beyond efficiency, we analyze the large-$B$ regime under linear separability: after rescaling by $B$, the EW posterior converges as $B\to\infty$ to a standard Gaussian truncated to the version cone. Accordingly, the predictor converges to a solid-angle vote over separating directions and, on every fixed-margin slice of this cone, the mode of the corresponding truncated Gaussian is aligned with the hard-margin SVM direction. Using this geometry, we derive non-asymptotic regret bounds showing that once $B$ exceeds a margin-dependent threshold, the regret becomes independent of $B$ and grows only logarithmically with the inverse margin. Overall, our results show that EW can be both computationally tractable and geometrically adaptive in online classification.

关键词: Online Logistic Regression, Exponential Weights Algorithm, Worst-case Regret, Computational Complexity, Linear Separability, Hard-margin SVM, Geometric Analysis, Convergence Analysis

216. ❌ Scalable Mean-Variance Portfolio Optimization via Subspace Embeddings and GPU-Friendly Nesterov-Accelerated Projected Gradient

作者: Yi-Shuai Niu, Yajuan Wang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02917v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于金融投资组合优化算法，开发了基于随机子空间嵌入和GPU加速的Nesterov加速投影梯度算法，用于大规模约束均值-方差投资组合优化。论文内容完全围绕数值优化、矩阵计算和金融工程，不涉及任何大语言模型、深度学习、AI for Science或相关技术原理。所有关键词均与大模型、深度学习、AI科学应用或相关技术无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种基于随机子空间嵌入和GPU加速的Nesterov加速投影梯度算法，用于高效求解大规模约束均值-方差投资组合优化问题，在保持目标精度的同时大幅减少了计算时间。

摘要翻译

我们开发了一种基于草图技术的因子降维方法，以及结合GPU加速的Nesterov加速投影梯度算法（NPGA），从而构建了一个针对大规模约束均值-方差投资组合优化问题的双重加速求解器。该方法从样本协方差因子$L$出发，结合随机子空间嵌入、谱截断和岭稳定化技术，构造出一个有效因子$L_{eff}$。随后，通过标量对偶搜索和GPU友好的矩阵-向量计算内核实现结构化投影，求解由此产生的约束优化问题，形成了一条统一处理基准模型、草图模型以及草图-截断-岭（STR）正则化模型的计算流程。我们还为草图模型和STR模型建立了近似性、条件数和稳定性保证，包括在$(\varepsilon,\delta)$-子空间嵌入下，对协方差近似误差、最优值误差及解扰动的显式$O(\varepsilon)$界。在合成和实际股票收益数据上的实验表明，该方法在保持目标函数精度的同时，显著减少了运行时间。在一个包含5440种资产、48374个训练周期的实际数据基准测试中，NPGA-GPU求解未降维的完整模型耗时2.80秒，而Gurobi需要64.84秒；经过优化的压缩GPU变体则保持在个位数秒级。这些结果表明，完整的稠密模型在现代GPU上已具备实用性，且在压缩后，计算瓶颈主要在于投影步骤而非矩阵-向量乘法。

摘要 (Abstract)

We develop a sketch-based factor reduction and a Nesterov-accelerated projected gradient algorithm (NPGA) with GPU acceleration, yielding a doubly accelerated solver for large-scale constrained mean-variance portfolio optimization. Starting from the sample covariance factor $L$, the method combines randomized subspace embedding, spectral truncation, and ridge stabilization to construct an effective factor $L_{eff}$. It then solves the resulting constrained problem with a structured projection computed by scalar dual search and GPU-friendly matrix-vector kernels, yielding one computational pipeline for the baseline, sketched, and Sketch-Truncate-Ridge (STR)-regularized models. We also establish approximation, conditioning, and stability guarantees for the sketching and STR models, including explicit $O(\varepsilon)$ bounds for the covariance approximation, the optimal value error, and the solution perturbation under $(\varepsilon,δ)$-subspace embeddings. Experiments on synthetic and real equity-return data show that the method preserves objective accuracy while reducing runtime substantially. On a 5440-asset real-data benchmark with 48374 training periods, NPGA-GPU solves the unreduced full model in 2.80 seconds versus 64.84 seconds for Gurobi, while the optimized compressed GPU variants remain in the low-single-digit-second regime. These results show that the full dense model is already practical on modern GPUs and that, after compression, the remaining bottleneck is projection rather than matrix-vector multiplication.

关键词: portfolio optimization, subspace embedding, GPU acceleration, Nesterov-accelerated projected gradient, mean-variance, sketching, covariance approximation, computational efficiency

217. ❌ Extracting Money Laundering Transactions from Quasi-Temporal Graph Representation

作者: Haseeb Tariq, Marwan Hassani 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02899v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于使用监督学习框架检测金融交易中的洗钱活动，核心方法是基于图表示（quasi-temporal graph）的机器学习模型。论文未涉及任何大语言模型（LLM）、深度学习技术原理创新、模型训练/微调方法（如预训练、指令调优、RLHF）、推理优化（如注意力机制、解码加速）、代理系统、模型压缩或科学AI应用。所有评分关键词均与大模型或深度学习技术直接相关，而本文属于传统机器学习在金融风控领域的应用，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为ExSTraQt的监督学习框架，通过准时序图表示来检测金融交易中的洗钱活动，在真实和合成数据集上实现了F1分数的提升。

摘要翻译

洗钱活动对全球金融机构构成持续挑战，犯罪组织不断更新其策略以规避监测系统。传统的反洗钱方法主要依赖预定义的风险规则，导致调查资源密集且误报警报数量居高不下。为了在每日处理数十亿笔交易的同时控制运营成本不致激增，金融机构正投资于更复杂的机制以改进现有系统。本文提出ExSTraQt（基于准时序图表示的可疑交易提取方法），这是一种先进的监督学习框架，用于在金融数据集中检测洗钱（或可疑）交易。与当前最先进的反洗钱检测模型相比，我们提出的框架在性能上表现卓越。该框架的核心优势在于其设计的纯粹简洁性（参数数量极少）以及可扩展性（计算和内存需求较低）。我们使用一个真实数据集和一组合成金融交易数据集，在交易级别检测准确率上评估了本框架。在多数数据集上，我们的方法持续提升了F1分数：真实数据集上最高提升1%，其中一个合成数据集上提升超过8%。我们还主张，该框架能够无缝补充银行现有的反洗钱检测系统。我们的代码与数据集已公开于https://github.com/mhaseebtariq/exstraqt。

摘要 (Abstract)

Money laundering presents a persistent challenge for financial institutions worldwide, while criminal organizations constantly evolve their tactics to bypass detection systems. Traditional anti-money laundering approaches mainly rely on predefined risk-based rules, leading to resource-intensive investigations and high numbers of false positive alerts. In order to restrict operational costs from exploding, while billions of transactions are being processed every day, financial institutions are investing in more sophisticated mechanisms to improve existing systems. In this paper, we present ExSTraQt (EXtract Suspicious TRAnsactions from Quasi-Temporal graph representation), an advanced supervised learning approach to detect money laundering (or suspicious) transactions in financial datasets. Our proposed framework excels in performance, when compared to the state-of-the-art AML (Anti Money Laundering) detection models. The key strengths of our framework are sheer simplicity, in terms of design and number of parameters; and scalability, in terms of the computing and memory requirements. We evaluated our framework on transaction-level detection accuracy using a real dataset; and a set of synthetic financial transaction datasets. We consistently achieve an uplift in the F1 score for most datasets, up to 1% for the real dataset; and more than 8% for one of the synthetic datasets. We also claim that our framework could seamlessly complement existing AML detection systems in banks. Our code and datasets are available at https://github.com/mhaseebtariq/exstraqt.

关键词: money laundering detection, supervised learning, quasi-temporal graph, financial transactions, AML systems, F1 score improvement, scalable framework

218. ❌ Lipschitz bounds for integral kernels

作者: Justin Reverdi, Sixin Zhang, Fabrice Gamboa, Serge Gratton 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02887v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是核方法中积分核的Lipschitz连续性理论分析，属于数学和机器学习理论领域，与所有评分关键词（均围绕大模型、深度学习技术原理及应用）完全无关。论文未涉及任何大模型、深度学习技术、AI应用或相关创新方法。

!!! tip deepseek-chat TL;DR

该论文研究了积分核特征映射的Lipschitz连续性，推导了Lipschitz常数的显式公式，并应用于高斯核、ReLU随机神经网络核等具体核函数，揭示了特征映射Lipschitz连续性与权重分布二阶矩有限性的等价关系。

摘要翻译

与正定核相关的特征映射在核方法与学习理论中占据核心地位，其利普希茨连续性等正则性质与鲁棒性和稳定性保证密切相关。尽管这些特征映射至关重要，但仅在有限情况下存在对其利普希茨常数的显式刻画。本文研究在可微性假设下，积分核所对应特征映射的利普希茨正则性。我们首先给出确保利普希茨连续性的充分条件，并推导相应利普希茨常数的显式公式。随后，我们识别出导致特征映射不满足利普希茨连续性的条件，并将这些结果应用于若干重要核函数类别。对于具有各向同性高斯权重分布的无限宽度两层神经网络，我们证明其关联核的利普希茨常数可表达为一个二维积分的上确界，从而为高斯核和ReLU随机神经网络核提供了显式刻画。我们还研究了高斯核、拉普拉斯核和Matérn核等连续平移不变核，这些核可解释为具有余弦激活函数的神经网络。在此框架下，我们证明特征映射满足利普希茨连续性的充要条件是权重分布具有有限二阶矩，并进一步推导出其利普希茨常数。最后，我们提出一个关于有限宽度神经网络中利普希茨常数收敛渐近行为的开放性问题。数值实验为这一行为提供了支持。

摘要 (Abstract)

Feature maps associated with positive definite kernels play a central role in kernel methods and learning theory, where regularity properties such as Lipschitz continuity are closely related to robustness and stability guarantees. Despite their importance, explicit characterizations of the Lipschitz constant of kernel feature maps are available only in a limited number of cases. In this paper, we study the Lipschitz regularity of feature maps associated with integral kernels under differentiability assumptions. We first provide sufficient conditions ensuring Lipschitz continuity and derive explicit formulas for the corresponding Lipschitz constants. We then identify a condition under which the feature map fails to be Lipschitz continuous and apply these results to several important classes of kernels. For infinite width two-layer neural network with isotropic Gaussian weight distributions, we show that the Lipschitz constant of the associated kernel can be expressed as the supremum of a two-dimensional integral, leading to an explicit characterization for the Gaussian kernel and the ReLU random neural network kernel. We also study continuous and shift-invariant kernels such as Gaussian, Laplace, and Matérn kernels, which admit an interpretation as neural network with cosine activation function. In this setting, we prove that the feature map is Lipschitz continuous if and only if the weight distribution has a finite second-order moment, and we then derive its Lipschitz constant. Finally, we raise an open question concerning the asymptotic behavior of the convergence of the Lipschitz constant in finite width neural networks. Numerical experiments are provided to support this behavior.

关键词: Lipschitz continuity, integral kernels, feature maps, kernel methods, neural networks, Gaussian kernel, ReLU kernel, Matérn kernel

219. ❌ Toward an Operational GNN-Based Multimesh Surrogate for Fast Flood Forecasting

作者: Valentin Mercier, Serge Gratton, Lapeyre Corentin, Gwenaël Chevallet 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02876v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于使用图神经网络（GNN）作为洪水预测的替代模型，属于AI在科学领域的应用（具体为水文学和计算物理），因此仅与关键词’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（评5分），因为洪水预测可视为科学计算应用。论文未涉及大语言模型（LLM）、深度学习技术原理创新（如MoE、Scaling Laws、训练方法、推理优化、智能体等）或生物信息学/化学信息学，故其他关键词均评0分。

!!! tip deepseek-chat TL;DR

该研究针对洪水预测中高保真水力模拟计算耗时过长的问题，开发了一种基于图神经网络的多网格替代模型，在保持精度的同时将6小时预测时间从约180分钟大幅缩短至0.4秒。

摘要翻译

当前业务化洪水预报仍依赖于高保真度的二维水力学求解器，但其运行时间对于大型城市洪泛区的快速决策支持而言往往难以承受。与此同时，基于人工智能的代理模型在计算物理学的多个领域已展现出加速昂贵高保真模拟的强大潜力。本研究针对法国下泰特河（Têt River）流域，从一个生产级Telemac2D模型出发开展研究，该模型建立在包含超过$4\times 10^5$个节点的高分辨率非结构有限元网格上。基于此设置，我们构建了一个可用于机器学习、合成但基于业务场景的洪水事件数据库，其中涵盖了多个具有代表性的水文过程线类型及峰值流量。在此数据库基础上，我们开发了一种基于投影网格（projected meshes）与多网格连接（multimesh connectivity）的图神经网络代理模型。投影网格策略在保持来自原始Telemac模拟的高保真监督的同时，使训练易于处理；而多网格构造则在不增加网络深度的前提下，扩大了有效的空间感受野。我们进一步研究了显式流量特征$Q(t)$以及前推训练（pushforward training）对长时间自回归推演的影响。实验表明：在此边界驱动场景中，以$Q(t)$为条件至关重要；在模型得到适当条件化后，多网格连接能带来额外增益；而前推训练则进一步提升了推演的稳定性。在所有测试配置中，结合$Q(t)$、多网格连接和前推训练的组合取得了最佳整体效果。这些增益既体现在代理网格上的水力学变量，也体现在插值到统一$25,\mathrm{m}$规则网格上的淹没图与原始高分辨率Telemac解的比较中。在所研究的案例中，学习得到的代理模型在单张NVIDIA A100 GPU上生成6小时预测仅需约$0.4,\mathrm{s}$，而参考模拟在56个CPU核心上运行约需$180,\mathrm{min}$。这些结果表明，基于图的代理模型可作为工业级水力学求解器在业务化洪水淹没制图方面的实用补充。

摘要 (Abstract)

Operational flood forecasting still relies on high-fidelity two-dimensional hydraulic solvers, but their runtime can be prohibitive for rapid decision support on large urban floodplains. In parallel, AI-based surrogate models have shown strong potential in several areas of computational physics for accelerating otherwise expensive high-fidelity simulations. We address this issue on the lower Têt River (France), starting from a production-grade Telemac2D model defined on a high-resolution unstructured finite-element mesh with more than $4\times 10^5$ nodes. From this setup, we build a learning-ready database of synthetic but operationally grounded flood events covering several representative hydrograph families and peak discharges. On top of this database, we develop a graph-neural surrogate based on projected meshes and multimesh connectivity. The projected-mesh strategy keeps training tractable while preserving high-fidelity supervision from the original Telemac simulations, and the multimesh construction enlarges the effective spatial receptive field without increasing network depth. We further study the effect of an explicit discharge feature $Q(t)$ and of pushforward training for long autoregressive rollouts. The experiments show that conditioning on $Q(t)$ is essential in this boundary-driven setting, that multimesh connectivity brings additional gains once the model is properly conditioned, and that pushforward further improves rollout stability. Among the tested configurations, the combination of $Q(t)$, multimesh connectivity, and pushforward provides the best overall results. These gains are observed both on hydraulic variables over the surrogate mesh and on inundation maps interpolated onto a common $25,\mathrm{m}$ regular grid and compared against the original high-resolution Telemac solution. On the studied case, the learned surrogate produces 6-hour predictions in about $0.4,\mathrm{s}$ on a single NVIDIA A100 GPU, compared with about $180,\mathrm{min}$ on 56 CPU cores for the reference simulation. These results support graph-based surrogates as practical complements to industrial hydraulic solvers for operational flood mapping.

关键词: flood forecasting, graph neural network, surrogate model, Telemac2D, multimesh connectivity, hydraulic simulation, computational physics, operational forecasting

220. ❌ Transfer Learning for Loan Recovery Prediction under Distribution Shifts with Heterogeneous Feature Spaces

作者: Christopher Gerling, Hanqiu Peng, Ying Chen, Stefan Lessmann 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02832v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于信用风险领域的迁移学习应用，提出了一种名为FT-MDN-Transformer的混合密度表格Transformer架构，用于在异构特征空间下预测贷款回收率。论文的核心是迁移学习在金融领域的应用，而非大模型或深度学习技术原理的创新。所有评分关键词均与大模型、深度学习技术原理或AI在科学领域的应用直接相关，而本文研究的是传统的表格数据迁移学习问题，未涉及任何大模型技术、深度学习创新或AI在科学（如生物信息学、化学信息学）中的应用，因此所有关键词的相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文研究了在数据稀缺和分布偏移条件下，使用迁移学习预测贷款回收率的问题，并提出了一种混合密度表格Transformer模型，实验表明该模型在目标域数据有限时优于基线模型，特别是在协变量和条件偏移下表现更佳。

摘要翻译

回收率（Recovery Rate, RR）的准确预测对于信用风险管理和监管资本计量至关重要。然而，在许多贷款组合中，由于违约事件发生频率低，RR建模常受限于数据稀缺问题。迁移学习（Transfer Learning, TL）通过利用来自相关但更丰富的源域信息，为缓解这一挑战提供了有前景的途径；但其有效性关键取决于分布偏移的存在与强度，以及源域与目标域特征空间之间潜在的异质性。
本文提出了FT-MDN-Transformer，一种专门为异质特征集下的RR预测迁移学习设计的混合密度表格Transformer架构。该模型能够同时生成贷款级别的点估计与组合级别的预测分布，从而支持广泛的RR预测实际应用。我们通过受控的蒙特卡洛模拟评估了所提出的方法，该模拟便于系统改变协变量偏移、条件偏移和标签偏移；同时也在真实迁移场景中进行了评估，使用全球信用数据（Global Credit Data, GCD）贷款数据集作为源域，并以一个新的债券数据集作为目标域。
研究结果表明，当目标域数据有限时，FT-MDN-Transformer优于基线模型，在协变量偏移和条件偏移下收益尤为显著，而标签偏移仍具挑战性。我们还观察到其概率预测能够紧密跟踪经验回收分布，提供了比传统点预测指标更丰富的信息。总体而言，这些发现凸显了具有分布感知能力的迁移学习架构在改善数据稀缺信用组合中RR预测的潜力，并为在异质数据环境下运作的风险管理者提供了实用见解。

摘要 (Abstract)

Accurate forecasting of recovery rates (RR) is central to credit risk management and regulatory capital determination. In many loan portfolios, however, RR modeling is constrained by data scarcity arising from infrequent default events. Transfer learning (TL) offers a promising avenue to mitigate this challenge by exploiting information from related but richer source domains, yet its effectiveness critically depends on the presence and strength of distributional shifts, and on potential heterogeneity between source and target feature spaces. This paper introduces FT-MDN-Transformer, a mixture-density tabular Transformer architecture specifically designed for TL in RR forecasting across heterogeneous feature sets. The model produces both loan-level point estimates and portfolio-level predictive distributions, thereby supporting a wide range of practical RR forecasting applications. We evaluate the proposed approach in a controlled Monte Carlo simulation that facilitates systematic variation of covariate, conditional, and label shifts, as well as in a real-world transfer setting using the Global Credit Data (GCD) loan dataset as source and a novel bonds dataset as target. Our results show that FT-MDN-Transformer outperforms baseline models when target-domain data are limited, with particularly pronounced gains under covariate and conditional shifts, while label shift remains challenging. We also observe its probabilistic forecasts to closely track empirical recovery distributions, providing richer information than conventional point-prediction metrics alone. Overall, the findings highlight the potential of distribution-aware TL architectures to improve RR forecasting in data-scarce credit portfolios and offer practical insights for risk managers operating under heterogeneous data environments.

关键词: Transfer Learning, Loan Recovery Prediction, Distribution Shifts, Heterogeneous Feature Spaces, Mixture-Density Transformer, Credit Risk Management, Monte Carlo Simulation, Global Credit Data

221. ❌ Structure-Aware Commitment Reduction for Network-Constrained Unit Commitment with Solver-Preserving Guarantees

作者: Guangwen Wang, Jiaqi Wu, Yang Weng, Baosen Zhang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02788v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文主要研究电力系统优化中的网络约束机组组合问题，提出了一种基于结构感知的降维框架。论文明确提到使用LLM（大语言模型）来选择稀疏的二进制变量子集，因此与’Large Language Models OR LLMs OR Foundation Models’高度相关（8分）。论文属于AI在科学领域的应用（电力系统优化），因此与’AI for Science OR Bioinformatics OR Cheminformatics’有一定关联（5分）。论文未涉及其他关键词的具体技术内容，如MoE、SLMs、训练方法、推理优化、代理系统等，因此这些关键词评分为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种结构感知的降维框架，利用大语言模型辅助选择稀疏的二进制变量子集，以加速网络约束机组组合问题的求解，在保持接近最优解的同时实现了数量级的速度提升。

摘要翻译

随着个体发电单元、混合资源及安全约束的不断增多，网络约束机组组合问题的计算负担显著加重，其大部分求解时间消耗在基于机组-小时二元变量的分支定界树搜索上。为降低这一组合负担，近期研究开始探索基于学习的引导方法来辅助机组启停决策。然而，直接使用大型语言模型等工具预测完整的机组组合方案并不可靠，因为不可行或不一致的二元决策可能违反时序约束并损害经济最优性。本文提出一种与求解器兼容的机组组合降维框架，该框架利用机组启停决策中的结构规律性。该框架不生成完整调度方案，而是在优化前识别结构稳定的稀疏二元变量子集并将其固定。其中一种实现方式使用大型语言模型来选择这些变量。大型语言模型并不替代优化过程，而是提供部分变量限定，而所有约束及剩余决策仍由原混合整数线性规划求解器处理，继续严格满足网络、爬坡、备用及安全约束。我们通过形式化证明表明，经掩码处理的问题定义了原机组组合模型的一个缩减可行域，从而保留了可行性，并确保在限定空间内获得求解器可验证的最优解。在IEEE 57节点、RTS 73节点、IEEE 118节点及增强型大规模算例（包括安全约束变体）上的实验表明，该方法能持续减少分支定界节点数与求解时间，在高复杂度场景中实现数量级加速，同时保持接近最优的目标函数值。

摘要 (Abstract)

The growing number of individual generating units, hybrid resources, and security constraints has significantly increased the computational burden of network-constrained unit commitment (UC), where most solution time is spent exploring branch-and-bound trees over unit-hour binary variables. To reduce this combinatorial burden, recent approaches have explored learning-based guidance to assist commitment decisions. However, directly using tools such as large language models (LLMs) to predict full commitment schedules is unreliable, as infeasible or inconsistent binary decisions can violate inter-temporal constraints and degrade economic optimality. This paper proposes a solver-compatible dimensionality reduction framework for UC that exploits structural regularities in commitment decisions. Instead of generating complete schedules, the framework identifies a sparse subset of structurally stable commitment binaries to fix prior to optimization. One implementation uses an LLM to select these variables. The LLM does not replace the optimization process but provides partial variable restriction, while all constraints and remaining decisions are handled by the original MILP solver, which continues to enforce network, ramping, reserve, and security constraints. We formally show that the masked problem defines a reduced feasible region of the original UC model, thereby preserving feasibility and enabling solver-certified optimality within the restricted space. Experiments on IEEE 57-bus, RTS 73-bus, IEEE 118-bus, and augmented large-scale cases, including security-constrained variants, demonstrate consistent reductions in branch-and-bound nodes and solution time, achieving order-of-magnitude speedups on high-complexity instances while maintaining near-optimal objective values.

关键词: network-constrained unit commitment, structure-aware commitment reduction, large language models, MILP solver, branch-and-bound, dimensionality reduction, solver-preserving guarantees, computational efficiency

222. ❌ Towards Realistic Class-Incremental Learning with Free-Flow Increments

作者: Zhiming Xu, Baile Xu, Jian Zhao, Furao Shen, Suorong Yang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02765v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是类增量学习（Class-Incremental Learning, CIL）中的Free-Flow CIL（FFCIL）问题，提出了一种模型无关的框架来应对新类数量高度可变的数据流。论文内容完全聚焦于传统的机器学习/深度学习中的增量学习问题，没有涉及任何大语言模型（LLM）、大模型技术原理、AI for Science应用或相关关键词。所有关键词均与大模型、深度学习技术原理或科学AI应用相关，而本文是传统的计算机视觉/机器学习任务，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文提出了Free-Flow Class-Incremental Learning（FFCIL）这一更现实的类增量学习设定，其中新类数量高度可变，并开发了一个包含类平均目标和方法调整的模型无关框架来稳定学习过程，实验表明该框架能有效提升多种CIL基线在FFCIL下的性能。

摘要翻译

类增量学习（Class-incremental learning, CIL）通常在预定义的、任务规模均等的调度方案下进行评估，而更现实且复杂的场景尚未得到充分探索。然而，一个实用的CIL系统应能在任意数量的新类别到达时立即进行学习，无需强制限定任务规模。我们将这一设定形式化为自由流类增量学习（Free-Flow Class-Incremental Learning, FFCIL），其中数据以更贴近现实的流式方式到达，每一步中未见类别的数量具有高度可变性。这将导致许多现有CIL方法变得脆弱，并引发明显的性能下降。我们提出了一种模型无关的框架，用于在自由流到达场景下实现稳健的CIL学习。该框架包含类均值（class-wise mean, CWM）目标，该目标以均匀聚合的类条件监督替代了基于样本频率加权的损失，从而在自由流类别增量过程中稳定学习信号；同时还包括针对代表性CIL范式的适应性调整以增强其鲁棒性。具体而言，我们将蒸馏约束于回放数据，对对比损失与知识迁移损失的尺度进行归一化，并引入动态干预权重对齐（Dynamic Intervention Weight Alignment, DIWA）以防止因小规模类别增量带来的不稳定统计量所导致的过度调整。实验证实，在FFCIL设定下，多种CIL基线方法均出现明显的性能下降，而我们的策略则带来了持续的性能提升。

摘要 (Abstract)

Class-incremental learning (CIL) is typically evaluated under predefined schedules with equal-sized tasks, leaving more realistic and complex cases unexplored. However, a practical CIL system should learns immediately when any number of new classes arrive, without forcing fixed-size tasks. We formalize this setting as Free-Flow Class-Incremental Learning (FFCIL), where data arrives as a more realistic stream with a highly variable number of unseen classes each step. It will make many existing CIL methods brittle and lead to clear performance degradation. We propose a model-agnostic framework for robust CIL learning under free-flow arrivals. It comprises a class-wise mean (CWM) objective that replaces sample frequency weighted loss with uniformly aggregated class-conditional supervision, thereby stabilizing the learning signal across free-flow class increments, as well as method-wise adjustments that improve robustness for representative CIL paradigms. Specifically, we constrain distillation to replayed data, normalize the scale of contrastive and knowledge transfer losses, and introduce Dynamic Intervention Weight Alignment (DIWA) to prevent over-adjustment caused by unstable statistics from small class increments. Experiments confirm a clear performance degradation across various CIL baselines under FFCIL, while our strategies yield consistent gains.

关键词: Class-incremental learning, Free-Flow CIL, Incremental learning, Continual learning, Class-wise mean objective, Dynamic Intervention Weight Alignment, Robust learning, Data stream

223. ❌ STDDN: A Physics-Guided Deep Learning Framework for Crowd Simulation

作者: Zijin Liu, Xu Geng, Wenshuai Xu, Xiang Zhao, Yan Xia, You Song 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02756v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文STDDN专注于人群模拟的物理引导深度学习框架，使用Neural ODE和流体动力学约束来改进轨迹预测。所有评分关键词均与大语言模型（LLMs）、模型训练技术、推理优化、AI对齐、智能体系统等大模型核心技术相关，而本文研究的是特定领域（人群模拟）的深度学习应用，未涉及任何大语言模型技术、训练方法或相关创新。因此，所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种物理引导的深度学习框架STDDN，通过引入流体动力学连续性方程作为物理约束，使用神经常微分方程建模宏观密度演化，解决了人群模拟中微观轨迹预测的误差积累和计算效率问题，在多个真实数据集上实现了优于现有方法的仿真性能和显著降低的推理延迟。

摘要翻译

精确的人群模拟对于公共安全管理、应急疏散规划和智能交通系统至关重要。然而，现有方法通常将人群建模为独立个体轨迹的集合，在捕捉宏观物理规律方面存在局限。这种微观方法常导致误差累积并损害模拟稳定性。此外，深度学习驱动的方法往往存在推理效率低、计算开销高的问题，难以适用于大规模高效模拟。为应对这些挑战，我们提出时空解耦微分方程网络（Spatio-Temporal Decoupled Differential Equation Network, STDDN），这是一种通过宏观物理规律指导微观轨迹预测的新型框架。我们创新性地引入流体动力学中的连续性方程作为强物理约束，采用神经常微分方程（Neural Ordinary Differential Equation, Neural ODE）对个体运动驱动的宏观密度演化进行建模，从而对微观轨迹预测模型进行物理正则化。我们设计了一个密度-速度耦合动态图学习模块，用于在神经常微分方程内构建密度场的导数，有效缓解误差累积。同时，我们提出可微分密度映射模块以消除离散化导致的梯度不连续问题，并引入跨网格检测模块来精确建模个体跨网格运动对局部密度变化的影响。在四个真实世界数据集上的长期任务测试中，所提出的STDDN方法相比现有先进技术展现出显著优越的模拟性能，同时推理延迟大幅降低。

摘要 (Abstract)

Accurate crowd simulation is crucial for public safety management, emergency evacuation planning, and intelligent transportation systems. However, existing methods, which typically model crowds as a collection of independent individual trajectories, are limited in their ability to capture macroscopic physical laws. This microscopic approach often leads to error accumulation and compromises simulation stability. Furthermore, deep learning-driven methods tend to suffer from low inference efficiency and high computational overhead, making them impractical for large-scale, efficient simulations. To address these challenges, we propose the Spatio-Temporal Decoupled Differential Equation Network (STDDN), a novel framework that guides microscopic trajectory prediction with macroscopic physics. We innovatively introduce the continuity equation from fluid dynamics as a strong physical constraint. A Neural Ordinary Differential Equation (Neural ODE) is employed to model the macroscopic density evolution driven by individual movements, thereby physically regularizing the microscopic trajectory prediction model. We design a density-velocity coupled dynamic graph learning module to formulate the derivative of the density field within the Neural ODE, effectively mitigating error accumulation. We also propose a differentiable density mapping module to eliminate discontinuous gradients caused by discretization and introduce a cross-grid detection module to accurately model the impact of individual cross-grid movements on local density changes. The proposed STDDN method has demonstrated significantly superior simulation performance compared to state-of-the-art methods on long-term tasks across four real-world datasets, as well as a major reduction in inference latency.

关键词: crowd simulation, physics-guided deep learning, Neural ODE, continuity equation, trajectory prediction, density-velocity coupling, inference efficiency, spatio-temporal modeling

224. ❌ Understanding Latent Diffusability via Fisher Geometry

作者: Jing Gu, Morteza Mardani, Wonjun Lee, Dongmian Zou, Gilad Lerman 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02751v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究扩散模型在潜在空间中的可扩散性问题，提出基于Fisher信息几何的理论框架来量化潜在空间扩散能力，并分解为Fisher信息和Fisher信息率贡献。所有评分关键词均与大语言模型、模型训练优化、推理加速、对齐、代理系统等大模型技术相关，而本文专注于扩散模型的几何分析，与这些关键词无直接关联。

!!! tip deepseek-chat TL;DR

该论文研究了扩散模型在潜在空间中性能下降的原因，通过Fisher几何框架量化潜在空间可扩散性，并分解为Fisher信息和Fisher信息率贡献，提出了诊断和缓解潜在扩散失败的理论条件和度量方法。

摘要翻译

扩散模型在潜在空间（如变分自编码器VAEs）中训练时性能常出现下降，但其形式化成因尚未得到充分理解。我们通过沿扩散轨迹的最小均方误差变化率来量化潜在空间的可扩散性。本框架将该MMSE变化率分解为费舍尔信息与费舍尔信息变化率的贡献。研究表明，虽然全局等距性保证FI对齐，但FIR由编码器的局部几何特性决定。我们的分析将潜在几何畸变显式解耦为三个可量化惩罚项：维度压缩、切向畸变与曲率注入。我们推导了跨空间保持FIR的理论条件，从而确保可扩散性得以维持。跨多种自编码架构的实验验证了本框架的有效性，并证明这些高效的FI与FIR指标可作为识别和缓解潜在扩散失败的稳健诊断工具集。

摘要 (Abstract)

Diffusion models often degrade when trained in latent spaces (e.g., VAEs), yet the formal causes remain poorly understood. We quantify latent-space diffusability through the rate of change of the Minimum Mean Squared Error (MMSE) along the diffusion trajectory. Our framework decomposes this MMSE rate into contributions from Fisher Information (FI) and Fisher Information Rate (FIR). We demonstrate that while global isometry ensures FI alignment, FIR is governed by the encoder’s local geometric properties. Our analysis explicitly decouples latent geometric distortion into three measurable penalties: dimensional compression, tangential distortion, and curvature injection. We derive theoretical conditions for FIR preservation across spaces, ensuring maintained diffusability. Experiments across diverse autoencoding architectures validate our framework and establish these efficient FI and FIR metrics as a robust diagnostic suite for identifying and mitigating latent diffusion failure.

关键词: Diffusion Models, Latent Space, Fisher Information, Fisher Information Rate, Geometric Distortion, Autoencoding Architectures, MMSE Rate, Diffusability

225. ❌ State estimations and noise identifications with intermittent corrupted observations via Bayesian variational inference

作者: Peng Sun, Ruoyu Wang, Xue Luo 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02738v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于分布式传感器网络中的状态估计问题，采用贝叶斯变分推理和自适应卡尔曼滤波方法处理数据包丢失、观测损坏和未知噪声协方差，属于传统信号处理、控制理论和统计推断领域，与所有大模型、深度学习、AI科学应用等关键词完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种变分贝叶斯自适应卡尔曼滤波器（VB-AKF），用于解决分布式传感器网络中同时存在间歇性数据包丢失、损坏观测和未知噪声协方差的状态估计问题，并通过数值实验验证了该方法的有效性和渐近最优性。

摘要翻译

本文针对分布式传感器网络中同时存在间歇性数据包丢失、观测值损坏与未知噪声协方差的状态估计问题展开研究。为应对这一挑战，我们将系统状态、噪声参数及网络可靠性的联合估计建模为贝叶斯变分推断问题，并提出一种新型变分贝叶斯自适应卡尔曼滤波器（Variational Bayesian Adaptive Kalman Filter, VB-AKF）以近似潜在参数的联合后验概率密度。与现有分别处理缺失数据和测量异常值的自适应卡尔曼滤波器不同，所提出的VB-AKF采用包含两个独立伯努利随机变量的双掩码生成模型，显式刻画了可观测的通信丢失与潜在数据真实性。此外，该滤波器将多节点并发观测集成到自适应滤波框架中，显著提升了统计可辨识性。综合数值实验验证了所提方法的有效性与渐近最优性，结果表明随着传感器数量的增加，参数辨识与状态估计均渐近收敛至理论最优下界。

摘要 (Abstract)

This paper focuses on the state estimation problem in distributed sensor networks, where intermittent packet dropouts, corrupted observations, and unknown noise covariances coexist. To tackle this challenge, we formulate the joint estimation of system states, noise parameters, and network reliability as a Bayesian variational inference problem, and propose a novel variational Bayesian adaptive Kalman filter (VB-AKF) to approximate the joint posterior probability densities of the latent parameters. Unlike existing AKF that separately handle missing data and measurement outliers, the proposed VB-AKF adopts a dual-mask generative model with two independent Bernoulli random variables, explicitly characterizing both observable communication losses and latent data authenticity. Additionally, the VB-AKF integrates multiple concurrent multiple observations into the adaptive filtering framework, which significantly enhances statistical identifiability. Comprehensive numerical experiments verify the effectiveness and asymptotic optimality of the proposed method, showing that both parameter identification and state estimation asymptotically converge to the theoretical optimal lower bound with the increase in the number of sensors.

关键词: state estimation, distributed sensor networks, variational Bayesian inference, adaptive Kalman filter, intermittent packet dropouts, corrupted observations, noise identification, asymptotic optimality

226. ❌ FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving

作者: Qingxiu Liu, Cyril Y. He, Hanser Jiang, Zion Wang, Alan Zhao, Patrick P. C. Lee 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02715v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文FluxMoE专注于MoE（Mixture of Experts）模型的推理系统优化，核心贡献是提出一种解耦专家参数与GPU常驻内存的方法，以提升推理吞吐量。因此，与’Mixture of Experts OR MoE OR Sparse Models’高度相关（15分），因为这是论文的核心主题。同时，论文涉及KV缓存（KV Cache）的管理和推理加速，与’KV Cache Compression OR Linear Attention OR FlashAttention’和’Speculative Decoding OR Inference Acceleration’相关（各10分）。论文也基于大语言模型（LLMs）背景，与’Large Language Models OR LLMs OR Foundation Models’相关（10分）。其他关键词如小模型、训练方法、对齐、代理等均未在论文中涉及，得0分。

!!! tip deepseek-chat TL;DR

FluxMoE通过解耦专家参数与GPU常驻内存，优化MoE模型的推理系统，在内存受限条件下实现了最高3.0倍的吞吐量提升，而不损害模型保真度。

摘要翻译

专家混合模型已成为扩展大语言模型的主流范式，但其快速增长的参数量在推理过程中引入了根本性的低效问题：大部分专家权重在GPU内存中保持闲置状态，却与性能关键运行时状态（如键值缓存）竞争内存资源。由于键值缓存的容量直接决定服务吞吐量，这种资源错配导致内存利用率不足和性能下降。本文提出FluxMoE——一种新型专家混合模型推理系统，该系统将专家参数从持久化GPU驻留中解耦。FluxMoE引入了专家分页抽象机制，将专家权重视为可流式传输的瞬时资源，按需将其物化至内存并在使用后立即驱逐，从而优先将GPU内存分配给对吞吐量至关重要的运行时状态。我们在vLLM推理引擎基础上实现了FluxMoE，使其能在严格内存限制下实现高效的专家混合模型推理。实验结果表明，在内存密集型场景中，FluxMoE相比vLLM可获得最高3.0倍的吞吐量提升，且不损害模型精度。

摘要 (Abstract)

Mixture-of-Experts (MoE) models have become a dominant paradigm for scaling large language models, but their rapidly growing parameter sizes introduce a fundamental inefficiency during inference: most expert weights remain idle in GPU memory while competing with performance-critical runtime state such as the key-value (KV) cache. Since KV cache capacity directly determines serving throughput, this mismatch leads to underutilized memory and degraded performance. In this paper, we present FluxMoE, a new MoE inference system that decouples expert parameters from persistent GPU residency. FluxMoE introduces an expert paging abstraction that treats expert weights as streamed, transient resources, materializing them on demand and evicting them immediately after use, allowing GPU memory to be preferentially allocated to throughput-critical runtime state. We implement FluxMoE atop vLLM to enable efficient MoE inference under severe memory constraints. Experimental results demonstrate that FluxMoE achieves up to 3.0$\times$ throughput gains over vLLM in memory-intensive regimes, without compromising model fidelity.

关键词: Mixture-of-Experts, MoE, inference system, GPU memory, KV cache, throughput, vLLM, expert paging

227. ❌ LieTrunc-QNN: Lie Algebra Truncation and Quantum Expressivity Phase Transition from LiePrune to Provably Stable Quantum Neural Networks

作者: Haijian Shao, Dalong Zhao, Xing Deng, Wenzheng Zhu, Yingtao Jiang 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02697v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究量子机器学习（QML）中的量子神经网络（QNN），核心是使用李代数理论解决量子电路训练中的梯度消失和噪声鲁棒性问题。所有关键词均针对大语言模型（LLM）和深度学习技术，而本文专注于量子计算领域，与LLM/深度学习技术原理、训练方法、推理优化、对齐、代理系统等均无直接关联。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为量子机器学习可视为AI在科学计算中的应用，但论文未明确涉及生物信息学或化学信息学，因此给予5分（有一定关联）。其他关键词均评0分（完全无关）。

!!! tip deepseek-chat TL;DR

该论文针对量子机器学习中梯度消失和噪声鲁棒性的核心挑战，提出了LieTrunc-QNN框架，通过李代数截断理论证明了多项式可训练性并实现了稳定的梯度保持。

摘要翻译

量子机器学习（QML）从根本上受到两大挑战的限制：贫瘠高原（梯度指数级消失）以及参数化量子电路在噪声下的脆弱性。尽管已有大量实证研究，但仍缺乏统一的理论框架。
我们提出LieTrunc-QNN，一种代数几何框架，通过李群生成的动力学来刻画可训练性。参数化量子电路被建模为u(2^n)的李子代数，其作用诱导出一个可达量子状态的黎曼流形。表达能力被重新解释为流形的内在维数与几何结构。
我们建立了一个几何容量-高原原理：由于测度集中效应，增加有效维度会导致梯度指数级抑制。通过限制在结构化的李子代数（LieTrunc）上，流形被压缩，从而防止测度集中并保持非退化梯度。
我们证明了两个主要结果：（1）LieTrunc-QNN的可训练性下界；（2）Fubini-Study度量秩受生成元代数张成空间的限制，表明表达能力由代数结构而非参数数量决定。紧致李子代数还提供了对扰动的固有鲁棒性。
重要的是，我们建立了一个多项式可训练性区域，其中梯度方差以多项式速率衰减而非指数速率。
实验（n=2-6）验证了该理论：LieTrunc-QNN保持了稳定的梯度和高有效维度，而随机截断则导致度量秩崩溃。在n=6时，完整的度量秩得以保持（秩=16）。结果支持梯度方差与有效维度之间的标度律。
这项工作为量子神经网络设计提供了一个统一的几何框架，将李代数、流形几何与优化理论联系起来。

摘要 (Abstract)

Quantum Machine Learning (QML) is fundamentally limited by two challenges: barren plateaus (exponentially vanishing gradients) and the fragility of parameterized quantum circuits under noise. Despite extensive empirical studies, a unified theoretical framework remains lacking. We introduce LieTrunc-QNN, an algebraic-geometric framework that characterizes trainability via Lie-generated dynamics. Parameterized quantum circuits are modeled as Lie subalgebras of u(2^n), whose action induces a Riemannian manifold of reachable quantum states. Expressivity is reinterpreted as intrinsic manifold dimension and geometry. We establish a geometric capacity-plateau principle: increasing effective dimension leads to exponential gradient suppression due to concentration of measure. By restricting to structured Lie subalgebras (LieTrunc), the manifold is contracted, preventing concentration and preserving non-degenerate gradients. We prove two main results: (1) a trainability lower bound for LieTrunc-QNN, and (2) that the Fubini-Study metric rank is bounded by the algebraic span of generators, showing expressivity is governed by structure rather than parameter count. Compact Lie subalgebras also provide inherent robustness to perturbations. Importantly, we establish a polynomial trainability regime where gradient variance decays polynomially instead of exponentially. Experiments (n=2-6) validate the theory: LieTrunc-QNN maintains stable gradients and high effective dimension, while random truncation leads to metric rank collapse. At n=6, full metric rank is preserved (rank=16). Results support a scaling law between gradient variance and effective dimension. This work provides a unified geometric framework for QNN design, linking Lie algebra, manifold geometry, and optimization.

关键词: Quantum Machine Learning, Quantum Neural Networks, Lie Algebra, Barren Plateaus, Gradient Stability, Trainability, Manifold Geometry, Fubini-Study Metric

228. ❌ Cross-subject Muscle Fatigue Detection via Adversarial and Supervised Contrastive Learning with Inception-Attention Network

作者: Zitao Lin, Chang Zhu, Wei Meng 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02670v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于使用深度学习（Inception-attention网络、对抗学习、监督对比学习）进行跨受试者肌肉疲劳检测，属于生物医学信号处理领域。所有关键词均与大语言模型（LLM）相关技术、训练方法、推理优化、对齐、代理系统等直接相关，而本文未涉及任何LLM技术，也未提及大模型在科学领域的应用。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，因为肌肉疲劳检测可视为生物信息学或AI在科学（生物医学）中的应用，但论文本身未明确提及这些术语，且核心是特定深度学习模型而非通用AI科学应用，因此给予5分（有一定关联）。其他关键词与论文内容完全无关，均得0分。

!!! tip deepseek-chat TL;DR

该研究提出了一种结合Inception-attention模块、对抗学习和监督对比学习的神经网络，用于解决跨受试者肌肉疲劳检测中特征不稳定的问题，在三分分类任务中达到了93.54%的准确率。

摘要翻译

肌肉疲劳检测在物理康复领域具有重要作用。先前研究表明，相较于其他生物信号，表面肌电信号在检测肌肉疲劳方面具有更高的敏感性。然而，从表面肌电信号中提取的特征在动态收缩期间及不同受试者间可能存在差异，这导致疲劳检测结果不稳定。为应对这些挑战，本研究提出一种新型神经网络，其包含作为特征提取器的Inception-attention模块、疲劳分类器以及配备梯度反转层的域分类器。该集成域分类器促使网络学习具有受试者不变性的通用疲劳特征，同时最小化受试者特异性特征。此外，本研究还采用监督对比损失函数以增强模型的泛化能力。实验结果表明，所提模型在三分类任务中取得了优异性能，准确率达到93.54%，召回率达92.69%，F1分数达92.69%，为跨受试者肌肉疲劳检测提供了稳健解决方案，对康复训练与辅助具有重要指导意义。

摘要 (Abstract)

Muscle fatigue detection plays an important role in physical rehabilitation. Previous researches have demonstrated that sEMG offers superior sensitivity in detecting muscle fatigue compared to other biological signals. However, features extracted from sEMG may vary during dynamic contractions and across different subjects, which causes unstability in fatigue detection. To address these challenges, this research proposes a novel neural network comprising an Inception-attention module as a feature extractor, a fatigue classifier and a domain classifier equipped with a gradient reversal layer. The integrated domain classifier encourages the network to learn subject-invariant common fatigue features while minimizing subject-specific features. Furthermore, a supervised contrastive loss function is also employed to enhance the generalization capability of the model. Experimental results demonstrate that the proposed model achieved outstanding performance in three-class classification tasks, reaching 93.54% accuracy, 92.69% recall and 92.69% F1-score, providing a robust solution for cross-subject muscle fatigue detection, offering significant guidance for rehabilitation training and assistance.

关键词: muscle fatigue detection, sEMG, cross-subject, adversarial learning, supervised contrastive learning, Inception-attention network, domain adaptation, rehabilitation

229. ❌ A Numerical Method for Coupling Parameterized Physics-Informed Neural Networks and FDM for Advanced Thermal-Hydraulic System Simulation

作者: Jeesuk Shin, Donggyun Seo, Sihyeong Yu, Joongoo Jeon 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02663v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究的是参数化物理信息神经网络（PINNs）与有限差分法（FDM）的耦合方法，用于核热工水力系统模拟。论文核心是数值方法和物理建模，而非大语言模型或深度学习技术原理的创新。所有关键词（除最后一个外）都直接涉及大语言模型、深度学习技术或相关概念，与论文内容完全无关。最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’与论文有一定关联，因为论文属于AI在科学（核工程）领域的应用，但论文未涉及生物信息学或化学信息学，且AI应用是物理建模而非大模型，因此给5分（有一定关联）。

!!! tip deepseek-chat TL;DR

该论文针对核安全评估中系统级代码计算成本高的问题，开发了一种参数化物理信息神经网络与有限差分法耦合的P2F方法，实现了无需重训练和数据的高精度热工水力系统模拟。

摘要翻译

使用MELCOR等系统级程序进行严重事故分析是核安全评估不可或缺的环节，然而重复模拟的计算成本构成了参数化研究和不确定性量化的显著瓶颈。现有代理模型虽能加速此类分析，但依赖于大量模拟数据；而物理信息神经网络（PINNs）虽可实现无数据训练，却必须在问题参数每次变更时重新训练。本研究通过开发参数化PINNs耦合有限差分法（P2F）方法，同时解决了这两项局限。P2F是一种针对MELCOR控制体流体力学/流道（CVH/FP）模块设计的节点分配混合框架。在该方法中，参数化的节点分配PINN（NA-PINN）以水位差、初始速度和时间作为输入，通过学习解流形，使得单个训练完成的网络无需重新训练即可作为适用于所有流道的动量守恒方程的无数据代理模型。该PINN与有限差分法（FDM）求解器耦合，后者在每个时间步推进质量守恒方程的求解，在确保精确离散质量守恒的同时，将迭代非线性动量求解替换为单次前向传播。在六水箱重力驱动排水场景的验证中，名义工况下（$Δt = 1.0$ s）水位平均绝对误差为$7.85 \times 10^{-5}$ m，速度平均绝对误差为$3.21 \times 10^{-3}$ m/s。该框架在0.2至1.0 s的时间步长范围内保持一致的精度，并可泛化至五种不同的初始条件，整个过程无需重新训练或使用模拟数据。本研究提出了一种在核热工水力系统程序框架内将参数化PINNs与FDM相融合的数值耦合方法。

摘要 (Abstract)

Severe accident analysis using system-level codes such as MELCOR is indispensable for nuclear safety assessment, yet the computational cost of repeated simulations poses a significant bottleneck for parametric studies and uncertainty quantification. Existing surrogate models accelerate these analyses but depend on large volumes of simulation data, while physics-informed neural networks (PINNs) enable data-free training but must be retrained for every change in problem parameters. This study addresses both limitations by developing the Parameterized PINNs coupled with FDM (P2F) method, a node-assigned hybrid framework for MELCOR’s Control Volume Hydrodynamics/Flow Path (CVH/FP) module. In the P2F method, a parameterized Node-Assigned PINN (NA-PINN) accepts the water-level difference, initial velocity, and time as inputs, learning a solution manifold so that a single trained network serves as a data-free surrogate for the momentum conservation equation across all flow paths without retraining. This PINN is coupled with a finite difference method (FDM) solver that advances the mass conservation equation at each time step, ensuring exact discrete mass conservation while replacing the iterative nonlinear momentum solve with a single forward pass. Verification on a six-tank gravity-driven draining scenario yields a water level mean absolute error of $7.85 \times 10^{-5}$ m and a velocity mean absolute error of $3.21 \times 10^{-3}$ m/s under the nominal condition with $Δt = 1.0$ s. The framework maintains consistent accuracy across time steps ranging from 0.2 to 1.0 s and generalizes to five distinct initial conditions, all without retraining or simulation data. This work introduces a numerical coupling methodology for integrating parameterized PINNs with FDM within a nuclear thermal-hydraulic system code framework.

关键词: Parameterized PINNs, Finite Difference Method, Thermal-Hydraulic Simulation, Nuclear Safety Assessment, Data-Free Surrogate Model, Momentum Conservation, Mass Conservation, Node-Assigned Hybrid Framework

230. ❌ Product-Stability: Provable Convergence for Gradient Descent on the Edge of Stability

作者: Eric Gan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02653v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究梯度下降在稳定性边缘（Edge of Stability）的收敛性理论，属于深度学习优化理论领域。论文内容聚焦于损失函数结构（product-stability）、梯度下降算法和收敛性证明，不涉及大模型、语言模型、微调、对齐、推理加速、AI for Science等关键词相关的具体技术或应用。所有关键词均与论文核心内容无关，因此全部评分为0分。

!!! tip deepseek-chat TL;DR

该论文研究了梯度下降在稳定性边缘（EoS）训练时的收敛性问题，通过引入product-stability这一损失函数结构性质，证明了对于形式为l(xy)的目标函数，梯度下降即使在EoS机制下也能收敛到局部最小值，并解释了稳定振荡的出现。

摘要翻译

经验表明，现代深度学习训练常发生于稳定性边界（Edge of Stability，EoS）区域，此时损失的锐度超过了经典收敛分析所适用的阈值。尽管近期研究已取得进展，现有关于EoS的理论解释要么依赖于严格假设，要么局限于特定的平方损失型目标函数。本文中，我们引入并研究了一种称为乘积稳定性（product-stability）的损失函数结构性质。我们证明，对于具有乘积稳定极小值的损失函数，即使训练处于EoS状态，应用于形如$(x,y) \mapsto l(xy)$目标的梯度下降法仍能可证明地收敛至局部极小值。该框架显著推广了先前结果，适用于包括二元交叉熵在内的广泛损失函数类别。通过分岔图分析，我们刻画了由此产生的训练动态，解释了稳定振荡现象的出现，并精确量化了收敛时的锐度。综合而言，我们的研究为更广泛损失函数类别下的稳定EoS训练提供了原理性解释。

摘要 (Abstract)

Empirically, modern deep learning training often occurs at the Edge of Stability (EoS), where the sharpness of the loss exceeds the threshold below which classical convergence analysis applies. Despite recent progress, existing theoretical explanations of EoS either rely on restrictive assumptions or focus on specific squared-loss-type objectives. In this work, we introduce and study a structural property of loss functions that we term product-stability. We show that for losses with product-stable minima, gradient descent applied to objectives of the form $(x,y) \mapsto l(xy)$ can provably converge to the local minimum even when training in the EoS regime. This framework substantially generalizes prior results and applies to a broad class of losses, including binary cross entropy. Using bifurcation diagrams, we characterize the resulting training dynamics, explain the emergence of stable oscillations, and precisely quantify the sharpness at convergence. Together, our results offer a principled explanation for stable EoS training for a wider class of loss functions.

关键词: gradient descent, Edge of Stability, convergence analysis, product-stability, loss functions, bifurcation diagrams, training dynamics, sharpness

231. ❌ Transfer Learning for Meta-analysis Under Covariate Shift

作者: Zilong Wang, Ali Abdeen, Turgay Ayer 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02656v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究的是统计学和医学研究领域的元分析方法，专注于处理随机对照试验中的协变量偏移问题，提出了一种基于安慰剂锚定的传输框架。论文内容完全围绕统计方法、因果推断和临床试验分析，没有涉及任何大语言模型、深度学习、AI技术或相关概念。所有关键词都针对大模型和深度学习技术及其应用，因此与这篇统计学论文完全无关。

!!! tip deepseek-chat TL;DR

该论文针对随机对照试验中协变量偏移导致标准元分析方法失效的问题，提出了一种基于安慰剂锚定的传输框架，在连接和非连接目标设置下都能有效估计患者层面的异质性治疗效果，并在实验中表现出优于基线方法的性能。

摘要翻译

随机对照试验通常无法代表决策所面向的总体，且研究间的协变量偏移可能使标准个体参与者数据（IPD）荟萃分析与迁移估计量失效。我们提出一种安慰剂锚定迁移框架，将源试验结局视为丰富的代理信号，而将目标试验安慰剂结局作为稀缺的高保真黄金标签，以校准基线风险。一种低复杂度（稀疏）校正方法将代理结局模型锚定至目标总体，并将锚定模型嵌入交叉拟合的双稳健学习器中，从而在可获得目标治疗组结局时，为患者层面的异质性治疗效果提供一个奈曼正交、目标位点双稳健的估计量。我们区分两种情形：在连接目标（设有治疗组）中，该方法可得到目标可识别的效应估计；在断开目标（仅含安慰剂组）中，该方法简化为一种在明确的工作模型迁移假设下有原则的筛选-然后-迁移流程。通过合成数据与半合成IHDP基准的实验，我们评估了点态条件平均处理效应（CATE）准确性、平均处理效应（ATE）误差、目标排序质量、决策理论策略遗憾以及校准度。在所有连接设置中，所提方法表现最佳或接近最佳，且在目标样本量较小时，相较于仅使用代理、仅使用目标以及迁移基线方法有显著提升；在断开设置中，该方法在目标排序方面保持强劲性能，而点态准确性则取决于工作迁移条件的强度。

摘要 (Abstract)

Randomized controlled trials often do not represent the populations where decisions are made, and covariate shift across studies can invalidate standard IPD meta-analysis and transport estimators. We propose a placebo-anchored transport framework that treats source-trial outcomes as abundant proxy signals and target-trial placebo outcomes as scarce, high-fidelity gold labels to calibrate baseline risk. A low-complexity (sparse) correction anchors proxy outcome models to the target population, and the anchored models are embedded in a cross-fitted doubly robust learner, yielding a Neyman-orthogonal, target-site doubly robust estimator for patient-level heterogeneous treatment effects when target treated outcomes are available. We distinguish two regimes: in connected targets (with a treated arm), the method yields target-identified effect estimates; in disconnected targets (placebo-only), it reduces to a principled screen–then–transport procedure under explicit working-model transport assumptions. Experiments on synthetic data and a semi-synthetic IHDP benchmark evaluate pointwise CATE accuracy, ATE error, ranking quality for targeting, decision-theoretic policy regret, and calibration. Across connected settings, the proposed method is best or near-best and improves substantially over proxy-only, target-only, and transport baselines at small target sample sizes; in disconnected settings, it retains strong ranking performance for targeting while pointwise accuracy depends on the strength of the working transport condition.

关键词: meta-analysis, covariate shift, transfer learning, randomized controlled trials, heterogeneous treatment effects, doubly robust estimator, placebo-anchored transport, CATE estimation

232. ❌ Conditional Sampling via Wasserstein Autoencoders and Triangular Transport

作者: Mohammad Al-Jarrah, Michele Martino, Marcus Yim, Bamdad Hosseini, Amirhossein Taghvaei 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02644v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是条件采样框架（Conditional Wasserstein Autoencoders），属于生成模型和最优传输理论领域，与所有评分关键词（均围绕大模型、深度学习技术原理及其应用）完全无关。论文未涉及任何大模型技术、训练方法、推理优化、对齐、应用场景等关键词内容。

!!! tip deepseek-chat TL;DR

该论文提出了条件Wasserstein自编码器（CWAEs）框架，用于条件模拟，通过利用条件和被条件变量的低维结构，在低维支撑的条件测度问题上显著降低了近似误差。

摘要翻译

本文提出条件Wasserstein自编码器（CWAEs），这是一种用于条件模拟的框架，旨在同时利用被条件变量与条件变量的低维结构。其核心思想是改进Wasserstein自编码器，使其采用（分块）三角解码器结构，并对隐变量施加适当的独立性假设。我们证明，所得模型能够构建一种既能利用低维结构、其解码器又可用于条件模拟的自编码器。我们探讨了CWAEs的多项理论性质，包括其与条件最优传输（Conditional Optimal Transport, OT）问题的关联。同时，我们提出了多种替代形式，由此衍生出构成算法基础的三种架构变体。通过一系列数值实验，我们证明相较于低秩集合卡尔曼滤波（Low-Rank Ensemble Kalman Filter, LREnKF），不同的CWAE变体均能显著降低近似误差，在条件测度支撑集确实为低维的问题中效果尤为突出。

摘要 (Abstract)

We present Conditional Wasserstein Autoencoders (CWAEs), a framework for conditional simulation that exploits low-dimensional structure in both the conditioned and the conditioning variables. The key idea is to modify a Wasserstein autoencoder to use a (block-) triangular decoder and impose an appropriate independence assumption on the latent variables. We show that the resulting model gives an autoencoder that can exploit low-dimensional structure while simultaneously the decoder can be used for conditional simulation. We explore various theoretical properties of CWAEs, including their connections to conditional optimal transport (OT) problems. We also present alternative formulations that lead to three architectural variants forming the foundation of our algorithms. We present a series of numerical experiments that demonstrate that our different CWAE variants achieve substantial reductions in approximation error relative to the low-rank ensemble Kalman filter (LREnKF), particularly in problems where the support of the conditional measures is truly low-dimensional.

关键词: Conditional Wasserstein Autoencoders, conditional simulation, low-dimensional structure, optimal transport, decoder, latent variables, approximation error, ensemble Kalman filter

233. ❌ Complex-Valued GNNs for Distributed Basis-Invariant Control of Planar Systems

作者: Samuel Honor, Mohamed Abdelnaby, Kevin Leahy 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02615v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究的是图神经网络（GNN）在分布式控制系统中的应用，特别是通过复值表示实现局部坐标系不变性。所有评分关键词都直接与大语言模型（LLM）或深度学习在科学领域的特定应用（如生物信息学）相关，而本文专注于GNN在机器人控制/多智能体系统中的应用，属于传统深度学习范畴，未涉及LLM、MoE、量化、对齐、推理加速等大模型核心技术或AI for Science的特定领域。因此，所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了一种复值图神经网络架构，用于实现分布式平面系统的局部坐标系不变控制，在模仿学习编队任务中相比实值基线提高了数据效率、跟踪性能和泛化能力。

摘要翻译

图神经网络因其可分布式部署的特性，已成为学习控制网络化动态系统的常用工具。然而，现有的分布式图神经网络架构假设网络中所有节点均在兼容的基坐标系下采集几何观测数据，这限制了此类控制器在无GPS与无罗盘环境中的实用性。本文提出一种对局部基坐标系选择具有全局不变性的图神经网络参数化方法。该方法将二维几何特征及基坐标系间的变换在复数域中表达。在每个图神经网络层内部，采用具有相位等变性激活函数的复数值线性层。当从固定的全局坐标系视角观察时，该架构学习的所有控制策略对局部坐标系的选择严格保持不变。通过模仿学习集群控制任务的实验表明，与实数基线模型相比，该架构显著提升了数据效率、跟踪性能以及学习控制的泛化能力。

摘要 (Abstract)

Graph neural networks (GNNs) are a well-regarded tool for learned control of networked dynamical systems due to their ability to be deployed in a distributed manner. However, current distributed GNN architectures assume that all nodes in the network collect geometric observations in compatible bases, which limits the usefulness of such controllers in GPS-denied and compass-denied environments. This paper presents a GNN parametrization that is globally invariant to choice of local basis. 2D geometric features and transformations between bases are expressed in the complex domain. Inside each GNN layer, complex-valued linear layers with phase-equivariant activation functions are used. When viewed from a fixed global frame, all policies learned by this architecture are strictly invariant to choice of local frames. This architecture is shown to increase the data efficiency, tracking performance, and generalization of learned control when compared to a real-valued baseline on an imitation learning flocking task.

关键词: Graph Neural Networks, Distributed Control, Basis Invariance, Complex-valued Networks, Flocking, Imitation Learning, Planar Systems

234. ❌ AXELRAM: Quantize Once, Never Dequantize

作者: Yasushi Nishida 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02638v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究KV cache的量化压缩和推理加速技术，与’Large Language Models’、‘KV Cache Compression’、‘Quantization’、‘Speculative Decoding’高度相关（10分）。论文涉及LLaMA、Qwen等大模型，并直接处理KV cache的量化问题以实现102.4倍的乘法减少，属于推理加速范畴。其他关键词如MoE、SFT、RAG、AI for Science等与论文硬件架构和量化方法无直接关联，故评0分。

!!! tip deepseek-chat TL;DR

论文提出AXELRAM硬件架构，通过设计时固定码本和正交变换量化，实现无需反量化的KV cache注意力计算，减少102.4倍乘法操作，并发现并解决了量化导致的灾难性PPL尖峰问题。

摘要翻译

我们提出AXELRAM，一种智能SRAM宏架构，能够直接从量化后的KV缓存索引计算注意力分数，而无需反量化操作。其关键实现在于一个设计时固定的码本：基于正交变换的量化方法将每个坐标的分布集中到N(0,1/d)，因此最优量化器仅取决于维度d和比特宽度b，而与输入数据无关。非对称路径设计——写入时进行变换，读取时通过查表操作完成（无需逆变换）——将每次查询的乘法运算量减少了102.4倍（基于数学恒等式）。
通过多种子评估（10个种子×3个模型），我们发现符号模式敏感性会在某些模型（如Qwen2.5-3B）上引发灾难性的困惑度（PPL）尖峰（变化量Delta > 50），而其他模型（如LLaMA-3.1-8B）则完全稳定。这一现象将SpinQuant在权重量化中观察到的旋转方差问题扩展到了KV缓存领域，且其影响在性质上更为严重。我们将根本原因追溯至层间范数异质性，并提出了一种无需梯度的符号模式选择方法（从200个候选模式中，使用8个校准样本进行一次性选择），该方法能以零额外硬件成本消除灾难性尖峰。所有源代码已在https://github.com/Axelidea/AXELRAM公开。

摘要 (Abstract)

We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate’s distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design – transform on write, table-lookup on read with no inverse transform – reduces per-query multiplications by 102.4x (a mathematical identity). Through multi-seed evaluation (10 seeds x 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Delta > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant’s observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at https://github.com/Axelidea/AXELRAM.

关键词: KV cache, quantization, attention computation, inference acceleration, hardware architecture, orthogonal transform, catastrophic PPL spikes, sign pattern sensitivity

235. ❌ Structure-Preserving Multi-View Embedding Using Gromov-Wasserstein Optimal Transport

作者: Rafael Pereira Eufrazio, Eduardo Fernandes Montesuma, Charles Casimiro Cavalcante 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02610v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究多视图数据分析和Gromov-Wasserstein最优传输在嵌入学习中的应用，属于传统机器学习/表示学习领域。所有评分关键词均针对大模型、深度学习技术及其应用（如LLM、MoE、RLHF、RAG、量化等），而本文完全不涉及这些内容，也未提及任何AI for Science的具体应用（如生物信息学、化学信息学）。因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

该论文提出了两种基于Gromov-Wasserstein最优传输的几何感知多视图嵌入方法，用于在异构几何或非线性扭曲下有效保留多视图数据的内在关系结构。

摘要翻译

多视图数据分析旨在整合同一样本的多种表征，以恢复一致的低维结构。传统方法通常依赖于特征拼接或显式对齐假设，这些方法在异构几何或非线性畸变下具有局限性。本研究基于格罗莫夫-瓦瑟斯坦最优传输理论，提出了两种几何感知的多视图嵌入策略。第一种方法称为均值-GW多维标度法，通过平均各视图的距离矩阵并应用基于GW的多维标度分析来获取代表性嵌入。第二种策略称为多重-GW多维标度法，采用基于选择的范式：通过基于GW的对齐生成多个几何一致的候选嵌入，并从中选取代表性嵌入。在合成流形和真实数据集上的实验表明，所提方法能有效保持跨视图的内在关联结构。这些结果凸显了基于GW的方法可作为多视图表征学习中灵活且具有理论依据的框架。

摘要 (Abstract)

Multi-view data analysis seeks to integrate multiple representations of the same samples in order to recover a coherent low-dimensional structure. Classical approaches often rely on feature concatenation or explicit alignment assumptions, which become restrictive under heterogeneous geometries or nonlinear distortions. In this work, we propose two geometry-aware multi-view embedding strategies grounded in Gromov-Wasserstein (GW) optimal transport. The first, termed Mean-GWMDS, aggregates view-specific relational information by averaging distance matrices and applying GW-based multidimensional scaling to obtain a representative embedding. The second strategy, referred to as Multi-GWMDS, adopts a selection-based paradigm in which multiple geometry-consistent candidate embeddings are generated via GW-based alignment and a representative embedding is selected. Experiments on synthetic manifolds and real-world datasets show that the proposed methods effectively preserve intrinsic relational structure across views. These results highlight GW-based approaches as a flexible and principled framework for multi-view representation learning.

关键词: Multi-view data analysis, Gromov-Wasserstein optimal transport, Geometry-aware embedding, Relational structure preservation, Multidimensional scaling, Heterogeneous geometries, Representation learning, Distance matrices

236. ❌ Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

作者: Mohammed Suhail B Nadaf 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02608v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文核心研究Function Vectors（FVs）在大型语言模型中的工作机制，属于大模型技术原理创新。高度相关关键词：1）‘Large Language Models’（论文研究Llama-3.1-8B、Gemma-2-9B、Mistral-7B-v0.3等模型）；2）‘Mechanistic Interpretability’（研究FVs如何通过残差流干预模型行为，属于可解释性研究）；3）‘In-context Learning’（FVs提取自上下文学习演示）。中等相关：‘Instruction Tuning’（论文测试了基础模型和指令调优模型）。其他关键词如MoE、量化、RAG等未涉及。

!!! tip deepseek-chat TL;DR

该论文研究发现，从上下文学习演示中提取的函数向量（FVs）能够通过残差流干预成功引导大语言模型行为，即使logit lens无法解码正确答案，揭示了FVs编码的是计算指令而非答案方向，并发现了不同模型家族在干预机制上的差异。

摘要翻译

功能向量（Function vectors, FVs）——从上下文学习示例中提取的均值差分方向——当被添加到残差流中时，能够引导大语言模型的行为。我们曾假设，FV引导失败反映了任务相关信息的缺失：logit透镜会与引导同时失效。但我们错了。在迄今为止最全面的跨模板FV迁移研究中——涵盖12个任务、来自3个模型家族（Llama-3.1-8B、Gemma-2-9B、Mistral-7B-v0.3；包括基础版和指令调优版）的6个模型，每个任务使用8个模板，共计4,032对组合——我们发现了相反的解离现象：即使logit透镜在任何层都无法解码出正确答案，FV引导依然能够成功。这种“可引导但不可解码”的模式是普遍存在的：在所有模型的所有任务上，引导的准确率均超过logit透镜，差距最大可达-0.91。仅在72个任务-模型实例中的3个（均出现在Mistral模型中）观察到了预测的“可解码但不可引导”模式。FV词汇投影显示，即使达到超过0.90引导准确率的FV，其投影出的词元分布依然不连贯，这表明FV编码的是计算指令而非答案方向。FV在早期层（L2-L8）干预效果最佳；而logit透镜仅在晚期层（L28-L32）才能检测到正确答案。先前报道的负余弦迁移相关性（r=-0.572）在大规模分析中消失：汇总后的r值范围在-0.199至+0.126之间，并且余弦相似度在任务身份信息之外对R平方的贡献不足0.011。引导后分析揭示了模型家族间的差异：Mistral的FV重写了中间表示；而Llama/Gemma的FV尽管引导成功，却产生了近乎零的变化。激活修补确认了因果定位：简单任务在目标层实现了完美恢复；困难任务在所有层均显示为零恢复。

摘要 (Abstract)

Function vectors (FVs) – mean-difference directions extracted from in-context learning demonstrations – can steer large language model behavior when added to the residual stream. We hypothesized that FV steering failures reflect an absence of task-relevant information: the logit lens would fail alongside steering. We were wrong. In the most comprehensive cross-template FV transfer study to date - 4,032 pairs across 12 tasks, 6 models from 3 families (Llama-3.1-8B, Gemma-2-9B, Mistral-7B-v0.3; base and instruction-tuned), 8 templates per task - we find the opposite dissociation: FV steering succeeds even when the logit lens cannot decode the correct answer at any layer. This steerability-without-decodability pattern is universal: steering exceeds logit lens accuracy for every task on every model, with gaps as large as -0.91. Only 3 of 72 task-model instances show the predicted decodable-without-steerable pattern, all in Mistral. FV vocabulary projection reveals that FVs achieving over 0.90 steering accuracy still project to incoherent token distributions, indicating FVs encode computational instructions rather than answer directions. FVs intervene optimally at early layers (L2-L8); the logit lens detects correct answers only at late layers (L28-L32). The previously reported negative cosine-transfer correlation (r=-0.572) dissolves at scale: pooled r ranges from -0.199 to +0.126, and cosine adds less than 0.011 in R-squared beyond task identity. Post-steering analysis reveals a model-family divergence: Mistral FVs rewrite intermediate representations; Llama/Gemma FVs produce near-zero changes despite successful steering. Activation patching confirms causal localization: easy tasks achieve perfect recovery at targeted layers; hard tasks show zero recovery everywhere.

关键词: Function Vectors, Large Language Models, In-context Learning, Logit Lens, Residual Stream, Steering, Mechanistic Interpretability, Model Behavior

237. ❌ WGFINNs: Weak formulation-based GENERIC formalism informed neural networks’

作者: Jun Sur Richard Park, Auroni Huque Hashim, Siu Wun Cheung, Youngsoo Choi, Yeonjong Shin 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02601v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文WGFINNs专注于科学机器学习中从噪声数据发现控制方程的特定方法，提出了一种基于弱形式的GENERIC形式主义神经网络。所有关键词（除了’AI for Science’）都直接涉及大语言模型（LLMs）或深度学习技术原理的创新，而该论文的核心是物理信息神经网络（PINNs）的变体，用于解决偏微分方程，不涉及LLMs、MoE、SLMs、缩放定律、预训练、后训练、对齐、RLHF、PEFT、RAG、上下文扩展、注意力优化、推理技术、代理系统、量化、解码加速、幻觉缓解、可解释性、世界模型、模型合并、上下文学习等主题。因此，仅’AI for Science’相关，因为论文属于科学AI应用（科学机器学习），但并非核心LLM技术，故给5分（有一定关联）。其他关键词完全无关，给0分。

!!! tip deepseek-chat TL;DR

该论文解决了从噪声观测数据中数据驱动发现控制方程的挑战，通过提出弱形式GENERIC形式主义神经网络（WGFINNs），在保留热力学定律的同时显著提高了对噪声数据的鲁棒性，并实现了更准确的预测和物理量恢复。

摘要翻译

从含噪观测数据中驱动发现控制方程，始终是科学机器学习领域的核心挑战。虽然基于 GENERIC 形式主义的神经网络（GFINNs）提供了一个原则性框架，通过其构造强制满足热力学定律，但其对强形式损失函数的依赖使其对测量噪声极为敏感。为克服这一局限，我们提出了基于弱形式的 GENERIC 形式主义神经网络（WGFINNs），该模型将动力系统的弱形式与 GFINNs 的结构保持架构相结合。WGFINNs 在保持精确满足 GENERIC 退化条件和对称条件的同时，显著增强了对含噪数据的鲁棒性。我们进一步引入了状态变量加权损失函数和基于残差的注意力机制，以缓解不同状态变量间的尺度不平衡问题。理论分析对比了强形式估计量与弱形式估计量之间的定量差异：主要在于，当存在噪声时，强形式估计量会随着时间步长减小而发散；而只要测试函数满足特定条件，弱形式估计量即使在含噪数据下也能保持准确。数值实验表明，在不同噪声水平下，WGFINNs 的表现始终优于 GFINNs，能实现更准确的预测和更可靠的物理量恢复。

摘要 (Abstract)

Data-driven discovery of governing equations from noisy observations remains a fundamental challenge in scientific machine learning. While GENERIC formalism informed neural networks (GFINNs) provide a principled framework that enforces the laws of thermodynamics by construction, their reliance on strong-form loss formulations makes them highly sensitive to measurement noise. To address this limitation, we propose weak formulation-based GENERIC formalism informed neural networks (WGFINNs), which integrate the weak formulation of dynamical systems with the structure-preserving architecture of GFINNs. WGFINNs significantly enhance robustness to noisy data while retaining exact satisfaction of GENERIC degeneracy and symmetry conditions. We further incorporate a state-wise weighted loss and a residual-based attention mechanism to mitigate scale imbalance across state variables. Theoretical analysis contrasts quantitative differences between the strong-form and the weak-form estimators. Mainly, the strong-form estimator diverges as the time step decreases in the presence of noise, while the weak-form estimator can be accurate even with noisy data if test functions satisfy certain conditions. Numerical experiments demonstrate that WGFINNs consistently outperform GFINNs at varying noise levels, achieving more accurate predictions and reliable recovery of physical quantities.

关键词: GENERIC formalism, weak formulation, neural networks, scientific machine learning, noisy data, thermodynamics, dynamical systems, physical quantities

238. ❌ High-dimensional Many-to-many-to-many Mediation Analysis

作者: Tien Dat Nguyen, Trung Khang Tran, Cong Khanh Truong, Duy-Cat Can, Binh T. Nguyen, Oliver Y. Chén 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.02886v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 论文研究高维多对多对多中介分析（MMM）的统计方法，应用于阿尔茨海默病神经影像学数据，分析基因-神经-认知通路。所有关键词均与大模型、深度学习技术原理或应用无关，仅最后一个关键词’AI for Science OR Bioinformatics OR Cheminformatics’因论文涉及生物信息学数据分析（基因组和神经影像数据）而获得5分（有一定关联），但论文核心是传统统计方法而非AI技术。

!!! tip deepseek-chat TL;DR

该论文提出了一种高维多对多对多中介分析（MMM）的统计方法，用于研究基因、脑区和认知结果之间的复杂通路，并在阿尔茨海默病数据中验证了其有效性和生物可解释性。

摘要翻译

本研究探讨高维中介分析问题，其中暴露变量、中介变量与结局变量均为多元变量，且暴露变量与中介变量均可能具有高维特性。我们将此问题形式化为一个多（暴露）-对-多（中介）-对-多（结局）（MMM）的中介分析框架。在方法学上，MMM中介分析能够同时对高维暴露变量与中介变量进行变量选择，估计间接效应矩阵（即连接暴露-中介与中介-结局通路的系数矩阵），并实现多元结局的预测。在理论上，我们证明了所估计的间接效应矩阵具有一致性及元素级渐近正态性，并推导了估计量的误差界。为评估MMM中介框架的有效性，我们首先通过模拟研究考察其有限样本性能，包括收敛性质、渐近近似的表现以及对噪声的稳健性。随后，我们将MMM中介分析应用于阿尔茨海默病神经影像学倡议（Alzheimer’s Disease Neuroimaging Initiative）的数据，研究202个脑区的皮质厚度如何中介688个全基因组显著单核苷酸多态性（single nucleotide polymorphisms, SNPs）（从约150万个SNPs中筛选得出）对十一个认知行为与诊断结局的影响。MMM中介框架识别出具有生物学可解释性的多对多对多遗传-神经-认知通路，并提升了后续样本外分类与预测的性能。综上所述，我们的结果证明了MMM中介分析的潜力，并凸显了统计方法在科学研究中探索复杂高维多层级通路的价值。MMM软件包可在https://github.com/THELabTop/MMM-Mediation获取。

摘要 (Abstract)

We study high-dimensional mediation analysis in which exposures, mediators, and outcomes are all multivariate, and both exposures and mediators may be high-dimensional. We formalize this as a many (exposures)-to-many (mediators)-to-many (outcomes) (MMM) mediation analysis problem. Methodologically, MMM mediation analysis simultaneously performs variable selection for high-dimensional exposures and mediators, estimates the indirect effect matrix (i.e., the coefficient matrices linking exposure-to-mediator and mediator-to-outcome pathways), and enables prediction of multivariate outcomes. Theoretically, we show that the estimated indirect effect matrices are consistent and element-wise asymptotically normal, and we derive error bounds for the estimators. To evaluate the efficacy of the MMM mediation framework, we first investigate its finite-sample performance, including convergence properties, the behavior of the asymptotic approximations, and robustness to noise, via simulation studies. We then apply MMM mediation analysis to data from the Alzheimer’s Disease Neuroimaging Initiative to study how cortical thickness of 202 brain regions may mediate the effects of 688 genome-wide significant single nucleotide polymorphisms (SNPs) (selected from approximately 1.5 million SNPs) on eleven cognitive-behavioral and diagnostic outcomes. The MMM mediation framework identifies biologically interpretable, many-to-many-to-many genetic-neural-cognitive pathways and improves downstream out-of-sample classification and prediction performance. Taken together, our results demonstrate the potential of MMM mediation analysis and highlight the value of statistical methodology for investigating complex, high-dimensional multi-layer pathways in science. The MMM package is available at https://github.com/THELabTop/MMM-Mediation.

关键词: mediation analysis, high-dimensional, many-to-many-to-many, statistical methodology, Alzheimer’s Disease, genetic-neural-cognitive pathways, variable selection, indirect effect matrix

239. ❌ Low-Scaling Many-Body Green’s Function Calculations for Molecular Systems via Interacting-Bath Dynamical Embedding Theory

作者: Christian Venturella, Jiachen Li, Tianyu Zhu 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03137v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于计算化学领域，提出了一种用于分子系统格林函数计算的嵌入方法（ibDET），旨在高效计算带电激发能。论文内容涉及量子化学方法（如GW和EOM-CCSD）、嵌入理论、以及分子系统的光谱性质计算，属于计算化学和物理化学范畴。所有关键词均与大模型、深度学习、AI技术原理或应用直接相关，而本文未涉及任何大模型、深度学习或AI技术，仅最后一个关键词“AI for Science OR Bioinformatics OR Cheminformatics”与科学计算有一定广义关联，但论文并未使用AI方法，而是传统量子化学计算，因此给予5分（有一定关联）。其他关键词完全无关，均为0分。

!!! tip deepseek-chat TL;DR

该论文提出了一种名为相互作用浴动态嵌入理论（ibDET）的分子格林函数嵌入方法，用于高效且可扩展地计算分子系统的带电激发能和光谱性质，在保持高精度（误差约0.1 eV或更小）的同时大幅降低了计算成本。

摘要翻译

本文对我们近期提出的格林函数嵌入方法——相互作用浴动态嵌入理论（ibDET）进行了分子体系扩展，用于在$GW$和EOM-CCSD级别计算带电激发能。该方法以原子中心杂质为起点，构建能够捕捉杂质与其环境之间频率依赖纠缠的浴表示，并可通过构建簇特异性自然轨道（cluster-specific natural orbitals）进行系统性改进。利用$GW$或耦合簇格林函数求解器，从所有嵌入问题中组装完整体系的自能量，从而获得相互作用的格林函数。研究表明，对于包括共轭分子和纳米团簇在内的多种体系，ibDET能以大幅降低的计算成本提供精确的光谱性质。与全体系计算结果相比，该方法预测的电离势和电子亲和能的误差约为0.1 eV或更小，而每个嵌入问题仅包含总轨道空间的一小部分。这项工作为计算分子体系的光谱性质提供了一个高效且可扩展的框架。

摘要 (Abstract)

We present a molecular extension of our recently proposed Green’s function embedding method, interacting-bath dynamical embedding theory (ibDET), for computing charged excitation energies at the $GW$ and EOM-CCSD levels. Starting from atom-centered impurities, we construct bath representations that capture the frequency-dependent entanglement between the impurity and its environment and can be systematically improved via the construction of cluster-specific natural orbitals. Utilizing a $GW$ or coupled-cluster Green’s function solver, the self-energy of the full system is assembled from all embedding problems to obtain the interacting Green’s function. We show that ibDET provides accurate spectral properties with much reduced cost for a broad range of systems, including conjugated molecules and nanoclusters. Compared with full-system results, the errors in the predicted ionization potentials and electron affinities are around 0.1 eV or smaller, while each embedding problem includes only a small fraction of the total orbital space. This work provides an efficient and scalable framework for computing spectral properties of molecular systems.

关键词: Green’s function embedding, interacting-bath dynamical embedding theory, molecular systems, charged excitation energies, GW method, EOM-CCSD, spectral properties, scalable framework

240. ❌ Dataset Distillation for Machine Learning Force Field in Phase Transition Regime

作者: Ruiyang Chen, Qingyuan Zhang, Ji Chen 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03027v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文专注于机器学习力场（MLFF）的数据集蒸馏方法，属于AI for Science（科学AI）领域，具体应用于原子模拟和相变研究。论文未涉及任何大语言模型（LLM）、深度学习技术原理创新或大模型在不同领域的应用，仅与“AI for Science”关键词有中等关联（5分），因为它是科学计算中的AI应用，但未涉及生物信息学或化学信息学。其他所有关键词均完全无关（0分）。

!!! tip deepseek-chat TL;DR

该论文提出了一种用于机器学习力场训练的数据集蒸馏算法（CPD），解决了相变区域训练效率低的问题，并在液态氢的液-液相变中验证了仅需200个配置即可训练出高保真MLFF。

摘要翻译

机器学习力场（Machine Learning Force Field, MLFF）已成为原子尺度模拟中一种强大的数据驱动工具，能够以媲美\textit{ab initio}（第一性原理）方法的精度模拟大规模复杂原子体系。然而，在相变区域，由于结构涨落显著增强，MLFF 的训练效率往往较低。为应对这一挑战，我们提出了一种用于训练数据集蒸馏的中心-外围蒸馏算法。该算法通过策略性地整合代表性样本与关键边界案例，确保蒸馏后的数据集保持最大的结构多样性。我们在稠密氢的液-液相变上验证了 CPD 方法的有效性。结果表明，采用 CPD 方法，仅需 200 个构型即可训练出一个能够完整复现相变区域附近液氢结构性质与动力学性质的 MLFF。这项工作为 MLFF 训练数据集的高保真标注（例如采用超越标准密度泛函理论的高阶\textit{ab initio}计算）铺平了道路，从而提升了 MLFF 的预测精度。

摘要 (Abstract)

Machine learning force field (MLFF) has emerged as a powerful data-driven tool for atomistic simulations, enabling large-scale and complex atomic systems to be simulated with accuracy comparable to \textit{ab initio} methods. However, MLFFs often suffer from low training efficiency in the phase transition regime, where structural fluctuations are significantly elevated. To address this challenge, we propose a Central-Peripheral Distillation (CPD) algorithm for training dataset distillation. By strategically integrating representative samples with critical corner cases, the CPD algorithm ensures that the distilled dataset retains maximum structural diversity. We validated the efficacy of the CPD method on the liquid-liquid phase transition of dense hydrogen. Results show that, with the CPD approach, only 200 configurations are sufficient to train a MLFF that can fully reproduce the structural and dynamical properties of liquid hydrogen in the vicinity of its phase transition regime. This work paves the way for high-fidelity labeling of the MLFF training datasets, for instance by adopting high-level \textit{ab initio} calculations beyond the standard density functional theory, thereby enhancing the predictive accuracy of MLFFs.

关键词: Dataset Distillation, Machine Learning Force Field, Phase Transition, Central-Peripheral Distillation, Liquid-Liquid Phase Transition, Atomic Simulations, Training Efficiency, Structural Diversity

241. ❌ CO and N2 Produced from H2O, CO2, and NH3 Cometary Ice Analogs

作者: Alexandra McKinnon, Alexia Simon, Michelle R. Brann, Elettra L. Piacentino, Karin I. Oberg, Mahesh Rajappan 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03207v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文研究彗星冰模拟物中CO和N2的光化学形成机制，属于天体化学和行星科学领域，与所有评分关键词（均涉及大模型、深度学习及相关技术）完全无关，因此所有关键词相关度均为0分。

!!! tip deepseek-chat TL;DR

该研究通过实验模拟彗星冰在紫外线和电子轰击下的光化学反应，发现N2主要来自NH3的光解，而CO的丰度则更多反映彗星形成时的低温捕获过程。

摘要翻译

在彗星中已检测到一氧化碳（CO）和分子氮（N2）等高挥发性物质，若其存在源于冻结和/或捕获过程，则可用于约束彗星形成的温度条件。本文则探讨了彗星高挥发性物质的另一种可能起源：低挥发性物质的光解作用。我们表征了在紫外（UV）辐照和电子轰击条件下，对二氧化碳（CO2）、氨（NH3）、H2O:CO2、H2O:NH3及H2O:CO2:NH3等彗星冰模拟物进行处理后CO和N2的形成情况。研究发现，在10 K至100 K的温度范围内，所有经过光处理的冰中均会形成CO和N2，其相对于水的生成比例分别为0.4-0.9%和0.03-0.7%，且在整个实验过程中，CO/CO2和N2/NH3的混合比分别为2.5-62%和0.7-9%。由于我们初始的冰成分与星际冰合理匹配，且使用的紫外辐照条件类似于暗云环境，因此可将所得比例直接与彗星丰度进行比较。此类比较表明，虽然彗星中仅少数CO观测结果可直接由光解作用解释，但几乎所有观测到的彗星N2均可通过水冰中嵌入的NH3光解作用得到说明。后一结果也与在67P彗星中观测到的N2和NH3同位素比值同样升高的情况相符。综上所述，我们的研究表明，在推断彗星形成位置时，应谨慎使用低于1%的N2/H2O比值，而许多彗星中观测到的更显著的CO丰度则很可能暗示了其在低温冰环境中的捕获过程。

摘要 (Abstract)

Hypervolatile species such as carbon monoxide (CO) and molecular nitrogen (N2) have been detected in comets, and could be used to constrain comet formation temperature conditions if their presence is due to freeze-out and/or entrapment. Here we instead explore another plausible origin of cometary hypervolatiles: photodissociation of less volatile species. We characterize CO and N2 formation following ultraviolet (UV) irradiation and electron bombardment of carbon dioxide (CO2), ammonia (NH3), H2O:CO2, H2O:NH3, and H2O:CO2:NH3 cometary ice analogs. We find that CO and N2 form in all photoprocessed ices at temperatures between 10 K and 100 K, resulting in 0.4-0.9 % CO and 0.03-0.7 % N2 relative to water, and CO/CO2 and N2/NH3 mixing ratios of 2.5-62 % and 0.7-9 %, respectively, across the experiments. Because our initial ices are reasonably well-matched to interstellar ices and we use UV exposure similar to a dark cloud, we can compare the resulting ratios directly to cometary abundances. Such a comparison shows that while only a few of CO observations in comets are readily explained by photodissociation, almost all observed cometary N2 can be accounted for by photodissociation of NH3 embedded in water ice. The latter result is also consistent with observed similarly elevated isotopic ratios of N2 and NH3 in 67P. Taken together, our results suggest that N2/H2O ratios less than 1 % should be used cautiously when inferring a comet’s formation location, while the more substantial CO abundances seen in many comets do likely imply entrapment at low ice temperatures.

关键词: cometary ice analogs, photodissociation, carbon monoxide, molecular nitrogen, ultraviolet irradiation, ammonia, carbon dioxide, formation temperature

242. ❌ Regio-Connectivity and Torsional Angle Effects on Singlet Fission and SOCT-ISC in Aza-BODIPY Dimers

作者: Sophiya Goyal, S. Rajagopala Reddy 期刊/来源: arxiv 发布日期: 2026-04-03 arXiv链接: http://arxiv.org/abs/2604.03011v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

评分理由: 该论文研究的是化学领域的aza-BODIPY二聚体分子，通过量子化学计算研究分子几何结构对单线态裂变和自旋轨道电荷转移系间窜越的影响。论文内容完全属于计算化学和光物理领域，与所有大模型、深度学习、AI技术相关的关键词均无直接关联。唯一可能相关的关键词是’AI for Science OR Bioinformatics OR Cheminformatics’，但论文使用的是传统的量子化学计算方法（MP2、SA-XMCQDPT），而非AI/机器学习方法，因此只能给予5分（有一定关联），表示属于科学计算领域但未使用AI技术。

!!! tip deepseek-chat TL;DR

该研究通过量子化学计算揭示了aza-BODIPY二聚体的分子几何结构（特别是扭转角）对单线态裂变和自旋轨道电荷转移系间窜越这两种三重态生成机制的调控作用。

摘要翻译

氮杂-BODIPY二聚体是通过分子内单线态裂变（iSF）或自旋轨道电荷转移系间窜越（SOCT-ISC）实现高效三线态生成的有前景的分子体系。本研究利用多参考量子化学计算，在四种区域异构的氮杂-BODIPY二聚体（D[1,1]、D[1,3]、D[3,3]和D[2,2]）中探究了分子几何结构在调控这些机制中的作用。我们在MP2和SA-XMCQDPT理论水平上分析了基态和激发态性质，并通过评估绝热耦合与自旋轨道矩阵元分别估算了iSF和SOCT-ISC的速率常数。结果表明，三线态的形成主要受单体单元间扭转角（Φ）的强烈调控，而区域连接性仅产生次要影响。二聚体D[1,1]和D[1,3]表现出有利的iSF能量学与耦合强度，而D[2,2]则显示出较低的iSF速率常数（kSF），但其SOCT-ISC活性增强。D[3,3]二聚体虽能发生放热的多激子形成过程，但由于相消耦合作用导致iSF效率降低。主导的ISC通道通过S1-T3跃迁进行，该过程具有较大的自旋轨道耦合和较小的能隙。这些发现为理解氮杂-BODIPY二聚体中几何结构依赖的三线态生成机制提供了关键的理论依据。

摘要 (Abstract)

Aza-BODIPY dimers represent promising molecular systems for efficient triplet-state generation through either intramolecular-singlet fission (iSF) or spin-orbit charge transfer intersystem crossing (SOCT-ISC). In this work, we investigate the role of molecular geometry in governing these mechanisms across four regioisomeric aza-BODIPY dimers (D[1,1], D[1,3], D[3,3], and D[2,2]) using multireference quantum-chemical calculations. Ground- and excited-state properties were analyzed at the MP2 and SA-XMCQDPT levels of theory, while diabatic couplings and spin-orbit matrix elements were evaluated to estimate iSF and SOCT-ISC rate constants, respectively. Our results reveal that triplet formation is strongly governed by the torsional angle (Φ) between monomer units, with regio-connectivity exerting a secondary influence. Dimers D[1,1] and D[1,3] exhibit favorable iSF energetics and coupling magnitudes, whereas D[2,2] displays low iSF rate constant (kSF ) but enhanced SOCT-ISC activity. The D[3,3] dimer shows exothermic multiexciton formation but reduced iSF efficiency due to destructive coupling interactions. The dominant ISC channel proceeds through the S1-T3 transition with large spin-orbit coupling and a small energy gap. These findings provide critical mechanistic insights into geometry-dependent triplet generation in aza-BODIPY dimers.

关键词: aza-BODIPY dimers, singlet fission, SOCT-ISC, torsional angle, regio-connectivity, quantum-chemical calculations, triplet-state generation, molecular geometry

Token 消耗统计

总计: 783,738 tokens（输入 541,441 / 输出 242,297）

模型	输入	输出	合计
deepseek-chat	435,832	242,297	678,129
glm-4.7	105,609	0	105,609

📊 ArXiv 研究报告 (2026-04-07)

📌 配置信息

关键词列表（共 27 个，总权重 27.0）

评分设置

📈 论文统计

⭐ 及格论文详细分析

1. AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

2. Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Traini

3. R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

4. Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attribute

5. FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

6. Overcoming the “Impracticality” of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagn

7. Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

8. Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

9. Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following

📋 所有论文列表

1. ✅ AutoVerifier: An Agentic Automated Verification Framework Using Large Language Models

2. ✅ Train Yourself as an LLM: Exploring Effects of AI Literacy on Persuasion via Role-playing LLM Training

3. ✅ R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

4. ✅ Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

5. ✅ FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

6. ✅ Overcoming the “Impracticality” of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework

7. ✅ Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity

8. ✅ Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

9. ✅ Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

10. ❌ Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

11. ❌ Towards Secure Agent Skills: Architecture, Threat Taxonomy, and Security Analysis

12. ❌ LLM-based Atomic Propositions help weak extractors: Evaluation of a Propositioner for triplet extraction

13. ❌ PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

14. ❌ Adaptive Semantic Communication for Wireless Image Transmission Leveraging Mixture-of-Experts Mechanism

15. ❌ Speaking of Language: Reflections on Metalanguage Research in NLP

16. ❌ THOM: Generating Physically Plausible Hand-Object Meshes From Text

17. ❌ A semicontinuous relaxation of Saito’s criterion and freeness as angular minimization

18. ❌ Enhancing Robustness of Federated Learning via Server Learning

19. ❌ PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

20. ❌ Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT – Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

21. ❌ Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

22. ❌ Gradient Boosting within a Single Attention Layer

23. ❌ Reflective Context Learning: Studying the Optimization Primitives of Context Space

24. ❌ Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

25. ❌ Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

26. ❌ Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

27. ❌ Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

28. ❌ InCoder-32B-Thinking: Industrial Code World Model for Thinking

29. ❌ AI-Assisted Unit Test Writing and Test-Driven Code Refactoring: A Case Study

30. ❌ A Systematic Security Evaluation of OpenClaw and Its Variants

31. ❌ Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

32. ❌ An Independent Safety Evaluation of Kimi K2.5

33. ❌ Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

34. ❌ AlertStar: Path-Aware Alert Prediction on Hyper-Relational Knowledge Graphs

35. ❌ Co-Evolution of Policy and Internal Reward for Language Agents

36. ❌ A Data-Centric Vision Transformer Baseline for SAR Sea Ice Classification

37. ❌ Automatic Textbook Formalization

38. ❌ Verbalizing LLMs’ assumptions to explain and control sycophancy

39. ❌ Querying Structured Data Through Natural Language Using Language Models

40. ❌ MECO: A Multimodal Dataset for Emotion and Cognitive Understanding in Older Adults

41. ❌ Analyzing Healthcare Interoperability Vulnerabilities: Formal Modeling and Graph-Theoretic Approach

42. ❌ JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

43. ❌ ARM: Advantage Reward Modeling for Long-Horizon Manipulation

44. ❌ Beyond Isolated Tasks: A Framework for Evaluating Coding Agents on Sequential Software Evolution

45. ❌ Comparing the Impact of Pedagogy-Informed Custom and General-Purpose GAI Chatbots on Students’ Science Problem-Solving Processes and Performance Using Heterogeneous Interaction Network Analysis

46. ❌ Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

47. ❌ User-Aware Conditional Generative Total Correlation Learning for Multi-Modal Recommendation

48. ❌ FedSQ: Optimized Weight Averaging via Fixed Gating

49. ❌ Self-Optimizing Multi-Agent Systems for Deep Research

50. ❌ Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

51. ❌ Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

52. ❌ InfoSeeker: A Scalable Hierarchical Parallel Agent Framework for Web Information Seeking

53. ❌ LogicPoison: Logical Attacks on Graph Retrieval-Augmented Generation

54. ❌ How Annotation Trains Annotators: Competence Development in Social Influence Recognition

55. ❌ AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

56. ❌ Learning from Synthetic Data via Provenance-Based Input Gradient Guidance

57. ❌ Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus

58. ❌ Split and Conquer Partial Deepfake Speech

59. ❌ Corporations Constitute Intelligence

60. ❌ Analysis of Optimality of Large Language Models on Planning Problems

61. ❌ RayMamba: Ray-Aligned Serialization for Long-Range 3D Object Detection

62. ❌ Toward an Artificial General Teacher: Procedural Geometry Data Generation and Visual Grounding with Vision-Language Models

63. ❌ Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions

64. ❌ One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging