📊 ArXiv Research Report (2026-04-10)

Generated: 2026-04-10 09:21:27
Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing-score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
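
As a quick check of the threshold, the stated formula can be evaluated directly. The sketch below is only an illustration: the function name `passing_score` and its parameter names are hypothetical, and only the constants 5.0 and 0.8 and the 27 weights of 1.0 come from the report.

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Passing score = base + per_weight * total weight, as stated in the report."""
    # Round to one decimal to match the precision the report prints.
    return round(base + per_weight * total_weight, 1)


weights = [1.0] * 27  # 27 keywords, each with weight 1.0
threshold = passing_score(sum(weights))
print(threshold)  # 26.6
```

With all weights equal to 1.0, the total weight equals the keyword count, so adding or removing a keyword moves the threshold by 0.8.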

📈 Paper statistics

Total fetched: 317 papers
Passing papers: 16 (5.0%)

⭐ Detailed analysis of passing papers

1. Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerabil

Authors: Zi Liang, Qipeng Xie, Jun He, Bohuan Xue, Weizheng Wang, Yuandao Cai, Fei Luo, Boxian Zhang, Haibo Hu, Kaishun Wu
Journal/Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06633v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
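
The total for this paper can be reproduced from the table. The sketch below assumes each row's score is weight × relevance, which matches every row shown here (all weights are 1.0) but is not stated explicitly in the report. The helper name `total_score` is hypothetical.

```python
def total_score(rows):
    """rows: iterable of (weight, relevance) pairs, relevance on a 0-10 scale."""
    # Assumption: per-keyword score = weight * relevance.
    return sum(weight * relevance for weight, relevance in rows)


# The 8 non-zero rows of the Argus table; the 19 zero-relevance rows add nothing.
argus_rows = [
    (1.0, 10.0),  # Large Language Models
    (1.0, 10.0),  # Retrieval-Augmented Generation
    (1.0, 5.0),   # Chain of Thought
    (1.0, 5.0),   # System 2 Thinking
    (1.0, 10.0),  # LLM Agents
    (1.0, 5.0),   # Tool Use
    (1.0, 10.0),  # Multi-agent Systems
    (1.0, 10.0),  # Hallucination Mitigation
]
PASSING_SCORE = 26.6  # from the scoring settings

score = total_score(argus_rows)
print(score, score >= PASSING_SCORE)  # 65.0 True
```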

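Scoring rationale: Argus proposes an LLM-based multi-agent framework for vulnerability detection in static application security testing. The core matching keywords are LLMs (the paper explicitly uses LLMs for vulnerability detection), RAG (it integrates RAG to reduce hallucinations), LLM Agents / Multi-agent Systems (it is presented as the first multi-agent framework designed specifically for vulnerability detection), and Hallucination Mitigation (it aims to reduce hallucinations and false positives). Chain of Thought, System 2 Thinking, and Tool Use are partially relevant because the framework involves reasoning depth and tool integration. The remaining keywords, such as MoE, SLMs, training methods, and model compression, are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes Argus, an LLM-based multi-agent framework that improves static application security testing by integrating RAG and collaborative workflows, significantly increasing vulnerability detection while reducing false positives and operational costs.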

Abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their application to Static Application Security Testing (SAST), primarily due to their superior contextual reasoning capabilities compared to traditional symbolic or rule-based methods. However, existing LLM-based approaches typically attempt to replace human experts directly without integrating effectively with existing SAST tools. This lack of integration results in ineffectiveness, including high rates of false positives, hallucinations, limited reasoning depth, and excessive token usage, making them impractical for industrial deployment. To overcome these limitations, we present a paradigm shift that reorchestrates the SAST workflow from current LLM-assisted structure to a new LLM-centered workflow. We introduce Argus (Agentic and Retrieval-Augmented Guarding System), the first multi-agent framework designed specifically for vulnerability detection. Argus incorporates three key novelties: comprehensive supply chain analysis, collaborative multi-agent workflows, and the integration of state-of-the-art techniques such as Retrieval-Augmented Generation (RAG) and ReAct to minimize hallucinations and enhance reasoning. Extensive empirical evaluation demonstrates that Argus significantly outperforms existing methods by detecting a higher volume of true vulnerabilities while simultaneously reducing false positives and operational costs. Notably, Argus has identified several critical zero-day vulnerabilities with CVE assignments.

Keywords: Large Language Models, Multi-agent Systems, Retrieval-Augmented Generation, Vulnerability Detection, Static Application Security Testing, Hallucination Mitigation, Agentic Workflow, Security Analysis

2. Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

Authors: Eduard Frankford, Erik Cikalleshi, Ruth Breu
Journal/Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.07304v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

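Scoring rationale: The paper studies the use of LLMs in programming-education assessment, in particular the problem that LLMs can generate code that students do not understand. Highly relevant keywords: LLMs (the paper explicitly uses LLMs), LLM Agents (it proposes a framework with a dual-agent conversational layer), and Hallucination Mitigation (it discusses hallucinations and designs guardrails). Moderately relevant keywords: Small Language Models (it mentions local-model deployment options), Chain of Thought / System 2 Thinking (stepwise reasoning and deeper understanding), Multi-agent Systems (the dual-agent architecture), and Explainable AI (its focus on explaining code understanding). Other keywords such as MoE, Scaling Laws, and RLHF are unrelated to the paper's technical content.

!!! tip deepseek-chat TL;DR

The paper studies how LLMs and conversational-agent frameworks can be used to assess students' code understanding in automated programming assessment systems. It proposes a Hybrid Socratic Framework that combines deterministic code analysis with a dual-agent conversational layer, addressing the problem of students submitting LLM-generated code they do not understand.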

Abstract

Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.

Keywords: Large Language Models, Automated Programming Assessment, Conversational Assessment, Code Understanding, Hybrid Socratic Framework, Dual-agent Architecture, Hallucination Mitigation, Educational Technology

3. Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a

Authors: Aidan Mannion, Cécile Macaire, Armand Violle, Stéphane Ohayon, Xavier Tannier, Didier Schwab, Lorraine Goeuriot, François Portet
Journal/Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06903v1

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	10.0/10	10.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

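Scoring rationale: The paper's core topic is domain-adaptive pre-training (DAPT) of large language models (LLMs) for the French biomedical domain, so it is highly relevant to keyword 1, Large Language Models (10 points). It explicitly uses DAPT for domain adaptation, so it is highly relevant to keyword 5, Pre-training / Domain Adaptation (10 points). It studies a biomedical AI application, so it is highly relevant to keyword 27, AI for Science (10 points). It works with small to mid-sized LLMs, so it is partially relevant to keyword 3, Small Language Models (5 points). It emphasizes collecting high-quality French biomedical text, so it is partially relevant to keyword 4, Scaling Laws and Data Quality (5 points). It finds that model merging is essential for mitigating generalization trade-offs, so it is highly relevant to keyword 25, Model Merging (10 points). Other keywords, such as MoE, SFT, RLHF, RAG, reasoning methods, agents, and compression, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

This study examines whether domain-adaptive pre-training (DAPT) is effective for specializing small to mid-sized LLMs in the French biomedical domain. It finds DAPT viable in smaller-scale, resource-constrained settings, but model merging is essential for mitigating the trade-off between gains on specialized tasks and degradation of general capabilities.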

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.

Keywords: Domain-adaptive pre-training, French biomedical domain, Large language models, Continued pre-training, Model merging, Domain adaptation, Biomedical corpus, Generalization trade-offs
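
The abstract does not say which merging method the authors use. As a rough illustration of the general idea only, the sketch below merges a base model and a domain-adapted model by linear interpolation of their parameters (weight averaging). The function `merge_weights`, the parameter names, and the toy values are all hypothetical.

```python
def merge_weights(base, adapted, alpha=0.5):
    """Linear interpolation per parameter: (1 - alpha) * base + alpha * adapted.

    base, adapted: dicts mapping parameter names to flat lists of floats.
    alpha = 0 returns the base model, alpha = 1 the adapted model.
    """
    if base.keys() != adapted.keys():
        raise ValueError("models must share the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(adapted[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [
            (1 - alpha) * b + alpha * a
            for b, a in zip(base[name], adapted[name])
        ]
    return merged


# Toy parameters standing in for a general model and its DAPT counterpart.
base = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
dapt = {"layer.weight": [1.0, 4.0], "layer.bias": [3.0]}
merged = merge_weights(base, dapt, alpha=0.5)
print(merged)  # {'layer.weight': [0.5, 3.0], 'layer.bias': [2.0]}
```

In this simple scheme, `alpha` controls how far the merged model moves from general capability toward the specialized weights.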

4. TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Authors: Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen
Journal/Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.07223v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

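Scoring rationale: The paper's core topic is guardrails for LLMs acting as autonomous agents in multi-step tool-calling trajectories, so it is highly relevant to “Large Language Models”, “LLM Agents”, and “Tool Use” (10 points each). It touches on safety alignment and hallucination mitigation, but these are not the main focus (5 points each). It also involves evaluation of multi-step reasoning (5 points). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract (0 points).

!!! tip deepseek-chat TL;DR

The paper evaluates how effective guardrails are when LLMs act as autonomous agents in multi-step tool-calling trajectories. It finds that guardrail efficacy depends more on structured-data competence than on semantic safety alignment, and that model architecture influences risk-detection performance more than model size.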

Abstract

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($ρ=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.

Keywords: LLM guardrails, multi-step tool-calling, autonomous agents, safety benchmark, trajectory analysis, risk detection, agentic workflows, structural reasoning

5. TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design

Authors: Juan Du, Yueteng Wu, Pan Zhao, Yuze Liu, Min Zhang, Xiaobin Xu, Xinglong Zhang
Journal/Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06747v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

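Scoring rationale: The paper proposes TurboAgent, an LLM-driven autonomous multi-agent framework for turbomachinery aerodynamic design. The LLM serves as the hub for task planning and coordination, working with several specialized agents (generative design, performance prediction, optimization, and validation). It is therefore highly relevant to “Large Language Models”, “LLM Agents”, and “Multi-agent Systems” (10 points each). Because the LLM coordinates specialized agents to carry out tasks, it is partially relevant to “Tool Use” (5 points). The application domain is engineering science (turbomachinery design), which falls under “AI for Science” (10 points). Other keywords (such as MoE, SFT, RAG, and quantization) are not mentioned in the abstract and are unrelated to the paper's technical content (0 points).

!!! tip deepseek-chat TL;DR

The study proposes TurboAgent, an LLM-based autonomous multi-agent framework for turbomachinery aerodynamic design, where traditional trial-and-error methods are inefficient and end-to-end automation is hard to achieve. Validation shows that the framework generates a final design from natural-language requirements, completes the closed-loop workflow in about 30 minutes under parallel computing, and improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%.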

Abstract

The aerodynamic design of turbomachinery is a complex and tightly coupled multi-stage process involving geometry generation, performance prediction, optimization, and high-fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end-to-end design challenging. To address this issue, this study proposes TurboAgent, a large language model (LLM)-driven autonomous multi-agent framework for turbomachinery aerodynamic design and optimization. The LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi-objective optimization, and physics-based validation. The framework transforms traditional trial-and-error design into a data-driven collaborative workflow, with high-fidelity simulations retained for final verification. A transonic single-rotor compressor is used for validation. The results show strong agreement between target performance, generated designs, and CFD simulations. The coefficients of determination ($R^2$) for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. The optimization agent further improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. The complete workflow can be executed within approximately 30 minutes under parallel computing. These results demonstrate that TurboAgent enables an autonomous closed-loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.

Keywords: LLM-driven autonomous multi-agent framework, turbomachinery aerodynamic design, task planning and coordination, specialized agents, generative design, multi-objective optimization, physics-based validation, closed-loop design process
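
The two agreement metrics quoted in the abstract, the coefficient of determination and normalized RMSE, can be computed as in the sketch below. The abstract does not state how the RMSE is normalized. Dividing by the range of the reference values is an assumption made here, and the sample numbers are invented for illustration.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def nrmse(y_true, y_pred):
    """RMSE divided by the range of y_true (assumed normalization)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


# Hypothetical target values and the values achieved by generated designs.
target = [1.0, 2.0, 3.0, 4.0]
achieved = [1.1, 1.9, 3.2, 3.8]

print(f"R^2   = {r_squared(target, achieved):.3f}")
print(f"NRMSE = {nrmse(target, achieved):.1%}")
```

Other common conventions normalize by the mean or the standard deviation of the reference values, which would change the reported percentage but not the ranking of designs.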

6. ReDAct: Uncertainty-Aware Deferral for LLM Agents

Authors: Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov, Ilya Makarov, Timothy Baldwin, Preslav Nakov, Roman Vashurin, Maxim Panov
Journal/Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.07036v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

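Scoring rationale: The paper proposes the ReDAct framework, which addresses hallucinations of LLM agents in sequential decision-making by pairing a small and a large LLM and deferring decisions based on an uncertainty threshold, balancing cost against performance. It is therefore highly relevant to “Large Language Models”, “Small Language Models”, “LLM Agents”, and “Hallucination Mitigation” (10 points each), which are the core of the paper. It is partially relevant to “Speculative Decoding” (5 points) because it concerns inference-cost optimization, although it does not directly discuss decoding-acceleration techniques. Other keywords such as MoE, Scaling Laws, and Pre-training are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper targets the accumulation of errors caused by hallucinations when LLM agents make sequential decisions. It proposes ReDAct, which pairs a small and a large LLM and defers a decision to the large model when the small model's uncertainty exceeds a threshold. Deferring only about 15% of decisions matches the quality of using the large model exclusively while significantly reducing inference cost.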

Abstract

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.

Keywords: LLM Agents, Uncertainty-Aware Deferral, Hallucination Mitigation, Sequential Decision-Making, Inference Cost Reduction, Small Language Models, Large Language Models, ReDAct
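
The deferral rule described in the abstract can be sketched as follows. This is not the authors' implementation: the abstract does not say which uncertainty measure or calibration procedure ReDAct uses, so the sketch assumes Shannon entropy over the small model's action probabilities and a hand-picked threshold. The names `choose_action` and `large_model` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_action(small_probs, actions, large_model, threshold):
    """Return (action, deferred).

    Uses the small model's most likely action unless its entropy
    exceeds the threshold, in which case the large model decides.
    """
    if entropy(small_probs) > threshold:
        return large_model(), True
    best = max(range(len(actions)), key=lambda i: small_probs[i])
    return actions[best], False


actions = ["open drawer", "go to desk"]
large_model = lambda: "go to desk"  # stand-in for an expensive LLM call

# Confident small model: entropy is about 0.20, below the 0.5 threshold.
print(choose_action([0.95, 0.05], actions, large_model, threshold=0.5))
# Uncertain small model: entropy is about 0.69, so the decision is deferred.
print(choose_action([0.5, 0.5], actions, large_model, threshold=0.5))
```

Under this sketch, the fraction of steps that return `deferred=True` plays the role of the roughly 15% deferral rate the abstract reports.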

7. When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t

Authors: Jonathan Nemitz, Carsten Eickhoff, Junyi Jessy Li, Kyle Mahowald, Michal Golovanevsky, William Rudman
Journal/Source: arXiv
Published: 2026-04-07
arXiv link: http://arxiv.org/abs/2604.06422v1

Score: 44.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	5.0/10	5.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

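Scoring rationale: The paper studies reasoning consistency and introspective ability in Vision-Language Models (VLMs). It is related to LLMs (5 points) and touches on reasoning processes (Chain of Thought and System 2 Thinking, 5 points each), introspection and calibration (Self-Correction, 8 points), factuality and hallucination mitigation (Hallucination Mitigation, 8 points), interpretability (Mechanistic Interpretability, 8 points), and world-knowledge priors (World Models, 5 points). Other keywords, such as MoE, training techniques, compression, and agents, are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper finds that, unlike humans, vision-language models (VLMs) systematically violate their own introspective reasoning rules in a color-labeling task. This suggests that VLM reasoning failures are driven not by task difficulty but by miscalibrated introspective self-knowledge.

Abstract (Chinese translation)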

理解视觉-语言模型（VLMs）何时会出现意外行为、模型能否可靠预测自身行为，以及模型是否遵循其内省推理过程，是实现可信部署的核心挑战。为研究这些问题，我们引入了分级颜色归因（Graded Color Attribution, GCA）数据集——一个旨在激发决策规则并评估参与者对这些规则遵循程度的受控基准。GCA包含三种条件下像素级颜色覆盖度各异的线条图：基于世界知识的重新着色、反事实重新着色，以及无颜色先验的形状。通过GCA，视觉-语言模型与人类参与者均会建立一个阈值：物体必须包含特定颜色像素的最小百分比才能获得该颜色标签。随后我们将这些规则与它们后续的颜色归因决策进行比较。研究发现表明，模型会系统性违背其自身的内省规则。例如，在具有强颜色先验的物体上，GPT-5-mini在近60%的情况下违反了其声明的内省规则。人类参与者则始终遵循其声明的规则，任何表面上的违规行为均可通过一种有充分记录的倾向——即高估颜色覆盖度——来解释。相比之下，我们发现视觉-语言模型能出色地估计颜色覆盖度，却在最终响应中公然违背自身的推理过程。在所有模型及激发内省规则的策略中，世界知识先验会系统性降低遵循程度，且这种降低方式并不反映人类认知模式。我们的研究结果挑战了“视觉-语言模型推理失败源于任务难度”的观点，表明视觉-语言模型的内省自我认知存在校准偏差，这对高风险场景下的部署具有直接启示。

Abstract

Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.

Keywords: Vision-Language Models, introspective reasoning, faithfulness, color attribution, world-knowledge priors, self-knowledge calibration, reasoning failures, trustworthy deployment
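
The comparison at the core of the GCA abstract (a stated pixel-coverage threshold versus the subsequent labeling decisions) can be expressed as a small faithfulness check. The sketch below is illustrative only: the `Trial` structure, the function names, and the exact definition of a violation (any decision that disagrees with the stated threshold) are assumptions, since the abstract does not specify how violations are counted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    coverage_pct: float      # percentage of object pixels in the target color
    labeled_as_color: bool   # whether the participant assigned that color label


def is_violation(stated_threshold_pct: float, trial: Trial) -> bool:
    """A decision violates the stated rule if it disagrees with the threshold."""
    rule_says_color = trial.coverage_pct >= stated_threshold_pct
    return rule_says_color != trial.labeled_as_color


def violation_rate(stated_threshold_pct: float, trials: List[Trial]) -> float:
    """Fraction of trials whose decision contradicts the participant's own rule."""
    if not trials:
        raise ValueError("at least one trial is required")
    violations = sum(is_violation(stated_threshold_pct, t) for t in trials)
    return violations / len(trials)
```

For example, a participant who states a 50% threshold but calls a 20%-covered object "red" (perhaps because it is an apple) contributes one violation. Grouping trials by condition (world-knowledge, counterfactual, no prior) would then show where faithfulness degrades.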

8. Does a Global Perspective Help Prune Sparse MoEs Elegantly?

Authors: Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06542v1

Score: 41.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

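Scoring rationale: The paper's core topic is pruning sparse Mixture-of-Experts (MoE) models, so it is highly relevant to "Mixture of Experts" (15 points) and directly concerns LLM efficiency (10 points). It mentions "Empirical scaling laws", which relates to "Scaling Laws" (8 points). Pruning is a form of model compression, which relates to "Quantization" (8 points). Other keywords, such as SLMs, training methods, reasoning techniques, alignment, RAG, and agents, are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the high memory consumption of sparse Mixture-of-Experts (MoE) models, the paper proposes GRAPE, a global redundancy-aware pruning strategy that allocates pruning budgets dynamically. Under the same budget, GRAPE improves average accuracy over local baselines by 1.40% on average across several MoE models.

Abstract (Chinese translation)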

语言模型的实证缩放定律推动了大型语言模型（LLM）规模的持续扩大，尽管其计算和内存成本不断增长。稀疏专家混合模型（Sparse Mixture-of-Experts, MoE）提供了一种前景广阔的替代方案，它仅在每次前向传播中激活一部分专家，从而在不牺牲性能的前提下提升了效率。然而，大量的专家参数仍导致显著的内存消耗。
现有的剪枝方法通常在各层间均匀分配预算，忽视了稀疏MoE中产生的异构冗余。我们提出了GRAPE（面向专家的全局冗余感知剪枝），这是一种全局剪枝策略，能够基于跨层冗余动态分配剪枝预算。在Mixtral-8x7B、Mixtral-8x22B、DeepSeek-MoE、Qwen-MoE和GPT-OSS上的实验表明，在相同剪枝预算下，GRAPE始终能取得最佳的平均性能。在论文报告的三个主要模型上，相较于最强的局部基线方法，GRAPE在不同剪枝设置下的平均准确率平均提升了1.40%，最高提升可达2.45%。

Abstract

Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.

Keywords: Sparse Mixture-of-Experts, MoE pruning, Global pruning strategy, Model compression, Memory efficiency, Cross-layer redundancy, GRAPE, Large language models
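
The contrast drawn in the GRAPE abstract between uniform per-layer budgets and a budget allocated globally from cross-layer redundancy can be illustrated with a toy allocator. This is only a sketch under stated assumptions: the per-expert redundancy scores are hypothetical inputs, the greedy global ranking is a stand-in for whatever criterion the paper actually uses, and `min_keep` is an invented safeguard.

```python
from typing import List


def uniform_budget(redundancy: List[List[float]], total: int) -> List[int]:
    """Split the pruning budget evenly across layers (the local baseline)."""
    n_layers = len(redundancy)
    base, extra = divmod(total, n_layers)
    return [base + (1 if i < extra else 0) for i in range(n_layers)]


def global_budget(
    redundancy: List[List[float]], total: int, min_keep: int = 1
) -> List[int]:
    """Greedily prune the most redundant experts across all layers.

    redundancy[layer][expert] is a hypothetical redundancy score
    (higher means more redundant). Each layer keeps at least `min_keep` experts.
    """
    ranked = sorted(
        ((score, layer) for layer, scores in enumerate(redundancy) for score in scores),
        reverse=True,
    )
    pruned = [0] * len(redundancy)
    remaining = total
    for _score, layer in ranked:
        if remaining == 0:
            break
        if len(redundancy[layer]) - pruned[layer] > min_keep:
            pruned[layer] += 1
            remaining -= 1
    return pruned
```

With heterogeneous scores, the global allocator removes more experts from highly redundant layers and fewer from layers whose experts are all distinct, while the uniform baseline prunes every layer equally.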

9. A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

Authors: Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07274v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

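Scoring rationale: The paper's core topic is retrieval-augmented generation (RAG) for medical question answering, so it is highly relevant to "Retrieval-Augmented Generation" (15 points). It explicitly uses LLMs for medical question answering, which is highly relevant to "Large Language Models" (10 points). The work applies AI to a scientific (medical) domain, which is highly relevant to "AI for Science" (10 points). The paper notes that RAG addresses knowledge gaps and factual grounding, which indirectly touches on hallucination mitigation, so it has some relevance to "Hallucination Mitigation" (5 points). Other keywords, such as MoE, SLMs, training techniques, reasoning methods, and agent systems, are not mentioned in the abstract and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper systematically studies retrieval-augmented generation (RAG) for medical question answering. Across 40 evaluated configurations, retrieval augmentation significantly improves zero-shot medical QA performance, the best configuration reaches 60.49% accuracy, and the results expose a tradeoff between retrieval effectiveness and computational cost.

Abstract (Chinese translation)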

大语言模型（LLM）在医学问答任务中展现出强大能力；然而，纯参数化模型常受限于知识缺口与事实依据不足。检索增强生成（RAG）通过将外部知识检索整合至推理过程，有效应对了这一局限。尽管基于RAG的医学系统日益受到关注，但各检索组件对系统性能的具体影响仍未得到充分理解。本研究基于MedQA USMLE基准与结构化教科书知识库，对检索增强型医学问答进行了系统性评估。我们在包含四十种配置的统一实验框架内，分析了语言模型、嵌入模型、检索策略、查询重构与交叉编码器重排序之间的交互作用。结果表明，检索增强显著提升了零样本医学问答性能。最佳配置方案——结合查询重构与重排序的稠密检索——实现了60.49%的准确率。研究还发现，领域专用语言模型比通用模型能更有效地利用检索到的医学证据。进一步分析揭示了检索效能与计算成本间的明确权衡：较简单的稠密检索配置在保持高吞吐量的同时，仍能提供强劲性能。所有实验均在单张消费级GPU上完成，证明检索增强型医学问答系统的系统性评估可在适度计算资源下实现。

Abstract

Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration, dense retrieval with query reformulation and reranking, achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.

Keywords: Retrieval-Augmented Generation, Medical Question Answering, Large Language Models, Retrieval Pipeline, MedQA, Zero-shot Learning, Dense Retrieval, Query Reformulation
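
The pipeline stages named in the abstract (retrieval, reranking, then prompting the language model with the retrieved evidence) can be laid out in a few lines. The sketch below is a toy stand-in rather than the paper's system: the bag-of-words `embed` function replaces a real dense encoder, the `scorer` argument replaces a cross-encoder, and every function name here is an assumption for illustration. Query reformulation is omitted.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a dense embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """First stage: rank every passage by similarity to the query, keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]


def rerank(
    query: str, docs: List[str], scorer: Callable[[str, str], float], top_n: int
) -> List[str]:
    """Second stage: rescore the candidates with a (hypothetical) pairwise scorer."""
    return sorted(docs, key=lambda doc: scorer(query, doc), reverse=True)[:top_n]


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble the zero-shot prompt that would be sent to the language model."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def overlap_scorer(query: str, doc: str) -> float:
    """Toy reranker: number of distinct shared tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))
```

The two-stage design reflects the cost tradeoff the abstract mentions: the cheap first stage scans the whole corpus, and the more expensive scorer only sees the `k` survivors.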

10. Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Mode

Authors: Marshall Brett Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06767v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	10.0/10	10.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

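Scoring rationale: The paper's core topic is the representation geometry of a large language model (Qwen3.5-4B-Base), including its Voronoi tessellation, validation of the linear scaling law of the expressibility gap, and geometric reorganization through margin refinement procedures. It is therefore highly relevant to "Large Language Models" and "Scaling Laws" (10 points each). The analysis of internal representations and geometric structure is highly relevant to "Mechanistic Interpretability" (10 points). The paper mentions bfloat16 quantization artifacts, which gives some relevance to "Quantization" (5 points). It also concerns the alignment between internal representations and final outputs, which gives some relevance to "Alignment" (5 points). Other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for Science, are not covered in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the geometric properties of the Voronoi tessellation in the latent semantic manifold of a large language model (Qwen3.5-4B-Base), validates the linear scaling law of the expressibility gap, and proposes geometric reorganization through margin refinement procedures that can improve the model's representations without retraining.

Abstract (Chinese translation)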

语言模型在离散的标记上操作，但在连续的向量空间中计算，从而在表示流形上诱导出沃罗诺伊镶嵌结构。本研究基于Qwen3.5-4B-Base模型对此镶嵌结构进行了实证分析，并做出两项贡献。首先，通过采用float32边界重计算以消除bfloat16量化伪影，我们验证了Mabrok（2026）提出的表达能力间隙线性缩放定律（$R^2$ = 0.9997），这是迄今为止最强有力的证实；同时，我们发现了一个中间层的几何模糊区域（第24-28层，$ρ$ = -0.29），其中边界几何结构与交叉熵呈负相关，直至最终层才结晶为对齐状态（$ρ$ = 0.836）。
其次，我们证明收敛模型的沃罗诺伊镶嵌结构可通过边界优化程序进行重塑：这是一种无需重新训练的简短事后优化过程，旨在拓宽标记决策边界。我们在剂量反应扫描中比较了直接边界最大化与费舍尔信息距离最大化两种方法。两种方法均发现，在每评估256K个位置中，可修正位置的上限约为16,300个，但其关键差异在于附带损害。边界最大化的损害随干预强度增加而加剧，直至修正效果被淹没；而费舍尔方法在验证范围（$λ$ = 0.15-0.6）内损害保持恒定（约5,300个位置），在$λ$ = 0.6时实现中值边界提升+28%，且下游基准测试结果保持不变——这是一种压缩表达能力间隙同时保留其缩放定律的几何重组。然而，频率与标记类别审计显示，增益集中于高频结构标记（$λ$ = 0.6时净修正量的84%），而内容类及实体类标记的贡献随$λ$升高而缩减。因此，费舍尔边界优化程序是一种可行的几何精修工具，其实际上限并非由总体损害决定，而是取决于标记级收益的均匀性。

Abstract

Language models operate on discrete tokens but compute in continuous vector spaces, inducing a Voronoi tessellation over the representation manifold. We study this tessellation empirically on Qwen3.5-4B-Base, making two contributions. First, using float32 margin recomputation to resolve bfloat16 quantization artifacts, we validate Mabrok’s (2026) linear scaling law of the expressibility gap with $R^2$ = 0.9997 - the strongest confirmation to date - and identify a mid-layer geometric ambiguity regime where margin geometry is anti-correlated with cross-entropy (layers 24-28, $ρ$ = -0.29) before crystallizing into alignment at the final layer ($ρ$ = 0.836). Second, we show that the Voronoi tessellation of a converged model is reshapable through margin refinement procedures (MRP): short post-hoc optimization runs that widen token-decision margins without retraining. We compare direct margin maximization against Fisher information distance maximization across a dose-response sweep. Both methods find the same ceiling of ~16,300 correctable positions per 256K evaluated, but differ critically in collateral damage. Margin maximization damage escalates with intervention strength until corrections are overwhelmed. Fisher damage remains constant at ~5,300 positions across the validated range ($λ$ = 0.15-0.6), achieving +28% median margin improvement at $λ$ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the expressibility gap while preserving its scaling law. However, frequency and token-class audits reveal that gains concentrate in high-frequency structural tokens (84% of net corrections at $λ$ = 0.6), with content and entity-like contributions shrinking at higher $λ$. Fisher MRP is therefore a viable geometric polishing tool whose practical ceiling is set not by aggregate damage but by the uniformity of token-level benefit.

Keywords: Large Language Models, Voronoi tessellation, latent semantic manifolds, scaling laws, expressibility gap, margin refinement, geometric properties, model interpretability
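
One way to make the "token-decision margin" in the abstract concrete is as the distance from a hidden state to the nearest decision boundary of a linear unembedding layer. The sketch below assumes exactly that geometric definition, which may differ from the one used in the paper; `decision_margin` and its arguments are illustrative names, and the bias term and any final normalization layer are ignored.

```python
import math
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def decision_margin(
    hidden: Sequence[float], unembed: List[Sequence[float]]
) -> Tuple[int, float]:
    """Return (predicted token id, distance to the nearest decision boundary).

    The boundary between tokens i and j is the hyperplane (w_i - w_j) . h = 0,
    so the distance from h to it is (logit_i - logit_j) / ||w_i - w_j||.
    """
    logits = [dot(w, hidden) for w in unembed]
    top = max(range(len(logits)), key=logits.__getitem__)
    margin = math.inf
    for j, w in enumerate(unembed):
        if j == top:
            continue
        diff = [a - b for a, b in zip(unembed[top], w)]
        norm = math.sqrt(dot(diff, diff))
        if norm == 0.0:
            return top, 0.0  # duplicate rows cannot be separated
        margin = min(margin, (logits[top] - logits[j]) / norm)
    return top, margin
```

Small margins mean the hidden state sits close to a cell wall of the tessellation, which is also where low-precision arithmetic is most likely to flip the predicted token. That may be why the abstract recomputes margins in float32.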

11. StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

Authors: Zhirui Chen, Peiyang Liu, Ling Shao Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06746v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

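Scoring rationale: The paper's core topic is KV cache compression for long-context LLM inference, so it is highly relevant to "Large Language Models", "Context Window Extension", and "KV Cache Compression" (10 points each). The proposed StructKV framework is a model compression and inference acceleration technique, which gives some relevance to "Quantization" and "Speculative Decoding" (5 points each). Other keywords, such as MoE, SFT, RAG, and AI for Science, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the memory bottleneck caused by linear KV cache growth in long-context LLM inference, the paper proposes StructKV, a structure-aware compression framework that uses a global centrality metric and dynamic layer detection to preserve long-range dependencies and retrieval robustness.

Abstract (Chinese translation)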

随着大语言模型（LLM）扩展至支持超过百万令牌的上下文窗口，键值（KV）缓存的线性增长带来了严重的内存容量与带宽瓶颈，制约了长上下文推理的效率。现有的压缩方法通常基于局部显著性度量对令牌进行优先级排序，以将预填充计算与解码内存解耦。然而，这些方法往往依赖于特定网络层的局部显著性快照，从而系统性地丢弃了那些在整个网络深度中充当全局信息枢纽、但在选定进行剪枝的特定层上暂时处于休眠状态的令牌。为解决这一局限，我们提出了StructKV，一种结构感知的KV缓存压缩框架，其引入了三项核心创新：首先，全局入度中心性通过聚合网络深度上的注意力模式来识别全局信息枢纽。其次，动态枢纽检测利用信息论度量自适应地定位最佳压缩层。最后，结构传播与解耦将计算预算与内存存储预算分离。在LongBench和RULER基准测试上的实验结果表明，StructKV能有效保持长程依赖性与检索鲁棒性。

Abstract

As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.

Keywords: Large Language Models, KV Cache Compression, Long Context Inference, Memory Bottleneck, Structural Skeleton, Global Information Hubs, Attention Patterns, Inference Efficiency
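
The idea behind Global In-Degree Centrality, as the abstract describes it, is to score each token by the attention it receives across the whole network depth rather than at one layer. A minimal sketch of that aggregation follows. The plain unweighted sum over layers and queries, the function names, and the toy attention maps are assumptions; attention heads and causal masking are left out for brevity.

```python
from typing import List

Attention = List[List[List[float]]]  # attn[layer][query][key]


def global_in_degree(attn: Attention) -> List[float]:
    """Total attention mass each key token receives, summed over layers and queries."""
    n_tokens = len(attn[0][0])
    scores = [0.0] * n_tokens
    for layer in attn:
        for row in layer:
            for key, weight in enumerate(row):
                scores[key] += weight
    return scores


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in positional order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Token 1 is a hub in layer 0 but nearly dormant in layer 1.
toy_attn: Attention = [
    [[0.2, 0.7, 0.1], [0.2, 0.7, 0.1], [0.2, 0.7, 0.1]],
    [[0.5, 0.0, 0.5], [0.4, 0.1, 0.5], [0.5, 0.0, 0.5]],
]
```

Scoring on layer 1 alone would evict token 1, while the depth-aggregated score keeps it. This is the failure mode of single-layer saliency snapshots that the abstract points out.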

12. Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing

Authors: Ning Yang, Chuangxin Cheng, Haijun Zhang Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07148v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	10.0/10	10.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

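Scoring rationale: The paper proposes the COMLLM framework, which applies LLMs to task offloading decisions in Mobile Edge Computing (MEC). The core relevance is: 1) it explicitly uses LLMs (highly relevant); 2) it uses Supervised Fine-Tuning (SFT) as a baseline (highly relevant); 3) it proposes multi-step Monte Carlo rollouts for look-ahead decisions, which is highly relevant to the concept of "Monte Carlo Tree Search" (MCTS); 4) the framework performs multi-step reasoning, which is highly relevant to "Chain of Thought" (CoT). Other keywords, such as MoE, SLMs, Scaling Laws, and RAG, are not mentioned in the abstract or are unrelated to the paper's topic.

!!! tip deepseek-chat TL;DR

To address the difficulty of designing dynamic task offloading policies in mobile edge computing, the paper proposes the COMLLM framework, which integrates GRPO with a Look-Ahead Collaborative Simulation mechanism to achieve near-optimal latency and zero-shot topological scalability, outperforming SFT, DRL, and heuristic baselines.

Abstract (Chinese translation)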

新兴计算密集型应用对资源受限的移动设备提出了严格的延迟要求。移动边缘计算（Mobile Edge Computing, MEC）通过任务卸载应对这一挑战。然而，由于动态任务到达、时变信道以及服务器队列的时空耦合特性，设计高效策略仍然困难。传统启发式方法缺乏适应性，而深度强化学习（Deep Reinforcement Learning, DRL）存在泛化能力有限和架构僵化的问题，当网络拓扑变化时需要重新训练。尽管大语言模型（Large Language Models, LLMs）具备语义推理能力，但标准的监督微调（Supervised Fine-Tuning, SFT）会产生短视策略，仅贪婪地最小化即时延迟，而未考虑系统的长期演化。为克服这些局限，我们提出了COMLLM，一个支持在MEC系统中进行前瞻性决策的生成式框架。COMLLM将组相对策略优化（Group Relative Policy Optimization, GRPO）与前瞻协同仿真（Look-Ahead Collaborative Simulation, LACS）机制相结合，该机制在执行多步蒙特卡洛推演的同时联合建模服务器队列动态。通过将这些推演纳入奖励设计，框架能够捕捉当前决策对未来系统状态的长期影响。实验结果表明，COMLLM实现了接近最优的延迟并提升了负载均衡的公平性。值得注意的是，该框架展现出零样本拓扑可扩展性，使得在小型网络上训练的模型能够无需重新训练即可泛化到更大、未见过的拓扑结构，其性能优于SFT、DRL及启发式基线方法。

Abstract

Emerging computation-intensive applications impose stringent latency requirements on resource-constrained mobile devices. Mobile Edge Computing (MEC) addresses this challenge through task offloading. However, designing effective policies remains difficult due to dynamic task arrivals, time-varying channels, and the spatio-temporal coupling of server queues. Conventional heuristics lack adaptability, while Deep Reinforcement Learning (DRL) suffers from limited generalization and architectural rigidity, requiring retraining when network topology changes. Although Large Language Models (LLMs) offer semantic reasoning capabilities, standard Supervised Fine-Tuning (SFT) yields myopic policies that greedily minimize immediate latency without accounting for long-term system evolution. To address these limitations, we propose COMLLM, a generative framework that enables foresighted decision-making in MEC systems. COMLLM integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism, which performs multi-step Monte Carlo rollouts while jointly modeling server queue dynamics. By incorporating these rollouts into the reward design, the framework captures the long-term impact of current decisions on future system states. Experimental results demonstrate that COMLLM achieves near-optimal latency and improved load-balancing fairness. Notably, it exhibits zero-shot topological scalability, allowing a model trained on small-scale networks to generalize to larger, unseen topologies without retraining, outperforming SFT, DRL, and heuristic baselines.

Keywords: Large Language Models, Mobile Edge Computing, Task Offloading, Monte Carlo Rollouts, Supervised Fine-Tuning, Multi-step Reasoning, Zero-shot Generalization, Latency Optimization
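
The contrast in the abstract between a myopic policy (minimize immediate latency) and a look-ahead estimate built from multi-step Monte Carlo rollouts over server queues can be sketched with a toy simulator. Everything below is an assumption made for illustration, not the LACS mechanism itself: the queue model, the arrival distribution, the greedy default policy used inside rollouts, and all names. A reward for policy optimization could then be taken as the negative of this cost.

```python
import random
from typing import List


def immediate_latency(
    queues: List[float], rates: List[float], action: int, task_size: float
) -> float:
    """Latency of the current task if sent to server `action` (queueing + service)."""
    return (queues[action] + task_size) / rates[action]


def lookahead_cost(
    queues: List[float],
    rates: List[float],
    action: int,
    task_size: float,
    horizon: int,
    n_rollouts: int,
    rng: random.Random,
) -> float:
    """Immediate latency plus the average simulated latency of future arrivals."""
    first = immediate_latency(queues, rates, action, task_size)
    if horizon == 0 or n_rollouts == 0:
        return first  # reduces to the myopic objective
    total_future = 0.0
    for _ in range(n_rollouts):
        q = list(queues)
        q[action] += task_size
        for _ in range(horizon):
            # Each server drains one time step of work.
            q = [max(0.0, backlog - rate) for backlog, rate in zip(q, rates)]
            size = rng.uniform(0.5, 1.5)  # hypothetical future task size
            # Default rollout policy: send the task to the fastest-finishing server.
            target = min(range(len(q)), key=lambda i: (q[i] + size) / rates[i])
            total_future += (q[target] + size) / rates[target]
            q[target] += size
    return first + total_future / n_rollouts
```

Comparing `lookahead_cost` across candidate actions penalizes choices that look cheap now but leave the queues in a state that slows later tasks.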

13. An empirical study of LoRA-based fine-tuning of large language models for automated test case genera

Authors: Milad Moradi, Ke Yan, David Colwell, Rhona Asgari Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06946v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	15.0/10	15.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

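Scoring rationale: The paper's core topic is applying LoRA (parameter-efficient fine-tuning) to large language models (LLMs) for automated test case generation. It is therefore highly relevant to "PEFT/LoRA" (15 points) and directly relevant to "LLMs" and "Supervised Fine-tuning" (10 points each). Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and RLHF, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

Through an empirical study, the paper shows that parameter-efficient fine-tuning of large language models with LoRA significantly improves automated test case generation from natural language requirements and brings open-source models close to the level of proprietary models.

Abstract (Chinese translation)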

基于自然语言需求自动生成测试用例仍是软件工程中的一个挑战性问题，这源于需求的模糊性以及需要生成结构化、可执行的测试制品。大语言模型的最新进展为解决该任务带来了希望；然而，其有效性取决于针对特定任务的适应性和高效的微调策略。本文对使用参数高效微调方法（特别是LoRA）进行基于需求的测试用例生成展开了全面的实证研究。我们在统一的实验流程下评估了包括开源模型和专有模型在内的多个大语言模型家族。该研究系统性地探讨了LoRA关键超参数（包括秩、缩放因子和丢弃率）对下游性能的影响。我们提出了一个基于GPT-4o的自动化评估框架，从九个质量维度对生成的测试用例进行评估。实验结果表明，基于LoRA的微调显著提升了所有开源模型的性能，其中Ministral-8B模型取得了最佳效果。此外，我们发现经过微调的8B开源模型可以达到与未经微调的GPT-4.1模型相当的性能，这凸显了参数高效适应方法的有效性。虽然GPT-4.1模型取得了最高的整体性能，但微调后专有模型与开源模型之间的性能差距显著缩小。这些发现为自动化测试生成的模型选择、微调策略和评估方法提供了重要见解。特别地，研究证明，结合精心设计的微调方法，具有成本效益、可本地部署的开源模型能够成为专有系统的可行替代方案。

Abstract

Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.

Keywords: LoRA, parameter-efficient fine-tuning, large language models, automated test case generation, empirical study, open-source models, GPT-4, evaluation framework
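The three LoRA hyperparameters named in the abstract above (rank, scaling factor, dropout) can be made concrete with a small sketch. The code below is an illustrative, dependency-free rendering of the standard LoRA update, y = W x + (alpha / r) · B A x, with dropout applied to the adapter input during training. It is not the paper's training code; the dimensions, values, and function names are assumptions chosen for readability.

```python
import random

def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A is rank x d_in, B is d_out x rank; the frozen weight W is not counted.
    return rank * d_in + d_out * rank

def matvec(matrix, vector):
    """Plain matrix-vector product for lists of lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha: float, dropout: float = 0.0,
                 training: bool = False, rng=None):
    """Compute W x + (alpha / r) * B (A dropout(x)), where r = number of rows of A."""
    rank = len(A)
    if training and dropout > 0.0:
        rng = rng or random
        keep = 1.0 - dropout
        # Inverted dropout: zero some adapter inputs and rescale the rest.
        x_adapter = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    else:
        x_adapter = x
    base = matvec(W, x)                        # frozen pretrained path
    update = matvec(B, matvec(A, x_adapter))   # low-rank adapter path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical 4096 x 4096 projection adapted with rank 16
full = 4096 * 4096
added = lora_trainable_params(4096, 4096, 16)
print(added, f"{added / full:.2%}")  # prints: 131072 0.78%
```

In this toy setting the adapter trains under 1% of the parameters of the matrix it modifies, which is the property that makes rank the main cost knob; the scaling factor and dropout change how strongly and how noisily the adapter path contributes.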

14. SentinelSphere: Integrating AI-Powered Real-Time Threat Detection with Cybersecurity Awareness Train

Authors: Nikolaos D. Tantaroudas, Ilias Karachalios, Andrew J. McCracken | Source: arXiv | Published: 2026-04-08 | arXiv link: http://arxiv.org/abs/2604.06900v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

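Scoring rationale: The paper centers on applying LLMs to cybersecurity and explicitly describes LLM-powered security training, so "Large Language Models" scores 10. It uses a quantized Phi-4 model, which is highly relevant to "Quantization" (10). The model is fine-tuned for the cybersecurity domain, which relates partially to "Pre-training" and "Post-training" (5 each). Deploying a quantized model on commodity hardware relates partially to "Small Language Models" (5). Other keywords, such as MoE, Scaling Laws, RAG, and RLHF, are not addressed and score 0.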
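The score and pass mark above follow from the report's configuration: each keyword contributes weight × relevance, and the pass mark is 5.0 + 0.8 × total weight. Below is a minimal sketch of that arithmetic using only the non-zero rows of the SentinelSphere table. The function names and shortened keyword labels are illustrative. Two details are assumptions, because the report does not spell them out: that the per-keyword maximum of 15 acts as a cap on each contribution, and that a paper passes when its score is at least the pass mark.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report's formula: 5.0 + 0.8 x total weight."""
    return round(base + factor * total_weight, 1)

def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum weight x relevance over keywords.

    Capping each contribution at max_per_keyword is an assumed reading
    of the report's per-keyword maximum setting.
    """
    return sum(min(weight * relevance, max_per_keyword)
               for _, weight, relevance in rows)

# Non-zero rows from the SentinelSphere table: (keyword, weight, relevance out of 10)
sentinelsphere = [
    ("Large Language Models", 1.0, 10.0),
    ("Small Language Models", 1.0, 5.0),
    ("Pre-training", 1.0, 5.0),
    ("Post-training", 1.0, 5.0),
    ("Quantization", 1.0, 10.0),
]

threshold = pass_mark(total_weight=27.0)  # 27 keywords, each with weight 1.0
score = paper_score(sentinelsphere)
print(score, threshold, score >= threshold)  # prints: 35.0 26.6 True
```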

!!! tip deepseek-chat TL;DR

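To address the cybersecurity skills shortage and human-factor vulnerabilities, this study presents SentinelSphere, a platform that integrates deep-learning-based threat detection with LLM-driven security training. Experiments show high detection accuracy with fewer false positives than baseline models, and validation workshops confirm that the educational component is usable by non-technical users.

Abstract translation (Chinese)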

网络安全领域面临两大相互关联的挑战：全球范围内合格从业人员的短缺，以及持续存在的人为因素弱点——后者是大多数安全事件的主要原因。为应对这些问题，我们提出了SentinelSphere，这是一个由人工智能驱动的平台，它将基于机器学习的威胁识别与大型语言模型（LLM）驱动的安全培训相结合。其检测模块采用一个增强型深度神经网络（DNN），该网络在CIC-IDS2017和CIC-DDoS2019基准数据集上训练，并辅以新颖的HTTP层特征工程，以捕捉应用层攻击特征。在教育组件方面，我们部署了Phi-4模型的量化变体（Q4_K_M），该模型针对网络安全领域进行了微调，使其能够在仅需16 GB RAM、无需专用GPU资源的商用硬件上部署。实验结果表明，相较于基线模型，增强型DNN在保持高检测准确率的同时，显著降低了误报率，并且在DDoS、暴力破解及基于网络的漏洞利用等关键攻击类别上保持了强大的召回率。涉及行业专业人士和大学生的验证研讨会证实，其“交通灯”可视化系统和对话式AI助手对于非技术背景用户而言既直观又有效。SentinelSphere表明，将智能威胁检测与自适应的、LLM驱动的安全教育相结合，能够在一个统一、连贯的框架内有意义地应对技术和人为因素两方面的网络安全漏洞。

摘要 (Abstract)

The field of cybersecurity is confronted with two interrelated challenges: a worldwide deficit of qualified practitioners and ongoing human-factor weaknesses that account for the bulk of security incidents. To tackle these issues, we present SentinelSphere, a platform driven by artificial intelligence that unifies machine learning-based threat identification with security training powered by a Large Language Model (LLM). The detection module uses an Enhanced Deep Neural Network (DNN) trained on the CIC-IDS2017 and CIC-DDoS2019 benchmark datasets, enriched with novel HTTP-layer feature engineering that captures application level attack signatures. For the educational component, we deploy a quantised variant of Phi-4 model (Q4_K_M), fine-tuned for the cybersecurity domain, enabling deployment on commodity hardware requiring only 16 GB of RAM without dedicated GPU resources. Experimental results show that the Enhanced DNN attains high detection accuracy while substantially lowering false positives relative to baseline models, and maintains strong recall across critical attack categories such as DDoS, brute force, and web-based exploits. Validation workshops involving industry professionals and university students confirmed that the Traffic Light visualisation system and conversational AI assistant are both intuitive and effective for users without technical backgrounds. SentinelSphere illustrates that coupling intelligent threat detection with adaptive, LLM-driven security education can meaningfully address both technical and human-factor cybersecurity vulnerabilities within a single, cohesive framework.

Keywords: cybersecurity, threat detection, Large Language Model, quantization, fine-tuning, DNN, Phi-4, security training
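The detection results above are stated in terms of false positives and recall. As a reminder of how those quantities are defined, here is a small sketch that computes them from confusion-matrix counts. The counts are made-up placeholders for illustration, not results from the paper.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, false positive rate, precision and accuracy from confusion counts."""
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Placeholder counts for one attack class (illustrative only)
baseline = detection_metrics(tp=90, fp=40, tn=860, fn=10)
enhanced = detection_metrics(tp=92, fp=10, tn=890, fn=8)

for name, metrics in (("baseline", baseline), ("enhanced", enhanced)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Reporting recall per attack category alongside the false positive rate matters because benign traffic usually dominates intrusion datasets, so overall accuracy alone can hide a detector that rarely flags attacks.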

15. Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

Authors: Zonghuan Xu, Xiang Zheng, Yutao Wu, Xingjun Ma | Source: arXiv | Published: 2026-04-08 | arXiv link: http://arxiv.org/abs/2604.06820v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

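Scoring rationale: The paper studies risk evaluation of LLM-generated disinformation, which directly matches the LLM keyword (10). It evaluates how well LLM judges, used as evaluation proxies, agree with human readers, which relates to "Alignment" (5) and "Explainable AI" (5). Disinformation risk is indirectly related to "Hallucination Mitigation" (8) through factuality and truthfulness. Other keywords, such as MoE, Scaling Laws, RLHF, and RAG, are not addressed in the paper and score 0.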

!!! tip deepseek-chat TL;DR

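This paper examines how well LLM judges, used as evaluation proxies for the risk of LLM-generated disinformation, agree with human reader responses. It finds that LLM judges agree strongly with one another but diverge persistently from human readers, so agreement among judges is not evidence that they are valid proxies for reader response.

Abstract translation (Chinese)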

大型语言模型（LLM）能够大规模生成具有说服力的叙事，这引发了人们对其可能被用于虚假信息活动的担忧。评估这一风险最终需要理解读者如何接收此类内容。然而在实践中，LLM评判者正日益被用作直接人类评估的低成本替代品，尽管它们是否能够忠实反映读者反应尚不明确。我们将此情境下的评估重新界定为代理效度问题，并依据人类读者反应对LLM评判者进行审计。通过使用290篇对齐文章、2,043组配对的人类评分以及八个前沿评判模型的输出，我们从整体评分、项目级排序和信号依赖性三个方面检验了评判者与人类的一致性。我们发现评判者与人类之间始终存在差距。相较于人类，评判者通常更为严苛，仅能微弱地复现项目级的人类排序，并且依赖不同的文本信号：更看重逻辑严谨性，同时对情感强度的惩罚更重。与此同时，评判者彼此之间的一致程度远高于其与人类读者的一致程度。这些结果表明，LLM评判者形成了一个内部一致的评价群体，其内部一致程度远高于与人类读者的一致程度；因此，内部共识并不能证明其可以有效代理读者反应。

摘要 (Abstract)

Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though whether they faithfully track reader responses remains unclear. We recast evaluation in this setting as a proxy-validity problem and audit LLM judges against human reader responses. Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judges, we examine judge–human alignment in terms of overall scoring, item-level ordering, and signal dependence. We find persistent judge–human gaps throughout. Relative to humans, judges are typically harsher, recover item-level human rankings only weakly, and rely on different textual signals, placing more weight on logical rigour while penalizing emotional intensity more strongly. At the same time, judges agree far more with one another than with human readers. These results suggest that LLM judges form a coherent evaluative group that is much more aligned internally than it is with human readers, indicating that internal agreement is not evidence of validity as a proxy for reader response.

Keywords: Large Language Models, LLM-generated disinformation, human evaluation, proxy-validity, judge-human alignment, risk assessment, reader response, evaluative gap
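The abstract above compares judges and humans on overall scoring and on item-level ordering. One common way to quantify those two views is a mean score gap and a Spearman rank correlation. The abstract does not say which statistics the authors used, so the sketch below is an assumption, and the ratings in it are invented for illustration.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Invented per-article ratings (same five articles, two raters)
human = [5.1, 3.2, 6.0, 4.4, 2.8]
judge = [3.0, 2.9, 4.1, 2.5, 2.7]

print("mean gap (judge - human):", round(mean(judge) - mean(human), 2))
print("item-level Spearman rho:", round(spearman(human, judge), 2))
```

The two numbers answer different questions: a negative mean gap says the judge is harsher overall, while a low rank correlation says it orders individual articles differently from readers, and either can occur without the other.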

16. Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao | Source: arXiv | Published: 2026-04-08 | arXiv link: http://arxiv.org/abs/2604.07343v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

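Scoring rationale: The paper studies how reward models (RMs) perform in evaluating personalized alignment, which is core to both LLMs and Alignment (10 each). It is related to RLHF (8) because it covers PPO and reward model evaluation. Other keywords are not addressed (0).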

!!! tip deepseek-chat TL;DR

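This paper introduces Personalized RewardBench, a benchmark for evaluating how well reward models capture personalized user preferences. Existing state-of-the-art reward models struggle with personalization (peak accuracy 75.94%), and the benchmark correlates more strongly with downstream performance in Best-of-N sampling and PPO than existing baselines.

Abstract translation (Chinese)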

多元对齐已成为大型语言模型发展的关键前沿领域，其中奖励模型作为捕捉多样化人类价值观的核心机制。尽管针对通用响应质量的基准测试已较为普遍，但如何评估奖励模型对个体用户偏好的建模能力仍是一个开放挑战。为填补这一空白，我们提出了个性化奖励基准——一种旨在严格评估奖励模型建模个性化偏好能力的新型基准。我们基于对用户特定准则的严格遵守（或违反）构建了优选与拒选响应配对，确保偏好区分完全针对个体量身定制。特别值得注意的是，人工评估证实配对间的主要区分因素严格限于个人偏好，且两种响应均保持较高的通用质量（如正确性、相关性和帮助性）。广泛测试表明，现有最先进的奖励模型在个性化任务上表现显著不足，最高准确率仅为75.94%。关键的是，由于有效的奖励模型基准应能预测其在下游任务中的表现，我们通过实验证明：与现有基线相比，该基准在最佳N采样和近端策略优化两种下游任务中，与奖励模型性能的相关性均显著更高。这些发现确立了个性化奖励基准作为评估奖励模型下游应用性能的稳健且精确的代理标准。

摘要 (Abstract)

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models’ capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model’s performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models’ performance in downstream applications.

Keywords: Personalized RewardBench, reward models, pluralistic alignment, human preferences, benchmark evaluation, downstream performance, Best-of-N sampling, Proximal Policy Optimization
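Two quantities in the abstract above have simple operational definitions: a reward model's accuracy on chosen/rejected pairs, and Best-of-N sampling, which keeps the highest-reward candidate out of N. The sketch below shows both with a toy, hypothetical reward function (`toy_reward` simply rewards responses that contain a rubric keyword). It is not the benchmark's code or a real reward model, and the example pairs are invented.

```python
from typing import Callable, List, Tuple

Reward = Callable[[str, str], float]  # (user_rubric, response) -> scalar reward

def pairwise_accuracy(reward: Reward, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of (rubric, chosen, rejected) triples where chosen outscores rejected."""
    correct = sum(reward(rubric, chosen) > reward(rubric, rejected)
                  for rubric, chosen, rejected in pairs)
    return correct / len(pairs)

def best_of_n(reward: Reward, rubric: str, candidates: List[str]) -> str:
    """Best-of-N: return the candidate with the highest reward."""
    return max(candidates, key=lambda response: reward(rubric, response))

def toy_reward(rubric: str, response: str) -> float:
    """Hypothetical stand-in: 1.0 if the response contains the rubric keyword."""
    return 1.0 if rubric.lower() in response.lower() else 0.0

# Invented (rubric, chosen, rejected) triples
pairs = [
    ("metric", "Use metric units: 2 km.", "It is about 1.2 miles."),
    ("concise", "A concise answer: yes.", "Well, there are many things to consider..."),
    ("formal", "Certainly. Please find the details below.", "Sure thing, buddy!"),
]

print(pairwise_accuracy(toy_reward, pairs))
print(best_of_n(toy_reward, "metric",
                ["About a mile.", "Roughly 1.6 km in metric terms."]))
```

The third pair is deliberately one the keyword heuristic cannot separate: both responses tie at zero, so the pair counts as an error. That mirrors the benchmark's premise that both responses are of good general quality and differ only in fit to the individual user.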

📋 All papers

1. ✅ Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

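The paper proposes Argus, an LLM-based multi-agent framework that improves static application security testing (SAST) by integrating RAG with collaborative multi-agent workflows. It detects more true vulnerabilities while reducing false positives and operational costs.

Abstract translation (Chinese)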

近期，大型语言模型（LLM）的进展引发了其在静态应用安全测试（SAST）中应用的广泛关注，这主要归因于其相较于传统符号化或基于规则的方法具有更优越的上下文推理能力。然而，现有的基于LLM的方法通常试图直接替代人类专家，而未与现有SAST工具进行有效整合。这种整合的缺失导致了诸多缺陷，包括高误报率、幻觉问题、有限的推理深度以及过高的令牌使用量，使其难以在工业场景中实际部署。为克服这些局限，我们提出了一种范式转变，将SAST工作流从当前LLM辅助的结构重新编排为一种以LLM为核心的新型工作流。我们引入了Argus（智能体化与检索增强的防护系统），这是首个专为漏洞检测设计的多智能体框架。Argus包含三项关键创新：全面的软件供应链分析、协作式多智能体工作流，以及整合了检索增强生成（RAG）和ReAct等前沿技术，以最大限度减少幻觉并增强推理能力。大量实证评估表明，Argus在检测出更多真实漏洞的同时，显著降低了误报率和运行成本，其性能明显优于现有方法。值得注意的是，Argus已成功识别出多个已分配CVE编号的关键零日漏洞。

摘要 (Abstract)

Recent advancements in Large Language Models (LLMs) have sparked interest in their application to Static Application Security Testing (SAST), primarily due to their superior contextual reasoning capabilities compared to traditional symbolic or rule-based methods. However, existing LLM-based approaches typically attempt to replace human experts directly without integrating effectively with existing SAST tools. This lack of integration results in ineffectiveness, including high rates of false positives, hallucinations, limited reasoning depth, and excessive token usage, making them impractical for industrial deployment. To overcome these limitations, we present a paradigm shift that reorchestrates the SAST workflow from current LLM-assisted structure to a new LLM-centered workflow. We introduce Argus (Agentic and Retrieval-Augmented Guarding System), the first multi-agent framework designed specifically for vulnerability detection. Argus incorporates three key novelties: comprehensive supply chain analysis, collaborative multi-agent workflows, and the integration of state-of-the-art techniques such as Retrieval-Augmented Generation (RAG) and ReAct to minimize hallucinations and enhance reasoning. Extensive empirical evaluation demonstrates that Argus significantly outperforms existing methods by detecting a higher volume of true vulnerabilities while simultaneously reducing false positives and operational costs. Notably, Argus has identified several critical zero-day vulnerabilities with CVE assignments.

2. ✅ Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

Authors: Eduard Frankford, Erik Cikalleshi, Ruth Breu | Source: arXiv | Published: 2026-04-08 | arXiv link: http://arxiv.org/abs/2604.07304v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	5.0/10	5.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

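This paper studies how LLMs and conversational agents can assess students' code understanding in automated programming assessment systems. It proposes a Hybrid Socratic Framework that combines deterministic code analysis with a dual-agent conversational layer, addressing the problem that students can submit LLM-generated code they do not understand.

Abstract translation (Chinese)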

大型语言模型（LLM）对传统的自动化编程评估提出了挑战，因为学生现在能够生成功能正确的代码，却未必展示出相应的理解能力。本文作出两项贡献。首先，报告了一项基于饱和度的范围综述，聚焦于编程教育中的对话式评估方法。该综述识别出三种主流的架构类型：基于规则或模板驱动的系统、基于LLM的系统以及混合系统。纵观现有文献，对话智能体在提供可扩展的反馈和深入探查代码理解方面展现出潜力，但仍存在重要局限，包括幻觉问题、过度依赖、隐私安全、学术诚信以及部署限制。其次，本文将这些发现综合成一个混合苏格拉底框架，用于将对话式验证整合到自动化编程评估系统（APAS）中。该框架结合了确定性代码分析与双智能体对话层、知识追踪、支架式提问机制，以及将提示与运行时事实相绑定的防护措施。本文还讨论了针对LLM生成解释的实践性保障策略，包括监考部署模式、随机化的跟踪问题、与具体执行状态绑定的逐步推理，以及适用于隐私敏感环境的本地模型部署选项。该框架并非旨在取代传统测试，而是作为一种补充层，用于验证学生是否理解其提交的代码。

摘要 (Abstract)

Large Language Models (LLMs) challenge conventional automated programming assessment because students can now produce functionally correct code without demonstrating corresponding understanding. This paper makes two contributions. First, it reports a saturation-based scoping review of conversational assessment approaches in programming education. The review identifies three dominant architectural families: rule-based or template-driven systems, LLM-based systems, and hybrid systems. Across the literature, conversational agents appear promising for scalable feedback and deeper probing of code understanding, but important limitations remain around hallucinations, over-reliance, privacy, integrity, and deployment constraints. Second, the paper synthesizes these findings into a Hybrid Socratic Framework for integrating conversational verification into Automated Programming Assessment Systems (APASs). The framework combines deterministic code analysis with a dual-agent conversational layer, knowledge tracking, scaffolded questioning, and guardrails that tie prompts to runtime facts. The paper also discusses practical safeguards against LLM-generated explanations, including proctored deployment modes, randomized trace questions, stepwise reasoning tied to concrete execution states, and local-model deployment options for privacy-sensitive settings. Rather than replacing conventional testing, the framework is intended as a complementary layer for verifying whether students understand the code they submit.
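One safeguard the abstract lists is randomized trace questions tied to concrete execution states. A minimal sketch of that idea: run the submitted function on a randomly chosen input, record the ground-truth result by executing the code, and compare it with the student's predicted value. The function names and the sample submission are hypothetical, and a real APAS would sandbox the execution rather than call student code directly.

```python
import random

def make_trace_question(func, input_pool, rng):
    """Pick a random input, execute the code, and return (question, expected)."""
    arg = rng.choice(input_pool)
    expected = func(arg)  # ground truth comes from execution, not from an LLM
    question = f"What does {func.__name__}({arg!r}) return?"
    return question, expected

def grade_answer(student_answer, expected) -> bool:
    """Deterministic check of the student's prediction against the runtime fact."""
    return student_answer == expected

def count_vowels(text):
    """Sample student submission (hypothetical)."""
    return sum(ch in "aeiou" for ch in text.lower())

rng = random.Random(42)  # seeded only so the demo is repeatable
question, expected = make_trace_question(
    count_vowels, ["banana", "Socrates", "trace"], rng)
print(question)
print(grade_answer(expected, expected))
```

Because the expected value is computed at question time from a random input, a memorized or LLM-supplied explanation of the code in general does not answer the question; the student has to trace the specific execution.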

3. ✅ Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	10.0/10	10.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	10.0/10	10.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

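This study examines how effectively domain-adaptive pre-training (DAPT) specializes small to mid-sized LLMs for the French biomedical domain. It finds that DAPT is viable in smaller-scale, resource-constrained settings under the right conditions, and that model merging after DAPT is essential to mitigate the trade-off between specialized-task gains and general-capability degradation.

Abstract translation (Chinese)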

大型语言模型（LLM）已在多个领域展现出卓越能力，但其在专业领域——尤其是非英语语言环境中的适应仍具挑战性。本研究探讨了领域自适应预训练（DAPT）作为一种策略，通过持续预训练将中小型LLM专业化应用于法语生物医学领域。我们聚焦两个核心研究问题：专业化持续预训练对于领域适应的可行性，以及领域特定性能提升与通用能力退化之间的关系。本研究的贡献包括：发布一个完全开放许可、适用于商业和开源应用的法语生物医学语料库；训练并发布专业化的法语生物医学LLM；以及为DAPT实施提供新的见解。我们的方法涵盖高质量法语生物医学文本的收集与精炼、基于DAPT的因果语言建模方法探索，以及开展广泛的对比评估。与先前研究相反，我们的结果对DAPT的有效性提出了质疑，但同时指出在资源受限的小规模场景中，若条件得当，该策略仍具可行性。本文进一步发现，DAPT后的模型融合对于缓解泛化性能的权衡至关重要，在某些情况下甚至能提升DAPT所针对的专业任务表现。

摘要 (Abstract)

Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.

Keywords: Domain-adaptive pre-training, French biomedical domain, Large language models, Continued pre-training, Model merging, Domain adaptation, Biomedical corpus, Generalization trade-offs
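The abstract above credits model merging after DAPT with mitigating the loss of general capability, but it does not say which merging method was used. One of the simplest options is linear interpolation of weights between the base checkpoint and the domain-adapted checkpoint. The sketch below illustrates that idea on toy parameter dictionaries; the parameter names, values, and mixing coefficient are assumptions.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

def linear_merge(base: Params, adapted: Params, t: float) -> Params:
    """Interpolate parameter-wise: (1 - t) * base + t * adapted, with 0 <= t <= 1."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name in base:
        if len(base[name]) != len(adapted[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [(1.0 - t) * b + t * a
                        for b, a in zip(base[name], adapted[name])]
    return merged

# Toy checkpoints (illustrative values only)
base_model = {"layer0.weight": [0.0, 1.0], "layer0.bias": [0.5]}
dapt_model = {"layer0.weight": [1.0, 3.0], "layer0.bias": [0.0]}

print(linear_merge(base_model, dapt_model, t=0.5))
```

With t = 0 the result is the base model and with t = 1 it is the DAPT model, so t is a single knob for trading general capability against domain specialization in this simplified scheme.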

4. ✅ TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Authors: Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen | Source: arXiv | Published: 2026-04-08 | arXiv link: http://arxiv.org/abs/2604.07223v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

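This paper evaluates how effective LLM guardrails are on the multi-step tool-calling trajectories of autonomous agents. It finds that guardrail efficacy depends more on structured-data competence than on semantic safety alignment, and that model architecture affects risk detection performance more than model size.

Abstract translation (Chinese)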

随着大语言模型从静态聊天机器人演变为自主智能体，其主要脆弱性已从最终输出转向中间执行轨迹。尽管针对自然语言响应的安全护栏已有成熟的基准测试，但其在多步骤工具使用轨迹中的有效性仍未被充分探索。为填补这一空白，我们推出了TraceSafe-Bench——首个专门用于评估轨迹中安全性的综合基准。该基准涵盖12个风险类别，从安全威胁（如提示词注入、隐私泄露）到操作故障（如幻觉、接口不一致），包含超过1,000个独特的执行实例。通过对13个作为护栏的大语言模型和7个专用护栏的评估，我们得出三个关键发现：1）结构瓶颈：护栏效能更多取决于结构化数据处理能力（如JSON解析），而非语义安全对齐。其性能与结构化到文本的基准测试呈现强相关性（$ρ=0.79$），但与标准越狱鲁棒性几乎无关联。2）架构优于规模：模型架构对风险检测性能的影响显著大于模型规模，通用大语言模型在轨迹分析中始终优于专用安全护栏。3）时间稳定性：准确性在长轨迹中保持稳健。执行步骤的增加使模型能够从静态工具定义转向动态执行行为，反而提升了后期阶段的风险检测性能。我们的研究表明，保障智能体工作流程的安全需要联合优化结构化推理与安全对齐，以有效缓解轨迹中的风险。

摘要 (Abstract)

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($ρ=0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.

Keywords: LLM guardrails, multi-step tool-calling, autonomous agents, safety benchmark, trajectory analysis, risk detection, agentic workflows, structural reasoning
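The first finding above ties guardrail performance to structured-data handling such as JSON parsing. To make "multi-step tool-calling trajectory" and "interface inconsistency" concrete, here is a toy structural check over a trajectory encoded as JSON. The schema, tool names, and field names are hypothetical and are not taken from TraceSafe-Bench. The check covers structure only; semantic risks such as prompt injection would need separate handling.

```python
import json

# Hypothetical tool definitions: tool name -> required argument names
TOOLS = {
    "search_flights": {"origin", "destination", "date"},
    "book_flight": {"flight_id", "passenger"},
}

def check_trajectory(trajectory_json: str) -> list:
    """Return (step_index, problem) tuples for structurally inconsistent tool calls."""
    try:
        steps = json.loads(trajectory_json)
    except json.JSONDecodeError as exc:
        return [(-1, f"invalid JSON: {exc.msg}")]
    problems = []
    for index, step in enumerate(steps):
        tool = step.get("tool")
        args = set(step.get("arguments", {}))
        if tool not in TOOLS:
            problems.append((index, f"unknown tool {tool!r}"))
            continue
        missing = TOOLS[tool] - args
        extra = args - TOOLS[tool]
        if missing:
            problems.append((index, f"missing arguments {sorted(missing)}"))
        if extra:
            problems.append((index, f"unexpected arguments {sorted(extra)}"))
    return problems

trajectory = json.dumps([
    {"tool": "search_flights",
     "arguments": {"origin": "TPE", "destination": "NRT", "date": "2026-05-01"}},
    {"tool": "book_flight", "arguments": {"flight_id": "JL802"}},
    {"tool": "send_email", "arguments": {"to": "someone@example.com"}},
])
print(check_trajectory(trajectory))
```

Even this trivial check needs the trajectory to be parsed and compared against tool definitions step by step, which gives some intuition for why a model that handles structured data poorly would struggle as a trajectory guardrail regardless of its safety tuning.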

5. ✅ TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design

Authors: Juan Du, Yueteng Wu, Pan Zhao, Yuze Liu, Min Zhang, Xiaobin Xu, Xinglong Zhang | Source: arXiv | Published: 2026-04-08 | arXiv link: http://arxiv.org/abs/2604.06747v1

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0
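
The total above can be reproduced from the table. The sketch below assumes each row contributes weight × relevance, capped at the per-keyword maximum of 15 from the report's scoring settings, and that the pass mark follows the stated formula 5.0 + 0.8 × total weight. The function and variable names are illustrative and are not taken from the tool that generated this report.

```python
# Assumed reconstruction of the report's scoring rule; all names are hypothetical.
MAX_SCORE_PER_KEYWORD = 15.0


def passing_threshold(total_weight: float) -> float:
    """Pass mark as stated in the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def total_score(rows):
    """Sum weight * relevance over (weight, relevance) rows, capping each term."""
    return sum(min(w * r, MAX_SCORE_PER_KEYWORD) for w, r in rows)


# Non-zero rows of the TurboAgent table, followed by the 22 keywords scored 0.0.
rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 5.0), (1.0, 10.0), (1.0, 10.0)]
rows += [(1.0, 0.0)] * 22

score = total_score(rows)
threshold = passing_threshold(sum(w for w, _ in rows))
print(score, round(threshold, 1), score >= threshold)
```

With 27 keywords of weight 1.0, the threshold evaluates to 26.6, which matches the pass mark shown next to each score.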

!!! tip deepseek-chat TL;DR

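This study proposes TurboAgent, an LLM-driven autonomous multi-agent framework for turbomachinery aerodynamic design, to address the inefficiency of traditional trial-and-error methods and the difficulty of end-to-end automation. Validation on a transonic single-rotor compressor shows that the framework generates final designs directly from natural-language requirements, completes the closed-loop workflow in approximately 30 minutes under parallel computing, and improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%.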

摘要翻译

涡轮机械的气动设计是一个复杂且紧密耦合的多阶段过程，涉及几何生成、性能预测、优化和高保真物理验证。现有的智能设计方法通常侧重于单个阶段或依赖于松散耦合的流水线，这使得实现完全自主的端到端设计具有挑战性。为解决这一问题，本研究提出了TurboAgent，一个由大语言模型驱动、用于涡轮机械气动设计与优化的自主多智能体框架。该框架以大语言模型作为任务规划与协调的核心，同时由专门化的智能体负责生成式设计、快速性能预测、多目标优化以及基于物理的验证。该框架将传统的试错式设计转变为数据驱动的协同工作流，并保留高保真仿真用于最终验证。
研究采用一个跨音速单转子压气机进行验证。结果表明，目标性能、生成的设计与计算流体动力学仿真结果高度吻合。质量流量、总压比和等熵效率的决定系数均超过0.91，归一化均方根误差值低于8%。优化智能体进一步将等熵效率提升了1.61%，总压比提升了3.02%。在并行计算环境下，完整工作流可在约30分钟内执行完毕。
这些结果证明，TurboAgent能够实现从自然语言需求到最终设计生成的自主闭环设计流程，为涡轮机械气动设计提供了一种高效且可扩展的范式。

摘要 (Abstract)

The aerodynamic design of turbomachinery is a complex and tightly coupled multi-stage process involving geometry generation, performance prediction, optimization, and high-fidelity physical validation. Existing intelligent design approaches typically focus on individual stages or rely on loosely coupled pipelines, making fully autonomous end-to-end design challenging. To address this issue, this study proposes TurboAgent, a large language model (LLM)-driven autonomous multi-agent framework for turbomachinery aerodynamic design and optimization. The LLM serves as the core for task planning and coordination, while specialized agents handle generative design, rapid performance prediction, multi-objective optimization, and physics-based validation. The framework transforms traditional trial-and-error design into a data-driven collaborative workflow, with high-fidelity simulations retained for final verification. A transonic single-rotor compressor is used for validation. The results show strong agreement between target performance, generated designs, and CFD simulations. The coefficients of determination ($R^2$) for mass flow rate, total pressure ratio, and isentropic efficiency all exceed 0.91, with normalized RMSE values below 8%. The optimization agent further improves isentropic efficiency by 1.61% and total pressure ratio by 3.02%. The complete workflow can be executed within approximately 30 minutes under parallel computing. These results demonstrate that TurboAgent enables an autonomous closed-loop design process from natural language requirements to final design generation, providing an efficient and scalable paradigm for turbomachinery aerodynamic design.
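
The two agreement metrics quoted in the abstract can be computed as sketched below. The abstract does not say how RMSE is normalized, so dividing by the range of the reference values is an assumption (one common convention), and the sample numbers are made up for illustration only.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def nrmse(y_true, y_pred):
    """RMSE divided by the range of the reference values (assumed convention)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / (max(y_true) - min(y_true))


# Made-up total pressure ratios: requested targets vs. values a CFD run might return.
target = [1.50, 1.60, 1.70, 1.80, 1.90]
cfd = [1.52, 1.59, 1.71, 1.78, 1.91]

print(f"R^2 = {r_squared(target, cfd):.3f}, NRMSE = {nrmse(target, cfd):.3%}")
```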

6. ✅ ReDAct: Uncertainty-Aware Deferral for LLM Agents

评分: 45.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

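To address error accumulation caused by hallucinations when LLM agents make sequential decisions, this paper proposes ReDAct, a framework that pairs a small LLM with a large one and defers a decision to the large model when the small model's uncertainty exceeds a calibrated threshold. Deferring only about 15% of decisions matches the quality of using the large model exclusively while significantly reducing inference cost.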

摘要翻译

近年来，基于大语言模型（LLM）的智能体在众多应用场景中日益普及，包括复杂的序列决策问题。然而，它们也继承了大语言模型容易产生幻觉的倾向，从而导致错误决策。在序列化环境中，即使单个错误也可能不可逆转地破坏任务轨迹，使得幻觉问题更为突出。尽管规模更大的大语言模型幻觉更少，但其每词元（per-token）计算成本显著更高。本文通过提出ReDAct（推理-延迟-执行）框架来解决这一权衡问题。在ReDAct中，智能体配备了两个大语言模型：一个默认使用的小型廉价模型，以及一个更可靠但昂贵的大型模型。当小型模型的预测不确定性超过校准阈值时，决策将被延迟交由大型模型处理。我们在ALFWorld和MiniGrid等基于文本的具身环境中评估了该方法，结果表明：仅将约15%的决策延迟交由大型模型处理，即可达到完全使用大型模型的质量水平，同时显著降低推理成本。

摘要 (Abstract)

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.

关键词: LLM Agents, Uncertainty-Aware Deferral, Hallucination Mitigation, Sequential Decision-Making, Inference Cost Reduction, Small Language Models, Large Language Models, ReDAct
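
The deferral rule described in the abstract can be sketched as a thresholded uncertainty check. The abstract does not name the uncertainty measure, so Shannon entropy over the small model's action distribution is an assumption here, and `large_model` is a hypothetical stand-in for an expensive LLM call.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_action(small_probs, actions, large_model, threshold):
    """Act with the small model unless its uncertainty exceeds the threshold.

    Returns (action, deferred). `large_model` is a hypothetical callable that
    picks an action from the same list.
    """
    if entropy(small_probs) > threshold:
        return large_model(actions), True
    best = max(range(len(actions)), key=lambda i: small_probs[i])
    return actions[best], False


actions = ["open drawer", "go to shelf", "take mug"]
large_model = lambda acts: acts[2]  # stand-in for the large, expensive model

confident = choose_action([0.90, 0.05, 0.05], actions, large_model, threshold=0.8)
uncertain = choose_action([0.40, 0.30, 0.30], actions, large_model, threshold=0.8)
print(confident, uncertain)
```

In practice the threshold would be calibrated on held-out trajectories so that only a small share of steps is deferred; the value 0.8 above is arbitrary.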

7. ✅ When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t

评分: 44.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	5.0/10	5.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

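The paper finds that, unlike humans, vision-language models (VLMs) systematically violate their own stated introspective rules in a color-labeling task, which suggests that VLM reasoning failures are not driven by task difficulty but by miscalibrated introspective self-knowledge.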

摘要翻译

理解视觉-语言模型（VLMs）何时会出现意外行为、模型能否可靠预测自身行为，以及模型是否遵循其内省推理过程，是实现可信部署的核心挑战。为研究这些问题，我们引入了分级颜色归因（Graded Color Attribution, GCA）数据集——一个旨在激发决策规则并评估参与者对这些规则遵循程度的受控基准。GCA包含三种条件下像素级颜色覆盖度各异的线条图：基于世界知识的重新着色、反事实重新着色，以及无颜色先验的形状。通过GCA，视觉-语言模型与人类参与者均会建立一个阈值：物体必须包含特定颜色像素的最小百分比才能获得该颜色标签。随后我们将这些规则与它们后续的颜色归因决策进行比较。研究发现表明，模型会系统性违背其自身的内省规则。例如，在具有强颜色先验的物体上，GPT-5-mini在近60%的情况下违反了其声明的内省规则。人类参与者则始终遵循其声明的规则，任何表面上的违规行为均可通过一种有充分记录的倾向——即高估颜色覆盖度——来解释。相比之下，我们发现视觉-语言模型能出色地估计颜色覆盖度，却在最终响应中公然违背自身的推理过程。在所有模型及激发内省规则的策略中，世界知识先验会系统性降低遵循程度，且这种降低方式并不反映人类认知模式。我们的研究结果挑战了“视觉-语言模型推理失败源于任务难度”的观点，表明视觉-语言模型的内省自我认知存在校准偏差，这对高风险场景下的部署具有直接启示。

摘要 (Abstract)

Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.

关键词: Vision-Language Models, introspective reasoning, faithfulness, color attribution, world-knowledge priors, self-knowledge calibration, reasoning failures, trustworthy deployment
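
The faithfulness check described in the abstract compares a stated threshold rule with later labeling decisions. A minimal sketch, assuming each trial records the true color coverage and whether the participant applied the color label; the data and function names are hypothetical and not from the GCA dataset.

```python
def violates_rule(stated_threshold, coverage, said_color):
    """True when the final answer contradicts the participant's stated rule."""
    rule_says_color = coverage >= stated_threshold
    return said_color != rule_says_color


def violation_rate(stated_threshold, trials):
    """Fraction of (coverage, said_color) trials that break the stated rule."""
    broken = sum(violates_rule(stated_threshold, c, s) for c, s in trials)
    return broken / len(trials)


# Hypothetical participant who states: "label it red at >= 50% red pixels".
trials = [(0.10, True), (0.30, True), (0.60, True), (0.80, True), (0.20, False)]
rate = violation_rate(0.50, trials)
print(f"violation rate: {rate:.0%}")
```

Here the first two trials apply the label below the stated threshold, so two of five decisions count as violations.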

8. ✅ Does a Global Perspective Help Prune Sparse MoEs Elegantly?

作者: Zeliang Zhang, Nikhil Ghosh, Jiani Liu, Bin Yu, Xiaodong Liu 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.06542v1

评分: 41.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	15.0/10	15.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	8.0/10	8.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

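To reduce the large memory footprint of sparse Mixture-of-Experts (MoE) models, this paper proposes GRAPE, a global redundancy-aware pruning strategy that allocates pruning budgets dynamically across layers. Under the same pruning budget, GRAPE improves average accuracy over the strongest local baseline by 1.40% on average on the three main models reported.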

摘要翻译

语言模型的实证缩放定律推动了大型语言模型（LLM）规模的持续扩大，尽管其计算和内存成本不断增长。稀疏专家混合模型（Sparse Mixture-of-Experts, MoE）提供了一种前景广阔的替代方案，它仅在每次前向传播中激活一部分专家，从而在不牺牲性能的前提下提升了效率。然而，大量的专家参数仍导致显著的内存消耗。
现有的剪枝方法通常在各层间均匀分配预算，忽视了稀疏MoE中产生的异构冗余。我们提出了GRAPE（面向专家的全局冗余感知剪枝），这是一种全局剪枝策略，能够基于跨层冗余动态分配剪枝预算。在Mixtral-8x7B、Mixtral-8x22B、DeepSeek-MoE、Qwen-MoE和GPT-OSS上的实验表明，在相同剪枝预算下，GRAPE始终能取得最佳的平均性能。在论文报告的三个主要模型上，相较于最强的局部基线方法，GRAPE在不同剪枝设置下的平均准确率平均提升了1.40%，最高提升可达2.45%。

摘要 (Abstract)

Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.

关键词: Sparse Mixture-of-Experts, MoE pruning, Global pruning strategy, Model compression, Memory efficiency, Cross-layer redundancy, GRAPE, Large language models
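
The contrast between uniform and global budget allocation can be illustrated with a toy example. This is not the GRAPE algorithm: the per-expert redundancy scores are hypothetical inputs, and the global rule below simply prunes the highest-scoring experts model-wide while keeping at least one expert per layer.

```python
def prune_uniform(redundancy, per_layer):
    """Drop the `per_layer` most redundant experts in every layer."""
    pruned = []
    for layer, scores in enumerate(redundancy):
        ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        pruned += [(layer, e) for e in ranked[:per_layer]]
    return sorted(pruned)


def prune_global(redundancy, budget, min_keep=1):
    """Drop the `budget` most redundant experts model-wide, keeping at least
    `min_keep` experts in each layer."""
    candidates = sorted(
        ((s, layer, e) for layer, scores in enumerate(redundancy)
         for e, s in enumerate(scores)),
        reverse=True,
    )
    kept = [len(scores) for scores in redundancy]
    pruned = []
    for _, layer, e in candidates:
        if len(pruned) == budget:
            break
        if kept[layer] > min_keep:
            kept[layer] -= 1
            pruned.append((layer, e))
    return sorted(pruned)


# Hypothetical redundancy scores: layer 1 is far more redundant than layer 0.
redundancy = [[0.1, 0.2, 0.1, 0.3], [0.9, 0.8, 0.7, 0.2]]
print(prune_uniform(redundancy, per_layer=1))  # one expert per layer
print(prune_global(redundancy, budget=2))      # same total budget, both from layer 1
```

Both calls remove two experts in total, but the global variant spends the whole budget on the layer where redundancy is concentrated.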

9. ✅ A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

作者: Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07274v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	15.0/10	15.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

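This paper systematically studies retrieval pipeline design for retrieval-augmented generation (RAG) in medical question answering. Across 40 configurations, retrieval augmentation significantly improves zero-shot performance, the best configuration reaches 60.49% accuracy, and the results expose a tradeoff between retrieval effectiveness and computational cost.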

摘要翻译

大语言模型（LLM）在医学问答任务中展现出强大能力；然而，纯参数化模型常受限于知识缺口与事实依据不足。检索增强生成（RAG）通过将外部知识检索整合至推理过程，有效应对了这一局限。尽管基于RAG的医学系统日益受到关注，但各检索组件对系统性能的具体影响仍未得到充分理解。本研究基于MedQA USMLE基准与结构化教科书知识库，对检索增强型医学问答进行了系统性评估。我们在包含四十种配置的统一实验框架内，分析了语言模型、嵌入模型、检索策略、查询重构与交叉编码器重排序之间的交互作用。结果表明，检索增强显著提升了零样本医学问答性能。最佳配置方案——结合查询重构与重排序的稠密检索——实现了60.49%的准确率。研究还发现，领域专用语言模型比通用模型能更有效地利用检索到的医学证据。进一步分析揭示了检索效能与计算成本间的明确权衡：较简单的稠密检索配置在保持高吞吐量的同时，仍能提供强劲性能。所有实验均在单张消费级GPU上完成，证明检索增强型医学问答系统的系统性评估可在适度计算资源下实现。

摘要 (Abstract)

Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration, dense retrieval with query reformulation and reranking, achieved 60.49% accuracy. Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.

关键词: Retrieval-Augmented Generation, Medical Question Answering, Large Language Models, Retrieval Pipeline, MedQA, Zero-shot Learning, Dense Retrieval, Query Reformulation
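
The retrieve-then-rerank pipeline evaluated in the abstract can be sketched in two stages. The embeddings below are hand-written toy vectors rather than the output of a real embedding model, and `overlap_score` is a simple stand-in for a cross-encoder; both are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, corpus, k):
    """First stage: rank passages by embedding similarity and keep the top k."""
    ranked = sorted(corpus, key=lambda p: cosine(query_vec, p["vec"]), reverse=True)
    return ranked[:k]


def rerank(query, passages, score_fn, k):
    """Second stage: re-score (query, passage) pairs jointly and keep the top k."""
    return sorted(passages, key=lambda p: score_fn(query, p["text"]), reverse=True)[:k]


def overlap_score(query, text):
    """Stand-in for a cross-encoder: fraction of query words found in the passage."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)


corpus = [
    {"text": "Metformin is a first-line drug for type 2 diabetes", "vec": [0.9, 0.1, 0.0]},
    {"text": "Insulin therapy is used in type 1 diabetes", "vec": [0.8, 0.3, 0.0]},
    {"text": "Beta blockers reduce heart rate", "vec": [0.0, 0.2, 0.9]},
]
query = "first-line drug for type 2 diabetes"
query_vec = [0.85, 0.25, 0.0]  # toy query embedding

candidates = retrieve(query_vec, corpus, k=2)
top = rerank(query, candidates, overlap_score, k=1)
print(top[0]["text"])
```

In this toy setup the first stage ranks the insulin passage highest, and the reranker promotes the metformin passage, which is the kind of correction a second-stage scorer is meant to provide at extra computational cost.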

10. ✅ Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models

作者: Marshall Brett 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.06767v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	10.0/10	10.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

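This paper studies the geometry of the Voronoi tessellation in the latent semantic manifold of a large language model (Qwen3.5-4B-Base), validates the linear scaling law of the expressibility gap, and proposes margin refinement procedures that reshape the tessellation and improve the model's representations without retraining.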

摘要翻译

语言模型在离散的标记上操作，但在连续的向量空间中计算，从而在表示流形上诱导出沃罗诺伊镶嵌结构。本研究基于Qwen3.5-4B-Base模型对此镶嵌结构进行了实证分析，并做出两项贡献。首先，通过采用float32边界重计算以消除bfloat16量化伪影，我们验证了Mabrok（2026）提出的表达能力间隙线性缩放定律（$R^2$ = 0.9997），这是迄今为止最强有力的证实；同时，我们发现了一个中间层的几何模糊区域（第24-28层，$ρ$ = -0.29），其中边界几何结构与交叉熵呈负相关，直至最终层才结晶为对齐状态（$ρ$ = 0.836）。
其次，我们证明收敛模型的沃罗诺伊镶嵌结构可通过边界优化程序进行重塑：这是一种无需重新训练的简短事后优化过程，旨在拓宽标记决策边界。我们在剂量反应扫描中比较了直接边界最大化与费舍尔信息距离最大化两种方法。两种方法均发现，在每评估256K个位置中，可修正位置的上限约为16,300个，但其关键差异在于附带损害。边界最大化的损害随干预强度增加而加剧，直至修正效果被淹没；而费舍尔方法在验证范围（$λ$ = 0.15-0.6）内损害保持恒定（约5,300个位置），在$λ$ = 0.6时实现中值边界提升+28%，且下游基准测试结果保持不变——这是一种压缩表达能力间隙同时保留其缩放定律的几何重组。然而，频率与标记类别审计显示，增益集中于高频结构标记（$λ$ = 0.6时净修正量的84%），而内容类及实体类标记的贡献随$λ$升高而缩减。因此，费舍尔边界优化程序是一种可行的几何精修工具，其实际上限并非由总体损害决定，而是取决于标记级收益的均匀性。

摘要 (Abstract)

Language models operate on discrete tokens but compute in continuous vector spaces, inducing a Voronoi tessellation over the representation manifold. We study this tessellation empirically on Qwen3.5-4B-Base, making two contributions. First, using float32 margin recomputation to resolve bfloat16 quantization artifacts, we validate Mabrok’s (2026) linear scaling law of the expressibility gap with $R^2$ = 0.9997 - the strongest confirmation to date - and identify a mid-layer geometric ambiguity regime where margin geometry is anti-correlated with cross-entropy (layers 24-28, $ρ$ = -0.29) before crystallizing into alignment at the final layer ($ρ$ = 0.836). Second, we show that the Voronoi tessellation of a converged model is reshapable through margin refinement procedures (MRP): short post-hoc optimization runs that widen token-decision margins without retraining. We compare direct margin maximization against Fisher information distance maximization across a dose-response sweep. Both methods find the same ceiling of ~16,300 correctable positions per 256K evaluated, but differ critically in collateral damage. Margin maximization damage escalates with intervention strength until corrections are overwhelmed. Fisher damage remains constant at ~5,300 positions across the validated range ($λ$ = 0.15-0.6), achieving +28% median margin improvement at $λ$ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the expressibility gap while preserving its scaling law. However, frequency and token-class audits reveal that gains concentrate in high-frequency structural tokens (84% of net corrections at $λ$ = 0.6), with content and entity-like contributions shrinking at higher $λ$. Fisher MRP is therefore a viable geometric polishing tool whose practical ceiling is set not by aggregate damage but by the uniformity of token-level benefit.

关键词: Large Language Models, Voronoi tessellation, latent semantic manifolds, scaling laws, expressibility gap, margin refinement, geometric properties, model interpretability
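
The role of higher-precision margin recomputation can be illustrated with a small example. It assumes the token-decision margin is the gap between the two largest logits, which the abstract does not define explicitly, and it emulates bfloat16 by truncating a float32 to its top 16 bits (real hardware usually rounds to nearest instead).

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def margin(logits):
    """Gap between the largest and second-largest logit (assumed definition)."""
    top, second = sorted(logits, reverse=True)[:2]
    return top - second


# Hypothetical logits whose top two entries differ by about 0.02.
logits = [12.01, 12.03, 3.5, -1.0]

coarse = margin([to_bfloat16(v) for v in logits])
fine = margin(logits)
print(coarse, fine)
```

Between 8 and 16, bfloat16 values are spaced 0.0625 apart, so both leading logits collapse to 12.0 and the low-precision margin reads as an exact tie, while the higher-precision computation recovers the small positive gap.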

11. ✅ StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

作者: Zhirui Chen, Peiyang Liu, Ling Shao 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.06746v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	10.0/10	10.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	5.0/10	5.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	5.0/10	5.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

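To address the memory bottleneck caused by linear KV cache growth in long-context LLM inference, this paper proposes StructKV, a structure-aware compression framework that uses a global centrality measure and dynamic pivot-layer detection to preserve long-range dependencies and retrieval robustness.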

摘要翻译

随着大语言模型（LLM）扩展至支持超过百万令牌的上下文窗口，键值（KV）缓存的线性增长带来了严重的内存容量与带宽瓶颈，制约了长上下文推理的效率。现有的压缩方法通常基于局部显著性度量对令牌进行优先级排序，以将预填充计算与解码内存解耦。然而，这些方法往往依赖于特定网络层的局部显著性快照，从而系统性地丢弃了那些在整个网络深度中充当全局信息枢纽、但在选定进行剪枝的特定层上暂时处于休眠状态的令牌。为解决这一局限，我们提出了StructKV，一种结构感知的KV缓存压缩框架，其引入了三项核心创新：首先，全局入度中心性通过聚合网络深度上的注意力模式来识别全局信息枢纽。其次，动态枢纽检测利用信息论度量自适应地定位最佳压缩层。最后，结构传播与解耦将计算预算与内存存储预算分离。在LongBench和RULER基准测试上的实验结果表明，StructKV能有效保持长程依赖性与检索鲁棒性。

摘要 (Abstract)

As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.

关键词: Large Language Models, KV Cache Compression, Long Context Inference, Memory Bottleneck, Structural Skeleton, Global Information Hubs, Attention Patterns, Inference Efficiency
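
The idea of scoring tokens by attention received across the whole network depth, rather than at a single layer, can be sketched as follows. This is not the StructKV implementation: the plain sum of incoming attention weights is an assumed scoring rule, and the attention maps are small hypothetical causal matrices.

```python
def in_degree_centrality(attn_layers):
    """Sum, over layers and query positions, the attention each key token receives.

    attn_layers[l][q][k] is the attention weight from query q to key k in layer l.
    """
    n_tokens = len(attn_layers[0][0])
    scores = [0.0] * n_tokens
    for layer in attn_layers:
        for row in layer:
            for k, w in enumerate(row):
                scores[k] += w
    return scores


def keep_top_tokens(scores, budget):
    """Indices of the `budget` highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical causal attention maps for 4 tokens; every row sums to 1.
attn_layers = [
    # Layer 0: token 2 receives almost no attention and looks dormant.
    [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0],
     [0.3, 0.6, 0.1, 0.0], [0.3, 0.5, 0.1, 0.1]],
    # Layer 1: token 2 becomes a hub for later queries.
    [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
     [0.1, 0.0, 0.9, 0.0], [0.1, 0.0, 0.8, 0.1]],
]

single_layer = keep_top_tokens(in_degree_centrality(attn_layers[:1]), budget=2)
all_layers = keep_top_tokens(in_degree_centrality(attn_layers), budget=2)
print(single_layer, all_layers)
```

Scoring on layer 0 alone would discard token 2, while aggregating over both layers retains it, which mirrors the failure mode of single-layer saliency snapshots that the abstract describes.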

12. ✅ Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing

作者: Ning Yang, Chuangxin Cheng, Haijun Zhang 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07148v1

评分: 40.0 / 26.6 ✅

评分详情

关键词	权重	相关度	得分
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	10.0/10	10.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
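The total in the table above can be re-derived from the report's configuration (pass mark = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0). The sketch below assumes each keyword score is weight × relevance, which matches the rows shown, and that a paper passes when its total reaches the pass mark. The helper names are hypothetical and are not part of the report generator.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report header: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for weight, relevance in rows)


# Paper 12: four keywords rated 10.0, the other 23 rated 0.0, all with weight 1.0.
rows = [(1.0, 10.0)] * 4 + [(1.0, 0.0)] * 23

score = total_score(rows)        # 40.0
threshold = pass_mark(27.0)      # approximately 26.6
passed = score >= threshold      # assumption: reaching the pass mark counts as a pass
print(score, round(threshold, 1), passed)
```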

!!! tip deepseek-chat TL;DR

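To address the difficulty of designing dynamic task-offloading policies in mobile edge computing, the paper proposes the COMLLM framework, which combines GRPO with a look-ahead collaborative simulation mechanism to achieve near-optimal latency and zero-shot topological scalability, outperforming SFT, DRL, and heuristic baselines.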

Abstract

Emerging computation-intensive applications impose stringent latency requirements on resource-constrained mobile devices. Mobile Edge Computing (MEC) addresses this challenge through task offloading. However, designing effective policies remains difficult due to dynamic task arrivals, time-varying channels, and the spatio-temporal coupling of server queues. Conventional heuristics lack adaptability, while Deep Reinforcement Learning (DRL) suffers from limited generalization and architectural rigidity, requiring retraining when network topology changes. Although Large Language Models (LLMs) offer semantic reasoning capabilities, standard Supervised Fine-Tuning (SFT) yields myopic policies that greedily minimize immediate latency without accounting for long-term system evolution. To address these limitations, we propose COMLLM, a generative framework that enables foresighted decision-making in MEC systems. COMLLM integrates Group Relative Policy Optimization (GRPO) with a Look-Ahead Collaborative Simulation (LACS) mechanism, which performs multi-step Monte Carlo rollouts while jointly modeling server queue dynamics. By incorporating these rollouts into the reward design, the framework captures the long-term impact of current decisions on future system states. Experimental results demonstrate that COMLLM achieves near-optimal latency and improved load-balancing fairness. Notably, it exhibits zero-shot topological scalability, allowing a model trained on small-scale networks to generalize to larger, unseen topologies without retraining, outperforming SFT, DRL, and heuristic baselines.

Keywords: Large Language Models, Mobile Edge Computing, Task Offloading, Monte Carlo Rollouts, Supervised Fine-Tuning, Multi-step Reasoning, Zero-shot Generalization, Latency Optimization
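As a rough illustration of how rollout-based rewards can feed GRPO, the sketch below scores each candidate offloading decision by the negative mean latency of a few Monte Carlo rollouts, then normalizes the rewards within the group. The queue model, the arrival distribution, and all function names are hypothetical stand-ins. They are not the paper's LACS mechanism or its reward design.

```python
import random
import statistics


def lookahead_reward(queues, action, task_size, horizon=3, num_rollouts=32, seed=0):
    """Negative mean latency over Monte Carlo rollouts of a toy queue model."""
    rng = random.Random(seed)  # same seed per action: identical future arrivals
    total = 0.0
    for _ in range(num_rollouts):
        q = list(queues)
        q[action] += task_size
        latency = q[action]  # backlog the current task waits behind, plus itself
        for _ in range(horizon):
            q = [max(0.0, backlog - 1.0) for backlog in q]  # each server drains 1 unit
            server = rng.randrange(len(q))                  # a future task arrives
            q[server] += rng.uniform(0.5, 1.5)
            latency += q[server]
        total += latency
    return -total / num_rollouts


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


queues = [6.0, 0.5, 3.0]        # current backlog on three edge servers (toy values)
candidate_actions = [0, 1, 2]   # one sampled offloading decision per server
rewards = [lookahead_reward(queues, a, task_size=1.0) for a in candidate_actions]
advantages = group_relative_advantages(rewards)
best_action = max(candidate_actions, key=lambda a: advantages[a])
print(best_action)  # 1
```

Because the advantage is computed relative to the group mean, no separate value network is needed. The rollouts are what make the reward reflect future queue states rather than only immediate latency.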

13. ✅ An empirical study of LoRA-based fine-tuning of large language models for automated test case generation

Authors: Milad Moradi, Ke Yan, David Colwell, Rhona Asgari Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06946v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	15.0/10	15.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

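Through an empirical study, the paper shows that parameter-efficient fine-tuning of large language models with LoRA significantly improves the generation of automated test cases from natural language requirements and brings open-source models close to the level of proprietary models.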

Abstract

Automated test case generation from natural language requirements remains a challenging problem in software engineering due to the ambiguity of requirements and the need to produce structured, executable test artifacts. Recent advances in LLMs have shown promise in addressing this task; however, their effectiveness depends on task-specific adaptation and efficient fine-tuning strategies. In this paper, we present a comprehensive empirical study on the use of parameter-efficient fine-tuning, specifically LoRA, for requirement-based test case generation. We evaluate multiple LLM families, including open-source and proprietary models, under a unified experimental pipeline. The study systematically explores the impact of key LoRA hyperparameters, including rank, scaling factor, and dropout, on downstream performance. We propose an automated evaluation framework based on GPT-4o, which assesses generated test cases across nine quality dimensions. Experimental results demonstrate that LoRA-based fine-tuning significantly improves the performance of all open-source models, with Ministral-8B achieving the best results among them. Furthermore, we show that a fine-tuned 8B open-source model can achieve performance comparable to pre-fine-tuned GPT-4.1 models, highlighting the effectiveness of parameter-efficient adaptation. While GPT-4.1 models achieve the highest overall performance, the performance gap between proprietary and open-source models is substantially reduced after fine-tuning. These findings provide important insights into model selection, fine-tuning strategies, and evaluation methods for automated test generation. In particular, they demonstrate that cost-efficient, locally deployable open-source models can serve as viable alternatives to proprietary systems when combined with well-designed fine-tuning approaches.

Keywords: LoRA, parameter-efficient fine-tuning, large language models, automated test case generation, empirical study, open-source models, GPT-4, evaluation framework
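To make the hyperparameters named in the abstract concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear layer: the frozen weight `W` is augmented by a low-rank update `B @ A` scaled by `alpha / r`. Dropout, which is usually applied on the adapter branch during training, is omitted for brevity. The function names and toy matrices are illustrative and do not come from the paper's experimental pipeline.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W: d_out x d_in frozen weight, A: r x d_in, B: d_out x r.
    """
    r = len(A)                           # rank = number of rows of A
    base = matvec(W, x)                  # frozen pretrained path
    update = matvec(B, matvec(A, x))     # low-rank trainable path
    scale = alpha / r                    # scaling factor
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter."""
    return r * (d_in + d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # toy frozen weight (identity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[1.0], [2.0]]             # up-projection (zero-initialized in practice)
x = [1.0, 2.0]

print(lora_forward(W, A, B, x, alpha=2.0))   # [7.0, 14.0]
print(lora_param_count(4096, 4096, r=8))     # 65536
```

For a 4096 × 4096 projection, a rank-8 adapter trains 65,536 parameters instead of roughly 16.8 million, which is why rank and scaling factor are the main knobs such a study would sweep.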

14. ✅ SentinelSphere: Integrating AI-Powered Real-Time Threat Detection with Cybersecurity Awareness Training

Authors: Nikolaos D. Tantaroudas, Ilias Karachalios, Andrew J. McCracken Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06900v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

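To address the shortage of cybersecurity practitioners and human-factor vulnerabilities, the study proposes the SentinelSphere platform, which integrates deep-learning-based threat detection with LLM-driven security training. Experiments show that the detector attains high accuracy while lowering false positives, and validation workshops confirm that the educational component is usable by non-technical users.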

Abstract

The field of cybersecurity is confronted with two interrelated challenges: a worldwide deficit of qualified practitioners and ongoing human-factor weaknesses that account for the bulk of security incidents. To tackle these issues, we present SentinelSphere, a platform driven by artificial intelligence that unifies machine learning-based threat identification with security training powered by a Large Language Model (LLM). The detection module uses an Enhanced Deep Neural Network (DNN) trained on the CIC-IDS2017 and CIC-DDoS2019 benchmark datasets, enriched with novel HTTP-layer feature engineering that captures application-level attack signatures. For the educational component, we deploy a quantised variant of the Phi-4 model (Q4_K_M), fine-tuned for the cybersecurity domain, enabling deployment on commodity hardware requiring only 16 GB of RAM without dedicated GPU resources. Experimental results show that the Enhanced DNN attains high detection accuracy while substantially lowering false positives relative to baseline models, and maintains strong recall across critical attack categories such as DDoS, brute force, and web-based exploits. Validation workshops involving industry professionals and university students confirmed that the Traffic Light visualisation system and conversational AI assistant are both intuitive and effective for users without technical backgrounds. SentinelSphere illustrates that coupling intelligent threat detection with adaptive, LLM-driven security education can meaningfully address both technical and human-factor cybersecurity vulnerabilities within a single, cohesive framework.

Keywords: cybersecurity, threat detection, Large Language Model, quantization, fine-tuning, DNN, Phi-4, security training
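The general idea behind low-bit weight quantization can be sketched with a simple blockwise scheme: each block of weights shares one floating-point scale and stores small signed integer codes. This is a generic symmetric 4-bit example written for illustration only. It is not the Q4_K_M layout the paper deploys, and the function names are hypothetical.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization of one block to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit codes
    scale = max(abs(v) for v in values) / qmax or 1.0   # fall back to 1.0 for all-zero blocks
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from a scale and integer codes."""
    return [scale * c for c in codes]


def quantize(weights, block_size=4, bits=4):
    """Split weights into blocks and quantize each block independently."""
    return [quantize_block(weights[start:start + block_size], bits)
            for start in range(0, len(weights), block_size)]


weights = [0.7, -0.35, 0.1, 0.0, 2.0, -1.0, 0.5, 0.25]   # toy values
blocks = quantize(weights)
restored = [w for scale, codes in blocks for w in dequantize_block(scale, codes)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(max_error)
```

Per-block scales keep the rounding error proportional to the largest weight in each block rather than in the whole tensor, which is the basic reason blockwise formats hold up at 4 bits.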

15. ✅ Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

Authors: Zonghuan Xu, Xiang Zheng, Yutao Wu, Xingjun Ma Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06820v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

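The paper examines how well LLM judges, used as evaluation proxies, agree with human reader responses when assessing the risk of LLM-generated disinformation. It finds that LLM judges agree strongly with one another but diverge markedly from human readers, so internal agreement is not evidence that they are valid proxies for human response.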

Abstract

Large language models (LLMs) can generate persuasive narratives at scale, raising concerns about their potential use in disinformation campaigns. Assessing this risk ultimately requires understanding how readers receive such content. In practice, however, LLM judges are increasingly used as a low-cost substitute for direct human evaluation, even though whether they faithfully track reader responses remains unclear. We recast evaluation in this setting as a proxy-validity problem and audit LLM judges against human reader responses. Using 290 aligned articles, 2,043 paired human ratings, and outputs from eight frontier judges, we examine judge–human alignment in terms of overall scoring, item-level ordering, and signal dependence. We find persistent judge–human gaps throughout. Relative to humans, judges are typically harsher, recover item-level human rankings only weakly, and rely on different textual signals, placing more weight on logical rigour while penalizing emotional intensity more strongly. At the same time, judges agree far more with one another than with human readers. These results suggest that LLM judges form a coherent evaluative group that is much more aligned internally than it is with human readers, indicating that internal agreement is not evidence of validity as a proxy for reader response.

Keywords: Large Language Models, LLM-generated disinformation, human evaluation, proxy-validity, judge-human alignment, risk assessment, reader response, evaluative gap
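The abstract does not say which statistic is used for item-level ordering. One common choice is Spearman rank correlation between judge scores and mean human ratings, sketched below in pure Python. The choice of metric, the function names, and the toy scores are all assumptions made for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


human = [3.0, 4.5, 2.0, 5.0, 3.5]   # hypothetical mean human ratings per article
judge = [2.0, 2.5, 3.0, 3.5, 1.5]   # hypothetical LLM-judge scores per article
print(spearman(human, judge))
```

The same function applied to pairs of judges gives a judge-to-judge agreement figure, so the internal and judge-to-human comparisons in the abstract can be placed on one scale.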

16. ✅ Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Authors: Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou, Junshan Zhang, Zhe Zhao Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07343v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	8.0/10	8.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

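The paper introduces the Personalized RewardBench benchmark for evaluating how well reward models capture personalized user preferences. It finds that existing reward models perform poorly on personalization (peak accuracy 75.94%) and that the benchmark correlates more strongly with downstream performance than existing baselines.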

Abstract

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models’ capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model’s performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models’ performance in downstream applications.

Keywords: Personalized RewardBench, reward models, pluralistic alignment, human preferences, benchmark evaluation, downstream performance, Best-of-N sampling, Proximal Policy Optimization
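An accuracy figure of this kind is typically the fraction of chosen/rejected pairs in which the reward model scores the chosen response higher. The sketch below assumes that definition. The record fields, the two toy reward functions, and the example pairs are hypothetical and are not drawn from the benchmark.

```python
def pairwise_accuracy(pairs, reward_fn):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    correct = sum(
        1 for p in pairs
        if reward_fn(p["rubric"], p["chosen"]) > reward_fn(p["rubric"], p["rejected"])
    )
    return correct / len(pairs)


def generic_reward(rubric, response):
    """Toy reward that ignores the user rubric and prefers longer answers."""
    return len(response.split())


def rubric_aware_reward(rubric, response):
    """Toy reward that conditions on the user's stated preference."""
    length = len(response.split())
    return -length if rubric == "concise" else length


pairs = [
    {"rubric": "concise",
     "chosen": "Use a dict.",
     "rejected": "You could consider using a dict for this particular task."},
    {"rubric": "detailed",
     "chosen": "Use a dict because lookups are constant time on average.",
     "rejected": "Use a dict."},
]

print(pairwise_accuracy(pairs, generic_reward))       # 0.5
print(pairwise_accuracy(pairs, rubric_aware_reward))  # 1.0
```

Both responses in each pair are acceptable in general quality. Only the rubric separates them, so a reward model that ignores the user gets no better than chance on this toy set.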

17. ❌ Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Authors: Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06699v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

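Scoring rationale: The paper's core contribution is aPSF, a prompt-optimization framework for LLMs that factorizes prompt structure into semantic factors and applies interventional updates to improve reasoning performance. It is therefore highly relevant to "Large Language Models" (10 points), since it explicitly uses and optimizes for LLMs. It is partially relevant to "Chain of Thought" and "System 2 Thinking" (8 points each), because aPSF is tested on advanced reasoning benchmarks that involve multi-step reasoning and deliberate thinking, but it does not study CoT or System 2 directly. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.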

!!! tip deepseek-chat TL;DR

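The paper proposes Adaptive Prompt Structure Factorization (aPSF), a framework that automatically discovers and optimizes prompt programs for LLMs. It targets the component coupling and unclear credit assignment of existing prompt optimizers, improving accuracy on multiple advanced reasoning benchmarks while substantially reducing optimization cost.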

Abstract

Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor’s marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45–87% tokens on MultiArith while reaching peak validation in 1 step.

Keywords: prompt optimization, large language models, reasoning benchmarks, adaptive prompt structure factorization, interventional factor-level scoring, semantic factors, API-only framework, validation-performance
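One way to read "interventional factor-level scoring" is as an ablation: remove one factor, re-run validation, and treat the drop in score as that factor's marginal contribution. The sketch below follows that reading and pairs it with a single-factor update that is kept only if validation improves. The factor names, the stand-in `toy_evaluate` scorer, and the acceptance rule are assumptions, not the paper's actual procedure.

```python
def factor_contributions(factors, evaluate):
    """Estimate each factor's marginal contribution by ablating it."""
    base = evaluate(factors)
    contributions = {}
    for name in factors:
        ablated = {k: v for k, v in factors.items() if k != name}
        contributions[name] = base - evaluate(ablated)
    return contributions


def single_factor_update(factors, name, candidate, evaluate):
    """Replace one factor and keep the change only if validation improves."""
    trial = dict(factors)
    trial[name] = candidate
    return trial if evaluate(trial) > evaluate(factors) else factors


def toy_evaluate(factors):
    """Stand-in for validation accuracy; a real run would query an LLM API."""
    text = " ".join(factors.values())
    score = 0.5
    if "step by step" in text:
        score += 0.3
    if "final number" in text:
        score += 0.1
    return score


factors = {
    "task": "Solve the arithmetic word problem.",
    "reasoning": "Think step by step.",
    "format": "Answer briefly.",
}

contributions = factor_contributions(factors, toy_evaluate)
weakest = min(contributions, key=contributions.get)
updated = single_factor_update(factors, "format", "Give only the final number.", toy_evaluate)
print(contributions["reasoning"] > contributions["format"], updated["format"])
```

Editing one factor at a time keeps credit assignment clean: any change in validation score is attributable to the factor that was touched.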

18. ❌ Steering the Verifiability of Multimodal AI Hallucinations

Authors: Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu, Xuanjing Huang, Yu-Gang Jiang Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06714v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

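Scoring rationale: The paper studies hallucinations in multimodal large language models (MLLMs), specifically how to control their verifiability. The core relevant keywords are "Large Language Models" (the paper explicitly studies MLLMs) and "Hallucination Mitigation" (the paper directly studies hallucination detection and control). "Explainable AI" is partially relevant because the paper analyzes model behavior through activation-space intervention, but it is not core explainability research. Other keywords such as MoE, SFT, and RAG are neither mentioned in nor related to the abstract.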

!!! tip deepseek-chat TL;DR

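The paper studies the verifiability of hallucinations in multimodal large language models (MLLMs). It proposes an activation-space intervention method to distinguish and control obvious versus elusive hallucinations, and validates the method experimentally.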

Abstract

AI applications driven by multimodal large language models (MLLMs) are prone to hallucinations and pose considerable risks to human users. Crucially, such hallucinations are not equally problematic: some hallucinated content can be detected by human users (i.e., obvious hallucinations), while other content is often missed or requires more verification effort (i.e., elusive hallucinations). This indicates that multimodal AI hallucinations vary significantly in their verifiability. Yet, little research has explored how to control this property for AI applications with diverse security and usability demands. To address this gap, we construct a dataset from 4,470 human responses to AI-generated hallucinations and categorize these hallucinations into obvious and elusive types based on their verifiability by human users. Further, we propose an activation-space intervention method that learns separate probes for obvious and elusive hallucinations. We reveal that obvious and elusive hallucinations elicit different intervention probes, allowing for fine-grained control over the model’s verifiability. Empirical results demonstrate the efficacy of this approach and show that targeted interventions yield superior performance in regulating corresponding verifiability. Moreover, simply mixing these interventions enables flexible control over the verifiability required for different scenarios.

Keywords: Multimodal Large Language Models, Hallucinations, Verifiability, Activation-space Intervention, Obvious Hallucinations, Elusive Hallucinations, Human Responses, Fine-grained Control
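The abstract does not specify how the probes are learned or applied. A common pattern for activation-space intervention is to derive one direction per behaviour as a difference of mean activations and add a weighted mix of directions to a hidden state. The sketch below assumes that pattern. The difference-of-means probes, the two-dimensional toy activations, and the mixing weights are all hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def difference_of_means(positive, negative):
    """Direction pointing from the negative-class mean to the positive-class mean."""
    return [a - b for a, b in zip(mean_vector(positive), mean_vector(negative))]


def intervene(hidden, weighted_directions):
    """Add a weighted mix of steering directions to a hidden state."""
    out = list(hidden)
    for weight, direction in weighted_directions:
        out = [h + weight * d for h, d in zip(out, direction)]
    return out


# Toy activations (hypothetical): faithful outputs vs. two hallucination types.
faithful_acts = [[1.0, 1.0], [1.0, 1.0]]
obvious_acts = [[3.0, 1.0], [5.0, 1.0]]
elusive_acts = [[1.0, 2.0], [1.0, 4.0]]

obvious_dir = difference_of_means(obvious_acts, faithful_acts)   # [3.0, 0.0]
elusive_dir = difference_of_means(elusive_acts, faithful_acts)   # [0.0, 2.0]

# Mix the interventions: push toward "obvious", away from "elusive".
steered = intervene([1.0, 1.0], [(0.5, obvious_dir), (-0.5, elusive_dir)])
print(steered)  # [2.5, 0.0]
```

Because the two directions are separate, their weights can be tuned independently, which is one way to picture the mixing of interventions that the abstract mentions.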

19. ❌ FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

Authors: Shunan Zhu, Jiawei Chen, Yonghao Yu, Hideya Ochiai Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06833v1

Score: 25.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	10.0/10	10.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

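Rationale: FedDetox addresses safety alignment of Small Language Models (SLMs) under federated learning; its core contribution is on-device data sanitization that prevents unintended data poisoning. It is therefore highly relevant to “Small Language Models (SLMs)” (10 points), because the paper explicitly targets SLMs on resource-constrained devices, and highly relevant to “Alignment” (10 points), because its central problem is preserving safety alignment during federated alignment. It is partly relevant to “Large Language Models (LLMs)” (5 points): the paper distills knowledge from large, safety-aligned teacher models, but LLMs themselves are not the focus. Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper targets federated learning of small language models on resource-constrained devices, where harmful content in user data degrades safety alignment. It proposes the FedDetox framework, which uses on-device data sanitization and knowledge distillation to reach a safety level comparable to centralized baselines while preserving general utility.

Abstract (Chinese translation)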

随着高质量公共数据日益稀缺，联邦学习（FL）为在保护隐私的同时利用有价值的私有用户数据提供了关键路径。然而，现实场景中的客户端数据常包含有害或不安全信息。这引发了一个我们定义为非预期数据投毒的关键问题，其可能在联邦对齐过程中严重破坏全局模型的安全性对齐。为解决该问题，我们提出FedDetox——一个专为资源受限边缘设备上的轻量化语言模型（SLMs）设计的鲁棒性框架。我们首先采用知识蒸馏技术，将大规模安全对齐教师模型中复杂的安全对齐能力迁移至适用于资源受限边缘设备的轻量级学生分类器。具体而言，在面向人类偏好对齐的联邦学习过程中，边缘客户端在数据源头识别不安全样本，并将其替换为拒绝模板，从而将潜在毒性数据有效转化为正向安全信号。实验表明，我们的方法能在保持模型通用性能的前提下，将模型安全性维持在接近集中式基线模型的水平。

Abstract

As high-quality public data becomes scarce, Federated Learning (FL) provides a vital pathway to leverage valuable private user data while preserving privacy. However, real-world client data often contains toxic or unsafe information. This leads to a critical issue we define as unintended data poisoning, which can severely damage the safety alignment of global models during federated alignment. To address this, we propose FedDetox, a robust framework tailored for Small Language Models (SLMs) on resource-constrained edge devices. We first employ knowledge distillation to transfer sophisticated safety alignment capabilities from large-scale, safety-aligned teacher models into lightweight student classifiers suitable for resource-constrained edge devices. Specifically, during federated learning for human preference alignment, the edge client identifies unsafe samples at the source and replaces them with refusal templates, effectively transforming potential poisons into positive safety signals. Experiments demonstrate that our approach preserves model safety at a level comparable to centralized baselines without compromising general utility.

Keywords: Federated Learning, Small Language Models, Safety Alignment, Data Poisoning, On-device Processing, Knowledge Distillation, Edge Devices, Data Sanitization
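The on-device sanitization step in the FedDetox abstract (detect unsafe samples at the source, replace them with a refusal) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `keyword_stub_classifier` is a trivial stand-in for the distilled student classifier, `REFUSAL_TEMPLATE` is a made-up refusal string, and samples are simplified to prompt/response pairs rather than preference data.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical refusal text; the abstract does not give the actual templates.
REFUSAL_TEMPLATE = "I'm sorry, but I can't help with that request."


@dataclass
class Sample:
    prompt: str
    response: str


def keyword_stub_classifier(sample: Sample) -> bool:
    """Stand-in for the distilled student safety classifier (assumption)."""
    blocked = ("build a bomb", "steal a password")
    text = f"{sample.prompt} {sample.response}".lower()
    return any(term in text for term in blocked)


def sanitize(samples: List[Sample],
             is_unsafe: Callable[[Sample], bool]) -> List[Sample]:
    """Replace the response of each unsafe sample with a refusal; keep the rest."""
    cleaned = []
    for s in samples:
        if is_unsafe(s):
            # The unsafe sample becomes a positive safety signal.
            cleaned.append(Sample(s.prompt, REFUSAL_TEMPLATE))
        else:
            cleaned.append(s)
    return cleaned


local_data = [
    Sample("How do I bake bread?", "Mix flour, water, yeast and salt."),
    Sample("How do I build a bomb?", "First, gather ..."),
]
clean_data = sanitize(local_data, keyword_stub_classifier)
```

Only `clean_data` would feed the client's local update in this sketch, so the unsafe response never reaches training.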

Authors: Xiaoyou Qin, Zhihong Li, Xiaoxiao Cheng Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06663v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

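Rationale: The paper studies the use of LLMs in social-science simulation, specifically restoring heterogeneity through audience segmentation, so it is highly relevant to “Large Language Models” (10 points). It uses Mixtral 8x22B, a model built on the MoE architecture, which gives partial relevance to “Mixture of Experts” (5 points). Simulating social attitudes and behaviors with LLMs can be viewed as an application of LLM agents, which gives partial relevance to “LLM Agents” (5 points). Other keywords, covering model training techniques, inference optimization, and specific scientific domains, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The study addresses the collapse of diversity into an “average persona” in LLM-based social simulation and proposes audience segmentation to restore heterogeneity. Its experiments show that segmentation configurations trade off across fidelity dimensions, with no single configuration performing best on all of them.

Abstract (Chinese translation)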

大型语言模型（LLM）正日益被用于模拟社会态度与行为，提供可扩展的“硅基样本”以近似人类数据。然而，当前的模拟实践常将多样性简化为“平均人格”，掩盖了作为社会现实核心的亚群体差异。本研究引入受众细分作为系统性方法，以恢复基于LLM的社会模拟中的异质性。利用美国气候意见调查数据，我们在两个开源权重LLM（Llama 3.1-70B 和 Mixtral 8x22B）上比较了六种细分配置，分别调整了细分标识符的粒度、简洁性及选择逻辑（理论驱动、数据驱动和基于测量工具）。我们通过一个涵盖分布保真度、结构保真度和预测保真度的三维评估框架来评价模拟性能。结果显示，增加标识符粒度并未带来一致的改进：适度丰富标识符可提升性能，但进一步扩展并不能稳定地改善效果，甚至可能损害结构保真度和预测保真度。在简洁性比较中，紧凑配置常能匹配或优于更复杂的替代方案，尤其在结构保真度和预测保真度方面，而分布保真度仍取决于具体度量指标。标识符选择逻辑决定了哪个保真度维度受益最大：基于测量工具的选择最能保持分布形态，而数据驱动的选择最能恢复组间结构及标识符与结果的关联。总体而言，没有任何单一配置能在所有维度上占优，且一个维度的性能提升可能伴随另一维度的损失。这些发现确立了受众细分作为实现有效LLM社会模拟的核心方法论，并强调需要采用关注异质性的评估策略和保持方差的建模方法。

Abstract

Large Language Models (LLMs) are increasingly used to simulate social attitudes and behaviors, offering scalable “silicon samples” that can approximate human data. However, current simulation practice often collapses diversity into an “average persona,” masking subgroup variation that is central to social reality. This study introduces audience segmentation as a systematic approach for restoring heterogeneity in LLM-based social simulation. Using U.S. climate-opinion survey data, we compare six segmentation configurations across two open-weight LLMs (Llama 3.1-70B and Mixtral 8x22B), varying segmentation identifier granularity, parsimony, and selection logic (theory-driven, data-driven, and instrument-based). We evaluate simulation performance with a three-dimensional evaluation framework covering distributional, structural, and predictive fidelity. Results show that increasing identifier granularity does not produce consistent improvement: moderate enrichment can improve performance, but further expansion does not reliably help and can worsen structural and predictive fidelity. Across parsimony comparisons, compact configurations often match or outperform more comprehensive alternatives, especially in structural and predictive fidelity, while distributional fidelity remains metric dependent. Identifier selection logic determines which fidelity dimension benefits most: instrument-based selection best preserves distributional shape, whereas data-driven selection best recovers between-group structure and identifier-outcome associations. Overall, no single configuration dominates all dimensions, and performance gains in one dimension can coincide with losses in another. These findings position audience segmentation as a core methodological approach for valid LLM-based social simulation and highlight the need for heterogeneity-aware evaluation and variance-preserving modeling strategies.

Keywords: Large Language Models, social simulation, audience segmentation, heterogeneity, climate-opinion, evaluation framework, distributional fidelity, predictive fidelity

21. ❌ Continuous Interpretive Steering for Scalar Diversity

Authors: Ye-eun Cho Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07006v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

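Rationale: The paper studies pragmatic inference in large language models (LLMs), specifically scalar diversity, and evaluates the pragmatic sensitivity of LLMs with an activation-steering method (CIS) and a dataset (GraSD). It is highly relevant to “Large Language Models” (10 points), because its core is evaluating the pragmatic abilities of LLMs, and highly relevant to “Mechanistic Interpretability” (10 points), because it uses activation steering to probe the internal representations of LLMs. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper uses Continuous Interpretive Steering (CIS) and the GraSD dataset to evaluate scalar diversity in the pragmatic inference of LLMs. It finds that graded activation steering can systematically recover the graded sensitivity encoded in the models' representation space.

Abstract (Chinese translation)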

语用推理本质上是渐变的。不同词汇项引发的语用充实程度存在差异，等级含义通过量表多样性体现了这一特性——不同量表项目产生的含义强度各不相同。然而，当前对大语言模型语用推理能力的评估常依赖基于提示词的操控。除提示层面影响外，本研究提出连续解释性导向法，通过将激活层导向强度作为连续实验变量，探究渐变的语用解释现象。为支持此分析，本研究构建了包含渐变量表多样性的新数据集GraSD。在四个大语言模型上的实验表明：均匀激活导向虽能全局提升语用解释倾向，却会消除项目级差异；而渐变激活导向能产生与量表多样性等级相匹配的差异化解释偏移。这证明渐变敏感性已编码于表征空间中，并能通过受控干预系统性地恢复。连续解释性导向法与GraSD数据集共同构成了评估大语言模型渐变语用敏感性的理论框架。

Abstract

Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations. Beyond prompt-level effects, this study introduces Continuous Interpretive Steering (CIS), a method that probes graded pragmatic interpretation by treating activation-level steering strength as a continuous experimental variable. To support this analysis, this study introduces a new dataset, GraSD, which encodes graded scalar diversity. Experiments on four LLMs show that uniform activation steering increases pragmatic interpretations globally but collapses item-level variation, whereas graded activation steering yields differentiated interpretive shifts aligned with scalar diversity grades. It indicates that graded sensitivity is encoded in the representation space and can be systematically recovered through controlled intervention. Together, CIS and GraSD provide a principled framework for evaluating graded pragmatic sensitivity in LLMs.

Keywords: Large Language Models, LLMs, pragmatic inference, scalar diversity, activation steering, interpretability, graded sensitivity, GraSD dataset
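The contrast between uniform and graded steering in the CIS abstract can be pictured with the common additive form of activation steering, h' = h + α·v, where α is the steering strength. The sketch below is illustrative only: the two-dimensional hidden states, the steering direction, the per-item grades, and the rule "strength = base strength × grade" are all assumptions, not details from the paper.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Additive steering: shift the hidden state by alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Vector, direction: Vector) -> float:
    """Component of the hidden state along the steering direction."""
    return sum(h * d for h, d in zip(hidden, direction))


def shift(hidden: Vector, direction: Vector, alpha: float) -> float:
    """How far steering moves the state along the direction."""
    return projection(steer(hidden, direction, alpha), direction) - projection(hidden, direction)


direction = [1.0, 0.0]  # hypothetical unit steering vector
# Hypothetical scalar items: (hidden state, scalar-diversity grade in [0, 1]).
items: Dict[str, Tuple[Vector, float]] = {
    "some": ([0.25, 0.5], 1.0),
    "warm": ([0.125, 0.75], 0.5),
}
base_alpha = 2.0

# Uniform steering: the same strength for every item.
uniform = {w: shift(h, direction, base_alpha) for w, (h, g) in items.items()}
# Graded steering: strength scaled by each item's grade (assumed rule).
graded = {w: shift(h, direction, base_alpha * g) for w, (h, g) in items.items()}

# Treating strength as a continuous variable: sweep alpha for one item.
alphas = [0.0, 0.5, 1.0, 1.5, 2.0]
sweep = [shift(items["warm"][0], direction, a) for a in alphas]
```

In this toy setup uniform steering moves every item by the same amount, while graded steering keeps the items apart, which mirrors the qualitative contrast the abstract reports.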

22. ❌ STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

Authors: Minglu Liu, Cunchen Hu, Liangliang Xu, Fengming Tang, Ruijia Wang, Fu Yu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06836v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	10.0/10	10.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

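Rationale: STQuant focuses on quantization during large-model training; its core contribution is a dynamic quantization framework that reduces the memory footprint of optimizer states. Relevance by keyword: 1) “Quantization” is highly relevant (10 points), because quantization is the paper's core technique; 2) “Large Language Models” and “Pre-training” are partly relevant (5 points each), because the experiments use GPT-2 (an LLM) and apply quantization during training; 3) other keywords such as MoE, SFT, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

STQuant is a spatio-temporal adaptive framework that reduces optimizer-state memory in large-model training through dynamic precision allocation. On GPT-2 and ViT it reduces optimizer-state memory by 84.4%, with an average bit-width as low as 5.1 bits, while maintaining model quality.

Abstract (Chinese translation)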

量化是降低大规模模型训练内存开销的有效途径。然而，现有方法大多采用固定精度策略，忽视了优化器状态分布在模型各层与训练步骤间存在显著差异的事实。这种统一的设计往往导致明显的精度损失。为突破固定量化的限制，我们提出了STQuant——一种分布式训练框架，它通过在模型层间、状态变量间及训练步骤间动态分配精度，在保持模型质量的同时显著降低优化器状态的内存占用。在训练中直接应用动态量化面临两大挑战：首先，优化器状态对数值变化敏感，量化噪声可能破坏训练稳定性；其次，同时考虑多状态、多层的量化会形成巨大的组合搜索空间。STQuant通过两项关键技术应对这些挑战：1）一种可证明近似最优的因子选择策略，能精准识别对精度调整影响最大的关键因子；2）一种动态转换决策算法，将搜索成本从指数复杂度降至线性复杂度。在GPT-2和ViT上的实验表明，相较于现有方案，STQuant将优化器状态内存降低84.4%，平均位宽降至5.1比特。此外，STQuant仅产生O(N/K)的计算开销，且仅需O(1)的额外存储空间。

Abstract

Quantization is an effective way to reduce the memory cost of large-scale model training. However, most existing methods adopt fixed-precision policies, which ignore the fact that optimizer-state distributions vary significantly across layers and training steps. Such uniform designs often introduce noticeable accuracy degradation. To move beyond fixed quantization, we propose STQuant, a distributed training framework that reduces the memory footprint of optimizer states via dynamic precision allocation across layers, state variables, and training steps, while maintaining model quality. Naively applying dynamic quantization during training is challenging for two reasons. First, optimizer states are numerically sensitive, and quantization noise can destabilize quality. Second, jointly considering multiple states and layers induces a large combinatorial search space. STQuant addresses these challenges with two key techniques: 1) a provably near-optimal factor selection strategy that accurately identifies the most influential factors for precision adaptation. 2) a dynamic transition decision algorithm that reduces the search cost from exponential to linear complexity. Experiments on GPT-2 and ViT show that STQuant reduces optimizer-state memory by 84.4%, achieving an average bit-width of as low as 5.1 bits, compared with existing solutions. Moreover, STQuant incurs only O(N/K) computational overhead and requires O(1) extra space.

Keywords: Quantization, Large Multimodal Model Training, Optimizer States, Memory Reduction, Dynamic Precision Allocation, Distributed Training Framework, GPT-2, ViT
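To make "dynamic precision allocation" and "average bit-width" concrete, here is a minimal sketch of symmetric uniform quantization at a chosen bit-width, plus the size-weighted average bit-width of a per-layer allocation. It is not STQuant's algorithm: the layer names, element counts, bit allocation, and the quantizer itself are assumptions chosen for illustration.

```python
from typing import Dict, List


def quantize_dequantize(values: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization to `bits` bits, then back to floats."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4 bits
    scale = max_abs / levels      # width of one quantization step
    return [round(v / scale) * scale for v in values]


def average_bit_width(sizes: Dict[str, int], bits: Dict[str, int]) -> float:
    """Size-weighted mean bit-width over all state tensors."""
    total = sum(sizes.values())
    return sum(sizes[name] * bits[name] for name in sizes) / total


# Hypothetical optimizer-state sizes (element counts) and per-layer bit-widths.
state_sizes = {"embed": 1000, "attn": 3000, "mlp": 4000}
state_bits = {"embed": 8, "attn": 4, "mlp": 5}
avg_bits = average_bit_width(state_sizes, state_bits)

# Quantize a toy first-moment vector at 4 bits.
moments = [1.0, -0.5, 0.25]
restored = quantize_dequantize(moments, bits=4)
```

A dynamic scheme would revisit `state_bits` as training proceeds; how STQuant selects and updates the allocation is described in the paper, not here.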

Authors: Paula Dodig, Boshko Koloski, Katarina Sitar Šuštar, Senja Pollak, Matthew Purver Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06826v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

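Rationale: The paper's core is ESG sentiment analysis with large language models (LLMs), so it is highly relevant to “Large Language Models” (10 points). It fine-tunes SloBERTa, which gives partial relevance to “Post-training” (5 points). Other keywords such as MoE, SLMs, Scaling Laws, Instruction Tuning, RAG, and Agents are either not mentioned in the abstract or unrelated to the topic, and score 0.

!!! tip deepseek-chat TL;DR

The paper builds the first publicly available ESG sentiment dataset for Slovene news and evaluates several model families. LLMs perform best on the Environmental and Social aspects, while fine-tuned SloBERTa is best on Governance classification.

Abstract (Chinese translation)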

环境、社会和治理（Environmental, Social, and Governance，简称ESG）考量日益成为评估企业绩效、声誉和长期可持续性的重要组成部分。然而，针对小型企业和新兴市场的可靠ESG评级仍然有限。本文首次引入了公开可用的斯洛文尼亚语ESG情感数据集，以及一套用于自动检测ESG情感的模型。该数据集源自MaCoCu斯洛文尼亚语新闻语料库，结合了大型语言模型（Large Language Model, LLM）辅助过滤与人工标注的公司相关ESG内容。我们评估了单语模型（SloBERTa）和多语模型（XLM-R）、基于嵌入的分类器（TabPFN）、分层集成架构以及大型语言模型的性能。结果显示，在环境（Gemma3-27B，宏观F1分数：0.61）和社会维度（gpt-oss 20B，宏观F1分数：0.45）上，大型语言模型表现最佳；而在治理分类任务中，经过微调的SloBERTa模型表现最优（宏观F1分数：0.54）。随后，我们通过一个小型案例研究展示了性能最佳的分类器（gpt-oss）如何应用于对选定公司在长时间跨度内的ESG表现进行分析。

Abstract

Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.

Keywords: ESG sentiment analysis, Slovene news dataset, large language models, SloBERTa fine-tuning, multilingual models, sentiment detection, corporate sustainability, news analysis
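The F1-macro figures quoted in this entry average per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A minimal sketch of the metric follows; the three sentiment labels and the toy predictions are assumptions for illustration, since the abstract does not list the dataset's label set.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical gold labels and predictions.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
score = macro_f1(gold, pred)
```

Here the per-class F1 values are 2/3 (positive), 2/3 (negative), and 1 (neutral), so the macro average is 7/9.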

24. ❌ On the Price of Privacy for Language Identification and Generation

Authors: Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07238v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

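Rationale: The title and abstract focus on the cost of privacy for large language models (LLMs), which directly matches the first keyword, “Large Language Models” OR “LLMs” OR “Foundation Models”, hence 10 points. The paper gives a theoretical performance analysis of language learning under training-data privacy protection (differential privacy); this counts as an innovation in the technical principles of large models, in the privacy direction, and fits the research background. The other keywords concern specific techniques or applications such as model architecture, training methods, inference optimization, and application domains, none of which the paper addresses, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper studies the theoretical performance loss of language identification and generation under differential privacy constraints. Approximate DP recovers the non-private error rates, while under pure DP the loss is only a multiplicative factor of $\min\{1, \varepsilon\}$ in the exponent.

Abstract (Chinese translation)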

随着大语言模型（LLMs）越来越多地基于敏感用户数据进行训练，理解语言学习中隐私的基本代价变得至关重要。我们首次在不可知统计设定下，系统研究了差分隐私（Differentially Private, DP）的语言识别与生成任务，建立了算法并给出了匹配的下界，从而精确量化了隐私的代价。对于这两项任务，采用常数 $\varepsilon > 0$ 的近似 $(\varepsilon, \delta)$-DP 能够恢复非隐私的错误率：识别任务为 $\exp(-r(n))$（对于任意 $r(n) = o(n)$），生成任务为 $\exp(-\Omega(n))$。在纯 $\varepsilon$-DP 下，指数项会衰减一个 $\min\{1, \varepsilon\}$ 的乘法因子，我们证明该因子在常数范围内是紧的。值得注意的是，在温和假设下，对于纯 DP 的生成任务，其上界 $\exp(-\min\{1,\varepsilon\} \cdot \Omega(n))$ 与下界在常数范围内匹配，从而确立了最优速率。我们的结果表明，语言学习中的隐私代价出人意料地温和：在近似 DP 下完全不存在，而在纯 DP 下则恰好表现为指数项上的一个 $\min\{1,\varepsilon\}$ 因子。

Abstract

As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate $(\varepsilon, \delta)$-DP with constant $\varepsilon > 0$ recovers the non-private error rates: $\exp(-r(n))$ for identification (for any $r(n) = o(n)$) and $\exp(-\Omega(n))$ for generation. Under pure $\varepsilon$-DP, the exponents degrade by a multiplicative factor of $\min\{1, \varepsilon\}$, which we show is tight up to constants. Notably, for generation under pure DP with mild assumptions, the upper bound $\exp(-\min\{1,\varepsilon\} \cdot \Omega(n))$ matches the lower bound up to some constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a $\min\{1,\varepsilon\}$ factor in the exponent under pure DP.

Keywords: large language models, differential privacy, language identification, language generation, privacy cost, error rates, theoretical analysis, optimal rate
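The rates stated in the abstract can be set side by side as follows. This is only a restatement: the symbol $\mathrm{err}(n)$ for the error after $n$ samples is notation introduced here, and the pure-DP identification line applies the abstract's "exponents degrade by a factor of $\min\{1,\varepsilon\}$" statement to the identification rate rather than quoting a formula from the paper.

```latex
\begin{aligned}
\text{Approximate } (\varepsilon,\delta)\text{-DP, identification:} &\quad \mathrm{err}(n) \le \exp(-r(n)) \quad \text{for any } r(n) = o(n) \\
\text{Approximate } (\varepsilon,\delta)\text{-DP, generation:} &\quad \mathrm{err}(n) \le \exp(-\Omega(n)) \\
\text{Pure } \varepsilon\text{-DP, identification:} &\quad \mathrm{err}(n) \le \exp(-\min\{1,\varepsilon\} \cdot r(n)) \\
\text{Pure } \varepsilon\text{-DP, generation:} &\quad \mathrm{err}(n) \le \exp(-\min\{1,\varepsilon\} \cdot \Omega(n))
\end{aligned}
```

For example, with $\varepsilon = 0.5$ the factor is $\min\{1, 0.5\} = 0.5$, so the exponent is halved; with any $\varepsilon \ge 1$ the factor is 1 and the exponent matches the non-private rate up to constants.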

25. ❌ SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Authors: Yixi Zhou, Fan Zhang, Zhiqiao Guo, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06736v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

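Rationale: The paper evaluates the structural reliability of LLMs in Text-to-SQL generation and proposes the SQLStructEval framework. It is therefore highly relevant to the keyword “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points), because it directly studies LLM-generated SQL queries. The remaining keywords (MoE, SLMs, Scaling Laws, Pre-training, Fine-tuning, Alignment, RLHF, PEFT, RAG, Context Window, KV Cache, Reasoning, Agents, Tool Use, Quantization, Inference, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, AI for Science) cover techniques or application areas the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper studies the structural reliability of LLM-generated Text-to-SQL queries. It finds that LLMs often produce structurally diverse queries even when execution results are correct, proposes the SQLStructEval framework for analyzing structure through canonical ASTs, and shows that generating queries in a structured space can improve both execution accuracy and structural consistency.

Abstract (Chinese translation)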

尽管在文本到SQL基准测试中表现强劲，大型语言模型生成的SQL程序是否具有结构可靠性仍不明确。本研究探究了LLM生成的SQL查询的结构行为，并提出了SQLStructEval框架，该框架通过规范抽象语法树表示来分析程序结构。我们在Spider基准上的实验表明，即使执行结果正确，现代LLM对相同输入也常生成结构差异显著的查询，且这种差异常由表层输入变化（如释义或模式呈现方式）触发。我们进一步证明，通过编译式流程在结构化空间中生成查询，能同时提升执行准确性与结构一致性。这些发现表明，结构可靠性是评估基于LLM的程序生成系统时关键但被忽视的维度。代码发布于https://anonymous.4open.science/r/StructEval-2435。

Abstract

Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.

Keywords: Text-to-SQL, LLM-generated SQL, structural evaluation, abstract syntax tree (AST), SQLStructEval, program generation, Spider benchmark, structural consistency
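The idea of comparing queries by canonical structure rather than by surface text can be shown with a deliberately tiny example. This is a toy canonicalizer written for this note, not SQLStructEval and not a real SQL parser: it only handles `SELECT cols FROM table [WHERE p1 AND p2 ...]`, assumes predicates contain no string literals, and the function name `canonical_form` is hypothetical.

```python
import re
from typing import Tuple


def canonical_form(sql: str) -> Tuple:
    """Map a very simple SELECT query to an order-insensitive structure."""
    # Collapse whitespace and drop a trailing semicolon.
    text = " ".join(sql.strip().rstrip(";").split())
    match = re.fullmatch(
        r"select (.+?) from (\w+)(?: where (.+))?", text, flags=re.IGNORECASE
    )
    if match is None:
        raise ValueError("unsupported query shape")
    cols, table, where = match.groups()
    # Column order affects the result set, so it is preserved.
    columns = tuple(c.strip().lower() for c in cols.split(","))
    predicates: Tuple[str, ...] = ()
    if where:
        parts = re.split(r"\s+and\s+", where, flags=re.IGNORECASE)
        # AND is commutative, so predicates are sorted into a canonical order.
        predicates = tuple(sorted(p.replace(" ", "") for p in parts))
    return ("select", columns, table.lower(), predicates)


q1 = "SELECT name, age FROM users WHERE age > 30 AND city_id = 5"
q2 = "select name,age from users where city_id=5 and age>30;"
q3 = "SELECT name, age FROM users WHERE age > 30"

same_structure = canonical_form(q1) == canonical_form(q2)
different_structure = canonical_form(q1) != canonical_form(q3)
```

`q1` and `q2` differ only in casing, spacing, and predicate order, so they map to the same structure, while `q3` drops a predicate and does not. A faithful implementation would need a full SQL parser and the canonical AST rules defined in the paper.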

26. ❌ EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling

Authors: Qingguo Meng, Xingbo Dong, Zhe Jin, Massimo Tistarelli Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06782v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

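Rationale: The paper addresses face recognition with event cameras, which belongs to computer vision rather than to large language models or innovations in deep-learning principles. The only relevant keyword is “PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”: the paper explicitly uses LoRA (Low-Rank Adaptation) to transfer structural priors from pretrained RGB face models to the event domain, and this is one of its core methods, hence 10 points. No other keyword is covered, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of stable photometric appearance in event-camera face recognition. Its EventFace framework combines LoRA-based transfer of structural priors with spatiotemporal modeling, reaching a Rank-1 identification rate of 94.19% and an EER of 5.35% on the self-built EFace dataset, with stronger robustness under degraded illumination.

Abstract (Chinese translation)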

事件相机因其在光照鲁棒性和隐私友好性方面的固有优势，为人脸识别提供了一种前景广阔的传感模式。然而，由于事件流缺乏传统基于RGB的人脸识别系统所依赖的稳定光度外观，我们认为基于事件的人脸识别应当对由刚性面部运动与个体面部几何结构塑造的结构驱动的时空身份表征进行建模。由于目前仍缺乏专门用于基于事件的人脸识别的数据集，我们构建了EFace——一个在刚性面部运动下采集的小规模基于事件的人脸数据集。为了从这些有限的事件数据中有效学习，我们进一步提出了EventFace，这是一个用于基于事件的人脸识别的框架，它整合了空间结构和时间动态以进行身份建模。具体而言，我们采用低秩自适应（Low-Rank Adaptation, LoRA）技术，将结构化的面部先验知识从预训练的RGB人脸模型迁移到事件域，从而为身份建模建立一个可靠的空间基础。在此基础上，我们进一步引入了一个运动提示编码器（Motion Prompt Encoder, MPE）来显式编码时间特征，以及一个时空调制器（Spatiotemporal Modulator, STM）将其与空间特征融合，从而增强对身份相关事件模式的表征能力。大量实验表明，EventFace在所评估的基线方法中取得了最佳性能，其Rank-1识别率达到94.19%，等错误率（Equal Error Rate, EER）为5.35%。结果进一步表明，与竞争方法相比，EventFace在光照条件恶化时表现出更强的鲁棒性。此外，学习到的表征显示出降低的模板可重构性。

Abstract

Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%. Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.

Keywords: Event-based face recognition, Event cameras, Low-Rank Adaptation (LoRA), Spatiotemporal modeling, Structure-driven representation, Illumination robustness, Motion Prompt Encoder, Spatiotemporal Modulator
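
The abstract above says LoRA is used to carry structural priors from a pretrained RGB face model into the event domain. As a rough illustration of what a LoRA layer computes, here is a pure-Python sketch of the generic formulation: a frozen weight plus a scaled low-rank update. It is not EventFace's implementation; the dimensions, the scaling factor `alpha`, the zero initialization of `B`, and all function names are assumptions chosen for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), with W kept frozen.

    W: d_out x d_in frozen pretrained weight
    A: r x d_in trainable down-projection
    B: d_out x r trainable up-projection
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical sizes, far smaller than any real face model.
random.seed(0)
d_in, d_out, r = 6, 4, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y0 = lora_forward(W, A, B, x, alpha=4.0)

# Only A and B would be trained; W stays fixed.
trainable = r * (d_in + d_out)
frozen = d_in * d_out
```

With `B` initialized to zero the adapted layer reproduces the pretrained output exactly, so training starts from the RGB model's behavior and only the low-rank factors move.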

27. ❌ From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Authors: Carlos Schmidt, Simon Reiß Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06748v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

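Scoring rationale: The paper focuses on visual in-context learning within computer vision, specifically on turning a static model (such as DeLVM) into an interactive system (Interactive DeLVM) in which users dynamically steer predictions through visual cues such as scribbles, clicks, or boxes. Its core is the interactive adaptation of vision tasks (interactive segmentation, super-resolution, object removal), not innovation in large language models (LLMs) or deep learning fundamentals. It is therefore highly relevant only to the keyword “In-context Learning” (10 points), because it deals directly with the concept and application of visual in-context learning. The remaining keywords concern large language models, model training, inference optimization, agent systems, scientific AI applications, and similar topics that the paper does not address, so each is scored 0. The weighted total is 10.0 (10 × 1.0).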
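
The weighted total in the rationale above can be reproduced with a short sketch. It assumes, following the report's configuration section, that each keyword contributes weight × relevance and that the pass mark is 5.0 + 0.8 × total weight. The function names are illustrative and not taken from the report's tooling, and treating a score equal to the pass mark as a pass is an assumption.

```python
def weighted_total(rows):
    """Sum weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in rows)


def pass_mark(total_weight, base=5.0, per_weight=0.8):
    """Pass mark as stated in the report's configuration: 5.0 + 0.8 * total weight."""
    return base + per_weight * total_weight


# Paper 27: 27 keywords, each with weight 1.0; only "In-context Learning"
# has a non-zero relevance (10.0 out of 10).
rows = [(1.0, 0.0)] * 26 + [(1.0, 10.0)]
total = weighted_total(rows)
threshold = pass_mark(total_weight=27.0)
passed = total >= threshold  # assumption: ties count as a pass
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Because a single keyword contributes at most weight × 10 under this scheme, a paper that matches only one keyword cannot reach the 26.6 pass mark.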

!!! tip deepseek-chat TL;DR

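The paper addresses the lack of a user-interaction mechanism in visual in-context learning models. It proposes a method for turning static visual in-context learners (such as DeLVM) into interactive systems by encoding user-provided visual cues (such as scribbles and clicks) into the example input-output pairs, and reports gains of +7.95% IoU on interactive segmentation, +2.46 PSNR on directed super-resolution, and -3.14% LPIPS on interactive object removal.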

Abstract (Chinese translation)

视觉上下文学习模型旨在通过利用一组示例输入-输出对来适应新任务，从而无需任务特定微调即可实现快速泛化。然而，这些模型本质上运行于一种静态范式：尽管它们能够适应新任务，却缺乏任何机制来整合用户提供的引导信号（如涂鸦、点击或边界框）以引导或优化预测过程。这一限制在现实应用中尤为突出，因为用户往往希望主动引导模型预测，例如通过高亮目标对象以进行分割、指示应进行视觉修改的区域，或在复杂场景中隔离特定人物以运行针对性姿态估计。在本研究中，我们提出了一种简单方法，将静态视觉上下文学习模型（特别是DeLVM方法）转化为高度可控的用户驱动系统，即交互式DeLVM，使其能够通过涂鸦、点击或绘制边界框等自然视觉线索实现无缝交互。具体而言，通过将交互直接编码到示例输入-输出对中，我们保持了视觉上下文学习的核心理念不变：使用户能够通过未见过的交互方式提示模型而无需微调，并赋能用户通过个性化交互动态引导模型预测。我们的实验表明，现有最先进的视觉上下文学习模型无法有效利用交互线索，常常完全忽略用户引导。相比之下，我们的方法在可控的用户引导场景中表现卓越，在交互式分割任务中实现了$+7.95\%$交并比提升，在定向超分辨率任务中实现了$+2.46$峰值信噪比提升，在交互式对象移除任务中实现了$-3.14\%$学习感知图像块相似度降低。由此，我们的工作弥合了以用户为中心的视觉上下文学习中僵化的静态任务适应与流畅交互性之间的鸿沟。

Abstract

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95\%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14\%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.

Keywords: Visual In-Context Learning, Interactive DeLVM, User-Driven Systems, Interactive Segmentation, Super-Resolution, Object Removal, Task Adaptation, Visual Cues
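
For readers less familiar with the metrics quoted in the abstract above, the following sketch shows the standard definitions of IoU and PSNR on flat lists. It is not the paper's evaluation code: the function names, the `max_value` default of 255, and the convention of returning 1.0 for two empty masks are assumptions. LPIPS is omitted because it depends on a pretrained network.

```python
import math


def iou(pred, target):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy inputs, not data from the paper.
mask_pred = [1, 1, 0, 0]
mask_true = [1, 0, 1, 0]
print(iou(mask_pred, mask_true))   # one shared pixel out of three covered

img_pred = [0.0, 255.0]
img_true = [0.0, 0.0]
print(psnr(img_pred, img_true))
```

Higher is better for both IoU and PSNR, while lower is better for LPIPS, which is why the abstract reports the LPIPS change as a negative number.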

28. ❌ Sparse-Aware Neural Networks for Nonlinear Functionals: Mitigating the Exponential Dependence on Dimension

Authors: Jianfei Li, Shuo Huang, Han Feng, Ding-Xuan Zhou, Gitta Kutyniok Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06774v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	5.0/10	5.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

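Scoring rationale: The paper studies the role of sparsity in functional learning, which belongs to deep learning theory and is unrelated to most of the large-model keywords (LLMs, fine-tuning, alignment, reasoning, and so on). The only relevant keyword is “Mixture of Experts” OR “MoE” OR “Sparse Models”, because the paper centers on sparse models and sparse feature extraction; it does not target MoE architectures specifically, so it receives 5 points (partial relevance). No other keyword is addressed.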

!!! tip deepseek-chat TL;DR

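The paper studies how sparsity can mitigate the curse of dimensionality when deep networks learn nonlinear functionals on infinite-dimensional function spaces. It proposes a framework that combines convolutional and fully connected networks, proves that sparse approximators allow stable recovery from discrete samples, and obtains improved approximation rates with fewer samples.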

Abstract (Chinese translation)

深度神经网络已成为学习定义在无限维函数空间上的算子的强大工具。然而，现有理论常面临与维数过高和可解释性有限相关的困难。本研究探讨了稀疏性如何帮助解决函数学习（算子学习的核心组成部分）中的这些挑战。我们提出了一个框架，该框架利用卷积架构从有限数量的样本中提取稀疏特征，并结合深度全连接网络来有效逼近非线性泛函。通过通用离散化方法，我们证明了稀疏逼近器能够从离散样本中实现稳定恢复。此外，确定性采样方案与随机采样方案均足以支持我们的分析。这些发现使得在各类函数空间（包括具有快速频率衰减和混合光滑性的空间）中获得了更优的逼近速率并减少了所需样本量。它们也为稀疏性如何缓解函数学习中的维度诅咒问题提供了新的理论见解。

Abstract

Deep neural networks have emerged as powerful tools for learning operators defined over infinite-dimensional function spaces. However, existing theories frequently encounter difficulties related to dimensionality and limited interpretability. This work investigates how sparsity can help address these challenges in functional learning, a central ingredient in operator learning. We propose a framework that employs convolutional architectures to extract sparse features from a finite number of samples, together with deep fully connected networks to effectively approximate nonlinear functionals. Using universal discretization methods, we show that sparse approximators enable stable recovery from discrete samples. In addition, both the deterministic and the random sampling schemes are sufficient for our analysis. These findings lead to improved approximation rates and reduced sample sizes in various function spaces, including those with fast frequency decay and mixed smoothness. They also provide new theoretical insights into how sparsity can alleviate the curse of dimensionality in functional learning.

Keywords: sparse models, functional learning, curse of dimensionality, deep neural networks, nonlinear functionals, approximation theory, convolutional architectures, sample complexity

29. ❌ Fast Spatial Memory with Elastic Test-Time Training

Authors: Ziqiao Ma, Xueyang Yu, Haoyu Zhen, Yuncong Yang, Joyce Chai, Chuang Gan Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07350v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

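Scoring rationale: The paper focuses on 3D/4D reconstruction in computer vision. It proposes Elastic Test-Time Training to improve LaCT (Large Chunk Test-Time Training) and builds the Fast Spatial Memory (FSM) model on top of it. Its core is test-time training on long visual sequences, fast-weight updates, and spatial memory modeling, which is an application of deep learning to visual scientific computing. The content is unrelated to the vast majority of the scoring keywords, which mainly cover the technical principles, training methods, inference optimization, alignment, and agent systems of large language models. The only loosely related keyword is “Pre-training”, because the paper states that FSM is pre-trained on large-scale 3D/4D data; this is a step in building the model rather than the paper's core contribution, so it receives 5 points (partial relevance). No other keyword is addressed.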

!!! tip deepseek-chat TL;DR

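The paper targets catastrophic forgetting and overfitting in test-time training for long-sequence 3D/4D reconstruction. It proposes Elastic Test-Time Training, inspired by elastic weight consolidation, and builds the Fast Spatial Memory model for high-quality, efficient reconstruction from long observation sequences.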

Abstract (Chinese translation)

大块测试时训练（LaCT）在长上下文三维重建任务中表现出色，但其完全可塑的推理时更新仍易受灾难性遗忘和过拟合的影响。因此，LaCT通常被实例化为覆盖整个输入序列的单一大型数据块，未能实现单次处理任意长序列的更广泛目标。受弹性权重巩固启发，我们提出弹性测试时训练方法，通过围绕固定锚点状态的费舍尔加权弹性先验来稳定LaCT的快速权重更新。该锚点以过去快速权重的指数移动平均形式演化，以平衡稳定性与可塑性。基于此更新架构，我们引入快速空间记忆（FSM）——一种高效可扩展的四维重建模型，能够从长观测序列中学习时空表征，并渲染新颖的视角-时间组合。我们在大规模精选三维/四维数据上对FSM进行预训练，以捕捉复杂空间环境的动态特性与语义信息。大量实验表明，FSM支持长序列的快速适应，能够以更小的数据块实现高质量的三维/四维重建，同时缓解相机插值捷径问题。总体而言，我们希望推动LaCT突破有限单块处理的限制，实现稳健的多块适应能力——这是向真正长序列泛化的必要步骤，同时显著缓解激活内存瓶颈问题。

Abstract

Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training inspired by elastic weight consolidation, that stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-trained FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks and mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.

Keywords: Test-Time Training, 3D Reconstruction, 4D Reconstruction, Long-context, Elastic Weight Consolidation, Fast Spatial Memory, Spatiotemporal Representations, Catastrophic Forgetting
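
The abstract above describes a Fisher-weighted elastic prior around an anchor that evolves as an exponential moving average of past fast weights. One way to read that description is sketched below in the style of elastic weight consolidation. It is an interpretation, not the paper's algorithm: the penalty form, the order of the two updates, the hyperparameters `lr`, `lam`, and `beta`, the hand-picked Fisher values, and the toy quadratic loss are all assumptions.

```python
def elastic_step(w, anchor, fisher, grad, lr=0.1, lam=1.0):
    """One gradient step on  task_loss + (lam / 2) * sum_i F_i * (w_i - a_i)^2.

    grad is the gradient of the task loss alone; the elastic term adds
    lam * F_i * (w_i - a_i) to it for each weight.
    """
    return [
        wi - lr * (gi + lam * fi * (wi - ai))
        for wi, ai, fi, gi in zip(w, anchor, fisher, grad)
    ]


def ema_anchor(anchor, w, beta=0.9):
    """Move the anchor toward the current fast weights by exponential moving average."""
    return [beta * ai + (1.0 - beta) * wi for ai, wi in zip(anchor, w)]


# Toy task: minimize 0.5 * (w - target)^2 per coordinate, so grad = w - target.
target = [1.0, -2.0]
w = [0.0, 0.0]
anchor = list(w)
fisher = [5.0, 0.1]  # hypothetical importances: coordinate 0 is held more tightly
for _ in range(20):
    grad = [wi - ti for wi, ti in zip(w, target)]
    w = elastic_step(w, anchor, fisher, grad)
    anchor = ema_anchor(anchor, w)
```

In this reading, a large Fisher value keeps a weight close to the slowly moving anchor (stability), while a small one lets it follow the task gradient (plasticity).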

30. ❌ How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Authors: Roberto Brusnicki, Mattia Piccinini, Johannes Betz Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06750v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

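Scoring rationale: The paper evaluates vision-language models (VLMs) in autonomous driving scenes, which is an application study of large models in a specific domain. Its core is assessing how well existing VLMs understand sequential driving scenes and how input configurations affect performance. It is therefore only partially relevant to the first keyword (“Large Language Models” OR “LLMs” OR “Foundation Models”), since VLMs can be viewed as an extension or variant of large models that combines the visual and language modalities. The paper does not examine core LLM techniques (MoE, scaling laws, training methods, and so on) and does not touch the other keywords (reasoning methods, alignment, compression, and so on), which are unrelated to its content.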

!!! tip deepseek-chat TL;DR

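Using the VENUSS framework, the paper systematically evaluates 25+ vision-language models on sequential driving scenes. It finds that even top models reach only 57% accuracy, below human performance under similar constraints (65%), and that the models are notably weak at understanding vehicle dynamics and temporal relations.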

Abstract (Chinese translation)

视觉语言模型（Vision-Language Models, VLMs）在自动驾驶任务中的应用日益增多，但其在连续驾驶场景中的性能仍缺乏充分表征，尤其是在输入配置如何影响其能力方面。我们提出了VENUSS（VLM Evaluation oN Understanding Sequential Scenes），这是一个用于系统分析VLM在连续驾驶场景中性能敏感性的框架，为未来研究建立了基准。基于现有数据集，VENUSS从驾驶视频中提取时间序列，并针对自定义类别生成结构化评估。通过比较超过25个现有VLM在2600多个场景中的表现，我们发现即使顶级模型也仅达到57%的准确率，未能达到人类在类似约束条件下的表现（65%），并暴露出显著的能力差距。我们的分析表明，VLM在静态物体检测方面表现优异，但在理解车辆动态和时间关系方面存在困难。VENUSS首次对VLM进行了系统性的敏感性分析，重点关注输入图像配置——分辨率、帧数、时间间隔、空间布局和呈现模式——如何影响其在连续驾驶场景中的性能。补充材料可在https://V3NU55.github.io获取。

Abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance in similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding the vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io

Keywords: Vision-Language Models, autonomous driving, sequential scenes, sensitivity analysis, temporal understanding, model evaluation, driving scenarios, input configurations

31. ❌ Toward a Tractability Frontier for Exact Relevance Certification

Authors: Tristan Simas Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07349v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	0.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

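Scoring rationale: The paper studies the tractability frontier of exact relevance certification in decision problems, which belongs to theoretical computer science and optimization theory. The concepts named in the abstract, such as ‘coordinate-structured decision problem’, ‘finite primitive basis’, ‘optimizer-quotient realizability’, ‘meta-impossibility theorem’, ‘closure laws’, and ‘obstruction families’, are unrelated to deep learning or large-model techniques. The paper involves no neural network architecture, training method, inference optimization, alignment technique, or scientific AI application, so every keyword has a relevance of 0.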

!!! tip deepseek-chat TL;DR

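The paper studies the tractability frontier of exact relevance certification in coordinate-structured decision problems and proves that, for correct classifiers on closure-closed domains, no efficiently checkable structural predicate can exactly characterize that frontier.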

Abstract (Chinese translation)

精确相关性认证探究在坐标结构决策问题中，哪些坐标是确定最优行动所必需的。本文所处理的易处理族允许有限原始基，但优化器-商可实现性达到最大，因此仅凭商形状无法刻画前沿。
我们证明了一个关于高效可检验结构谓词的元不可能性定理，这些谓词在精确认证的定理强制闭包律下保持不变。通过零失真摘要的结构收敛、商熵界限及支集计数论证，解释了这些闭包律为何是典范的。我们通过为四个阻碍族——即主导对集中、边际掩蔽、幽灵行动集中以及加性/状态逐点偏移集中——构造同轨道分歧来证明该定理，其中使用了行动无关、针对成对目标的仿射见证。因此，在闭包封闭域上，任何正确的易处理性分类器都无法对这些族给出精确刻画。此处，闭包轨道一致性是由正确性强制而非作为不变性公理假设的。因此，该结果适用于闭包封闭域上的正确分类器，而不仅限于通过特定可容许性包给出的分类器。

Abstract

Exact relevance certification asks which coordinates are necessary to determine the optimal action in a coordinate-structured decision problem. The tractable families treated here admit a finite primitive basis, but optimizer-quotient realizability is maximal, so quotient shape alone cannot characterize the frontier. We prove a meta-impossibility theorem for efficiently checkable structural predicates invariant under the theorem-forced closure laws of exact certification. Structural convergence with zero-distortion summaries, quotient entropy bounds, and support-counting arguments explains why those closure laws are canonical. We establish the theorem by constructing same-orbit disagreements for four obstruction families, namely dominant-pair concentration, margin masking, ghost-action concentration, and additive/statewise offset concentration, using action-independent, pair-targeted affine witnesses. Consequently no correct tractability classifier on a closure-closed domain yields an exact characterization over these families. Here closure-orbit agreement is forced by correctness rather than assumed as an invariance axiom. The result therefore applies to correct classifiers on closure-closed domains, not only to classifiers presented through a designated admissibility package.

Keywords: exact relevance certification, tractability frontier, coordinate-structured decision problem, meta-impossibility theorem, closure laws, obstruction families, optimizer-quotient realizability

32. ❌ RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

Authors: Wenjing Margaret Mao, Jefferson Ng, Luyang Hu, Daniel Gehrig, Antonio Loquercio Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07331v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

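Scoring rationale: The paper develops a data-collection system for robot learning. It studies a hybrid wearable (IMUs plus AR glasses) for human pose estimation and motion capture in support of humanoid policy learning. The content revolves entirely around sensor fusion, pose estimation, dataset collection, and robot-learning applications; it does not involve large language models, deep learning fundamentals, training optimization, inference acceleration, AI alignment, agent systems, or any other technique or application covered by the keywords. All keywords relate directly to large models and deep learning techniques, whereas this paper is purely a robotics hardware and data-collection study, so every keyword has a relevance of 0.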

!!! tip deepseek-chat TL;DR

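The paper presents RoSHI, a hybrid wearable system that fuses low-cost IMUs with Project Aria glasses to estimate the wearer's 3D pose and body shape. It targets the trade-off among portability, robustness to occlusion, and global consistency in in-the-wild human data collection, and shows that the recorded motion data are suitable for real-world humanoid policy learning.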

Abstract (Chinese translation)

扩展机器人学习规模可能需要包含丰富且长时程野外交互的人类数据。现有数据采集方法在便携性、遮挡鲁棒性与全局一致性之间存在权衡。我们提出RoSHI——一种融合低成本稀疏惯性测量单元与Project Aria眼镜的混合式可穿戴系统，通过第一人称视角感知在公制全局坐标系中估计穿戴者的完整三维姿态与身体形态。该系统的设计基于两种传感器的互补性：惯性测量单元为遮挡和高速运动提供鲁棒性，而第一人称SLAM（同步定位与建图）则锚定长时程运动并稳定上半身姿态。我们采集了包含敏捷动作的数据集以评估RoSHI系统。在该数据集上，我们的方法总体优于其他第一人称基线系统，并与当前最优的第三人称基线系统（SAM3D）性能相当。最后，我们验证了通过本系统记录的运动数据适用于现实世界人形机器人策略学习。视频、数据等更多信息请访问项目网页：https://roshi-mocap.github.io/

Abstract

Scaling up robot learning will likely require human data containing rich and long-horizon interactions in the wild. Existing approaches for collecting such data trade off portability, robustness to occlusion, and global consistency. We introduce RoSHI, a hybrid wearable that fuses low-cost sparse IMUs with the Project Aria glasses to estimate the full 3D pose and body shape of the wearer in a metric global coordinate frame from egocentric perception. This system is motivated by the complementarity of the two sensors: IMUs provide robustness to occlusions and high-speed motions, while egocentric SLAM anchors long-horizon motion and stabilizes upper body pose. We collect a dataset of agile activities to evaluate RoSHI. On this dataset, we generally outperform other egocentric baselines and perform comparably to a state-of-the-art exocentric baseline (SAM3D). Finally, we demonstrate that the motion data recorded from our system are suitable for real-world humanoid policy learning. For videos, data and more, visit the project webpage: https://roshi-mocap.github.io/

Keywords: robot learning, wearable system, human pose estimation, motion capture, egocentric perception, humanoid policy learning, sensor fusion, in-the-wild data collection

33. ❌ MoRight: Motion Control Done Right

Authors: Shaowei Liu, Xuanchi Ren, Tianchang Shen, Huan Ling, Saurabh Gupta, Shenlong Wang, Sanja Fidler, Jun Gao Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07348v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

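Scoring rationale: MoRight focuses on motion control in video generation, in particular on disentangling object motion from camera viewpoint control and on modeling motion causality. Its core is an application of computer vision and generative models (such as diffusion models), not innovation in large language models (LLMs) or deep learning fundamentals. All scoring keywords target LLM techniques, training, alignment, reasoning, applications (such as agents), or specific scientific domains (such as bioinformatics). The paper's title and abstract mention no LLM, no related technique (such as MoE, RLHF, or RAG), and no concrete AI for Science application. Every keyword therefore has a relevance of 0.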

!!! tip deepseek-chat TL;DR

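The paper proposes the MoRight framework to address two key problems of motion control in video generation: disentangling object motion from camera viewpoint control, and modeling motion causality. Users can control object motion and adjust the viewpoint separately, and can run forward or inverse reasoning to generate physically plausible dynamic videos. The paper reports state-of-the-art performance on three benchmarks.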

Abstract (Chinese translation)

生成运动控制视频——即用户指定的动作在自由选择的视角下驱动物理合理的场景动态——需要具备两种能力：(1)解耦的运动控制，允许用户分别控制物体运动并调整摄像机视角；(2)运动因果性，确保用户驱动的动作能触发其他物体的连贯反应，而非仅进行像素位移。现有方法在这两方面均存在不足：它们将摄像机与物体运动纠缠为单一跟踪信号，并将运动视为运动学位移而未建模物体间的因果关系。我们提出MoRight这一统一框架，通过解耦运动建模同时解决这两个局限。物体运动在规范静态视角中被指定，并通过时序跨视角注意力机制转移到任意目标摄像机视角，从而实现摄像机与物体控制的解耦。我们进一步将运动分解为主动（用户驱动）与被动（结果）分量，训练模型从数据中学习运动因果性。在推理阶段，用户既可提供主动运动由MoRight预测结果（正向推理），也可指定期望的被动结果由MoRight反推合理驱动动作（逆向推理），同时全程支持自由调整摄像机视角。在三个基准测试上的实验表明，该方法在生成质量、运动可控性与交互感知方面均达到最先进性能。

Abstract

Generating motion-controlled videos–where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints–demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motion. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static-view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and MoRight predicts consequences (forward reasoning), or specify desired passive outcomes and MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

Keywords: motion-controlled video generation, disentangled motion control, motion causality, temporal cross-view attention, active and passive motion, forward and inverse reasoning, viewpoint adjustment, interaction awareness

34. ❌ Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation

Authors: Priscilla Kyei Danso, Mohammad Saqib Hasan, Niranjan Balasubramanian, Omar Chowdhury Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07321v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

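Scoring rationale: The paper's core topic is an evaluation of how LLMs perform on translating natural language into LTL formulas. It is highly relevant only to the "Large Language Models" keyword (rated 10); no other keyword is addressed, so each of them is rated 0. The paper does not propose a new LLM technique. It evaluates existing LLMs on a domain-specific task, which makes it applied research and not a technical innovation.

!!! tip deepseek-chat TL;DR

This paper evaluates how effectively large language models translate natural-language assertions into Linear Temporal Logic formulas. It finds that LLMs handle the syntax of LTL better than its semantics, that more detailed prompts generally help, and that reformulating the task as a Python code-completion problem substantially improves performance.

Abstract (translated)

Propositional Linear Temporal Logic (LTL) is a widely used formal language for specifying requirements and security and privacy policies for software, networks, and systems. Because of its intricate semantics, however, expressing such requirements and policies in LTL remains challenging. Since many security and privacy analysis tools take LTL formulas as input, this difficulty puts those tools out of reach for many developers and analysts. Large Language Models (LLMs) could broaden access to such tools by turning natural language fragments into LTL formulas. This paper tests that premise by evaluating how effectively several representative LLMs translate assertive English sentences into LTL formulas. Using both human-generated and synthetic benchmark data, we evaluate effectiveness along syntactic and semantic dimensions. The results reveal three findings: (1) consistent with prior work, LLMs perform better on the syntactic aspects of LTL than on the semantic ones; (2) more detailed prompts generally improve their performance; (3) reformulating the task as a Python code-completion problem substantially improves overall performance. We also discuss the challenges of conducting a fair evaluation on this task and offer recommendations for future work.

Abstract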

Propositional Linear Temporal Logic (LTL) is a popular formalism for specifying desirable requirements and security and privacy policies for software, networks, and systems. Yet expressing such requirements and policies in LTL remains challenging because of its intricate semantics. Since many security and privacy analysis tools require LTL formulas as input, this difficulty places them out of reach for many developers and analysts. Large Language Models (LLMs) could broaden access to such tools by translating natural language fragments into LTL formulas. This paper evaluates that premise by assessing how effectively several representative LLMs translate assertive English sentences into LTL formulas. Using both human-generated and synthetic ground-truth data, we evaluate effectiveness along syntactic and semantic dimensions. The results reveal three findings: (1) in line with prior findings, LLMs perform better on syntactic aspects of LTL than on semantic ones; (2) they generally benefit from more detailed prompts; and (3) reformulating the task as a Python code-completion problem substantially improves overall performance. We also discuss challenges in conducting a fair evaluation on this task and conclude with recommendations for future work.

Keywords: Large Language Models, LTL translation, syntactic evaluation, semantic evaluation, prompt engineering, Python code-completion, natural language to formal logic
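
To make the third finding concrete, the sketch below shows one way the translation task could be posed as Python code completion: LTL operators become small Python classes, and the model would be asked to complete the final assignment. The class names (`Atom`, `Not`, `Implies`, `Always`, `Eventually`), the example sentence, and the `G`/`F` output syntax are all illustrative assumptions; the abstract does not describe the paper's actual prompt format.

```python
from dataclasses import dataclass


# Hypothetical LTL building blocks; not taken from the paper.
@dataclass(frozen=True)
class Atom:
    name: str

    def __str__(self) -> str:
        return self.name


@dataclass(frozen=True)
class Not:
    arg: object

    def __str__(self) -> str:
        return f"!({self.arg})"


@dataclass(frozen=True)
class Implies:
    left: object
    right: object

    def __str__(self) -> str:
        return f"({self.left} -> {self.right})"


@dataclass(frozen=True)
class Always:  # the "globally" operator, written G
    arg: object

    def __str__(self) -> str:
        return f"G({self.arg})"


@dataclass(frozen=True)
class Eventually:  # the "finally" operator, written F
    arg: object

    def __str__(self) -> str:
        return f"F({self.arg})"


# English: "Every request is eventually granted."
# In a code-completion prompt, the model would fill in the right-hand side.
formula = Always(Implies(Atom("request"), Eventually(Atom("grant"))))
print(formula)
```

A practical benefit of this framing is that a malformed answer fails to construct as a Python object, so syntactic validity can be checked mechanically. Semantic correctness still needs a separate equivalence check.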

35. ❌ Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Authors: Jackson Petty, Jaulie Goe, Tal Linzen Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07320v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

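Scoring rationale: The paper's core topic is LLM translation driven by in-context learning, which bears directly on the "Large Language Models" and "In-context Learning" keywords; each is rated 10. The paper notes that LLMs hallucinate new words during translation, which has some connection to "Hallucination Mitigation"; this is rated 5. Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in the abstract nor relevant to it, and are rated 0.

!!! tip deepseek-chat TL;DR

This paper studies how well large language models (LLMs) perform formal-language translation when a synchronous context-free grammar is supplied as an in-context description. Translation accuracy drops markedly as grammar size and sentence length grow, and differences in morphology and script severely hurt performance. The main errors are recalling the wrong target-language words, hallucinating new words, and leaving source-language words untranslated.

Abstract (translated)

Low-resource languages pose a challenge for machine translation with large language models (LLMs), which depend on large amounts of training data. One potential way around this data dependence is to exploit LLMs' ability to use in-context descriptions of languages, such as textbooks and dictionaries. To do so, LLMs must be able to infer the link between a language's grammatical description and the sentences in question. Here we isolate this ability with a formal analogue of the task: string transduction based on a formal grammar provided in context. We construct synchronous context-free grammars that define pairs of formal languages designed to model particular aspects of natural-language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the grammar size, the sentence length, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLM translation accuracy decreases markedly as grammar size and sentence length increase. Second, differences in morphology and written representation between the source and target languages can severely degrade model performance. Third, we examine the types of errors the models make and find that they are most prone to recalling the wrong words from the target-language vocabulary, hallucinating new words, or leaving source-language words untranslated.

Abstract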

Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs’ ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages’ grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs’ translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, we examine the types of errors committed by models and find they are most prone to recall the wrong words from the target language vocabulary, hallucinate new words, or leave source-language words untranslated.

Keywords: Large Language Models, In-context Learning, Machine Translation, Synchronous Context-Free Grammar, Formal Language, Low-resource Languages, Hallucination, Grammar Size
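
For readers unfamiliar with synchronous context-free grammars, the sketch below generates a source sentence and its reference translation from a single shared derivation. The toy grammar, its vocabulary, and the `derive` helper are invented for illustration and are not the grammars used in the paper. The sketch also assumes that no nonterminal appears twice in one right-hand side.

```python
import random

# Each rule pairs a source right-hand side with a target right-hand side.
# Nonterminals (keys of RULES) are linked across the two sides by name.
RULES = {
    "S": [(["NP", "VP"], ["NP", "VP"])],
    "VP": [(["V", "NP"], ["NP", "V"])],  # verb-object source, object-verb target
    "NP": [(["dog"], ["inu"]), (["cat"], ["neko"])],
    "V": [(["sees"], ["miru"])],
}


def derive(symbol, rng):
    """Expand `symbol` once and return the aligned (source, target) word lists."""
    src_rhs, tgt_rhs = rng.choice(RULES[symbol])
    # Expand every nonterminal exactly once so both sides share the derivation.
    sub = {s: derive(s, rng) for s in src_rhs if s in RULES}
    src = [w for s in src_rhs for w in (sub[s][0] if s in sub else [s])]
    tgt = [w for s in tgt_rhs for w in (sub[s][1] if s in sub else [s])]
    return src, tgt


source, target = derive("S", random.Random(0))
print(" ".join(source), "=>", " ".join(target))
```

In an evaluation of the kind the abstract describes, the grammar and the source string would go into the prompt, and the target string would serve as the reference for exact-match scoring.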

36. ❌ Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Authors: Xin Tian, Jiuliu Lu, Ephraim Tsalik, Bart Wanders, Colleen Knoth, Julian Knight Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07298v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

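Scoring rationale: The paper's core contribution is ROAM (Region-graph OptimAl-transport Mixture-of-experts), a Mixture-of-Experts (MoE) method for whole-slide image classification that uses optimal transport to achieve balanced routing. It is therefore highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (rated 10), since MoE is the paper's core technical framework. It has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (rated 8), because it is applied to computational pathology (a bioinformatics field), though its main focus is the MoE method and not the broader topic of AI for Science. The other keywords (LLMs, RLHF, RAG, and so on) are unrelated: the paper concerns computer vision and medical image analysis and does not involve language models or related techniques.

!!! tip deepseek-chat TL;DR

This paper targets routing imbalance in Mixture-of-Experts methods for whole-slide image classification. It proposes ROAM, an MoE aggregator based on region-graph optimal transport, which is competitive with strong baselines on four benchmarks and reaches an external AUC of 0.845 ± 0.019 on the NSCLC generalisation task.

Abstract (translated)

Multiple Instance Learning (MIL) is the dominant framework for classifying gigapixel whole-slide images (WSIs) in computational pathology. However, existing MIL aggregators route all instances through a shared pathway, which limits their ability to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by assigning instances to specialised expert subnetworks, but unconstrained softmax routing can leave expert utilisation highly imbalanced: one or a few experts absorb most of the routing mass, and the mixture collapses to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers through capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units so that routing aligns with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, which enforces balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to route coherently to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM performs competitively against strong MIL and MoE baselines and reaches an external AUC of 0.845 ± 0.019 on the non-small-cell lung cancer generalisation task (TCGA-CPTAC).

Abstract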

Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts. Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive against strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches external AUC 0.845 ± 0.019.

Keywords: Mixture-of-Experts, Whole-Slide Image Classification, Optimal Transport, Multiple Instance Learning, Computational Pathology, Region-graph, Spatial Routing, Balanced Expert Utilisation
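
The capacity-constrained routing idea can be illustrated with plain Sinkhorn iterations. The NumPy sketch below is a generic entropic optimal-transport solver, not the authors' implementation: the function name, the `epsilon` and `n_iters` defaults, and the uniform row marginal are assumptions, and the graph-regularisation step described in the abstract is omitted.

```python
import numpy as np


def sinkhorn_routing(scores, capacity, epsilon=0.1, n_iters=200):
    """Soft-assign regions to experts under per-expert capacity marginals.

    scores:   (n_regions, n_experts) affinities, higher means a better match.
    capacity: (n_experts,) share of total routing mass per expert, summing to 1.
    Returns a transport plan whose column sums equal `capacity`.
    """
    n_regions, _ = scores.shape
    row_marginal = np.full(n_regions, 1.0 / n_regions)  # each region carries equal mass
    # Gibbs kernel; subtracting the global max keeps the exponentials bounded by 1.
    kernel = np.exp((scores - scores.max()) / epsilon)
    u = np.ones(n_regions)
    v = np.ones(scores.shape[1])
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)   # rescale rows toward the region marginal
        v = capacity / (kernel.T @ u)     # rescale columns toward expert capacity
    return u[:, None] * kernel * v[None, :]
```

Because the final update rescales the columns, each expert receives exactly its capacity share even when the raw scores favour a single expert. That is the sense in which balance holds by construction, with no auxiliary load-balancing loss.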

Authors: Timothy K Johnsen, Marco Levorato Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07286v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

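Scoring rationale: CADENCE is an adaptive depth estimation system that optimises navigation for autonomous vehicles in resource-constrained settings. Although it involves deep learning (a depth estimation network) and computational efficiency, every scoring keyword targets techniques related to large language models (LLMs), covering model architecture, training methods, inference optimisation, alignment, and agent systems. The paper studies computer vision and embedded-systems optimisation and has no direct link to the LLM stack, so every keyword is rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes CADENCE, an adaptive depth estimation system that dynamically adjusts computational complexity to balance perception accuracy against compute cost for autonomous vehicles. On an embedded platform it cuts energy expenditure by 75.0% and improves navigation accuracy by 7.43%.

Abstract (translated)

Autonomous vehicles deployed in remote environments typically rely on embedded processors, compact batteries, and lightweight sensors. These hardware limits conflict with the need to build robust representations of the environment, which often requires running computationally intensive deep neural networks for perception. To address this challenge, we present CADENCE, an adaptive system that dynamically scales the computational complexity of a slimmable monocular depth estimation network according to navigation needs and environmental context. By closing the loop between perception fidelity and actuation requirements, CADENCE ensures that high-precision computing is used only when mission-critical. We evaluate it on our released open-source testbed, which integrates Microsoft AirSim with an NVIDIA Jetson Orin Nano. Compared with a state-of-the-art static approach, CADENCE reduces sensor acquisitions, power consumption, and inference latency by 9.67%, 16.1%, and 74.8%, respectively. The results show an overall 75.0% reduction in energy expenditure together with a 7.43% increase in navigation accuracy.

Abstract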

Autonomous vehicles deployed in remote environments typically rely on embedded processors, compact batteries, and lightweight sensors. These hardware limitations conflict with the need to derive robust representations of the environment, which often requires executing computationally intensive deep neural networks for perception. To address this challenge, we present CADENCE, an adaptive system that dynamically scales the computational complexity of a slimmable monocular depth estimation network in response to navigation needs and environmental context. By closing the loop between perception fidelity and actuation requirements, CADENCE ensures high-precision computing is only used when mission-critical. We conduct evaluations on our released open-source testbed that integrates Microsoft AirSim with an NVIDIA Jetson Orin Nano. As compared to a state-of-the-art static approach, CADENCE decreases sensor acquisitions, power consumption, and inference latency by 9.67%, 16.1%, and 74.8%, respectively. The results demonstrate an overall reduction in energy expenditure by 75.0%, along with an increase in navigation accuracy by 7.43%.

Keywords: adaptive depth estimation, autonomous vehicles, computational efficiency, embedded systems, slimmable neural networks, energy optimization, perception-actuation loop, NVIDIA Jetson

38. ❌ Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Authors: Guo Gan, Yuxuan Ding, Cong Chen, Yuwei Ren, Yin Huang, Hong Zhou Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07277v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

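Scoring rationale: The paper focuses on the training efficiency of online reinforcement learning for Android agents. It proposes a new training paradigm (Single State Multiple Actions) and a framework (Android Coach) that improves efficiency by adding a critic network and a group-wise advantage estimator. Although the paper involves agents and reinforcement learning, the keywords are all aimed at large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, quantization, and so on) or at specific AI application areas such as bioinformatics. The paper's core is reinforcement learning algorithm optimisation and agent training in the Android environment; it does not involve LLM techniques, principles, or applications, nor designated areas such as AI for Science. Every keyword is therefore judged unrelated and rated 0.

!!! tip deepseek-chat TL;DR

This paper addresses the low training efficiency of online reinforcement learning for Android agents. It proposes the Android Coach framework, whose Single State Multiple Actions paradigm yields success-rate gains of 7.5% and 8.3% over UI-TARS-1.5-7B on AndroidLab and AndroidWorld and 1.4x higher training efficiency than PPO and GRPO at matched success rates.

Abstract (translated)

Online reinforcement learning is an effective way to improve the capabilities of Android agents. However, because of the high latency of emulators and the sample inefficiency of existing reinforcement learning algorithms, guiding agents to learn through online interaction is prohibitively expensive. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm. It updates the policy using only the one-to-one state-action pairs produced by online one-way rollouts and does not fully explore each costly emulator state. This paper proposes Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, so the agent can sample and use multiple actions for a single online state. We achieve this without extra emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: on the AndroidLab and AndroidWorld benchmarks it improves success rate over UI-TARS-1.5-7B by 7.5% and 8.3%, respectively, and at matched success rates its training efficiency is 1.4x that of the Single State Single Action methods PPO and GRPO.

Abstract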

Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.

Keywords: Online Reinforcement Learning, Android Agents, Training Efficiency, Single State Multiple Actions, Critic Network, Advantage Estimator, Emulator Latency, Sample Efficiency
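
One plausible reading of a group-wise advantage estimator based on averaged critic outputs is sketched below: for each emulator state, the critic scores several sampled actions, and each action's advantage is its value minus the group mean. The function name and the exact form of the baseline are assumptions. The paper's estimator may add normalisation or process-reward terms that the abstract does not describe.

```python
import numpy as np


def group_advantages(critic_values):
    """Advantage of each sampled action relative to its group average.

    critic_values: (n_states, n_actions) critic estimates Q(s, a_k) for the
    K candidate actions sampled at each visited emulator state.
    """
    q = np.asarray(critic_values, dtype=float)
    baseline = q.mean(axis=1, keepdims=True)  # averaged critic output per state
    return q - baseline


# Two states, three candidate actions each (made-up critic values).
print(group_advantages([[1.0, 2.0, 3.0], [0.5, 0.5, 2.0]]))
```

Only the critic is queried for the extra actions, so the emulator still advances once per state. This matches the abstract's claim that the multi-action paradigm adds no emulator overhead.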

39. ❌ Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS

Authors: Luca Pennati, Andong Hu, Ivy Peng, Lukas Müllender, Stefano Markidis Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07276v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

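Scoring rationale: The paper focuses on integrating AI-driven interatomic potentials (DeePMD-kit) into the GROMACS molecular dynamics package to enable high-performance multi-GPU simulation. Its core is the application of AI to scientific computing (specifically biophysics and cheminformatics), so it is highly relevant only to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (rated 10). The other keywords mainly concern large language models (LLMs) and related techniques such as fine-tuning, inference optimisation, and agents. This paper studies the engineering integration and performance optimisation of a domain-specific deep learning model (DeePMD) in scientific simulation and does not involve LLMs or general-purpose AI techniques, so all other keywords are rated 0.

!!! tip deepseek-chat TL;DR

This work integrates DeePMD-kit, a framework for AI-driven interatomic potentials, into GROMACS molecular dynamics simulations for high-performance multi-GPU execution. The results indicate that production molecular dynamics with near ab initio accuracy is feasible at scale.

Abstract (translated)

GROMACS is a de facto standard for classical molecular dynamics simulation. AI-driven interatomic potentials now pursue near-quantum accuracy at molecular dynamics throughput, which raises a key challenge: how to embed neural-network inference into multi-GPU simulations while keeping performance high. This work integrates the machine-learning interatomic potential framework DeePMD-kit into GROMACS, enabling domain-decomposed, GPU-accelerated inference across multi-node systems. We extend the GROMACS NNPot interface with a DeePMD backend and introduce a domain decomposition layer decoupled from the main simulation. Inference runs concurrently on all processes, with two MPI collectives per step: one to broadcast atomic coordinates and one to aggregate and redistribute forces. We train an in-house DPA-1 model (1.6 M parameters) on a dataset of solvated protein fragments. We first validate the implementation on a small protein system, then benchmark the GROMACS-DeePMD integration on a 15,668-atom protein system using NVIDIA A100 and AMD MI250x GPU clusters with up to 32 devices. Strong-scaling efficiency reaches 66% at 16 devices and 40% at 32 devices; weak-scaling efficiency holds at 80% at 16 devices and reaches 48% (MI250x) and 40% (A100) at 32 devices. Profiling with the ROCm System profiler shows that more than 90% of wall time is spent in DeePMD inference, while MPI collectives account for less than 10%, mainly because they act as a global synchronisation point. The main bottlenecks are the irreducible ghost-atom cost set by the cutoff radius (confirmed by a simple throughput model) and load imbalance across ranks. These results indicate that production molecular dynamics with near ab initio accuracy is feasible at scale in GROMACS.

Abstract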

GROMACS is a de facto standard for classical Molecular Dynamics (MD). The rise of AI-driven interatomic potentials that pursue near-quantum accuracy at MD throughput now poses a significant challenge: embedding neural-network inference into multi-GPU simulations while retaining high performance. In this work, we integrate the MLIP framework DeePMD-kit into GROMACS, enabling domain-decomposed, GPU-accelerated inference across multi-node systems. We extend the GROMACS NNPot interface with a DeePMD backend, and we introduce a domain decomposition layer decoupled from the main simulation. The inference is executed concurrently on all processes, with two MPI collectives used each step to broadcast coordinates and to aggregate and redistribute forces. We train an in-house DPA-1 model (1.6 M parameters) on a dataset of solvated protein fragments. We validate the implementation on a small protein system, then we benchmark the GROMACS-DeePMD integration with a 15,668-atom protein on NVIDIA A100 and AMD MI250x GPUs up to 32 devices. Strong-scaling efficiency reaches 66% at 16 devices and 40% at 32; weak-scaling efficiency is 80% up to 16 devices and reaches 48% (MI250x) and 40% (A100) at 32 devices. Profiling with the ROCm System profiler shows that >90% of the wall time is spent in DeePMD inference, while MPI collectives contribute <10%, primarily since they act as a global synchronization point. The principal bottlenecks are the irreducible ghost-atom cost set by the cutoff radius, confirmed by a simple throughput model, and load imbalance across ranks. These results demonstrate that production MD with near ab initio fidelity is feasible at scale in GROMACS.

Keywords: Molecular Dynamics, GROMACS, DeePMD-kit, AI-driven interatomic potentials, Multi-GPU simulation, High-performance computing, Domain decomposition, Near-quantum accuracy
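
The scaling percentages quoted above are presumably computed with the conventional definitions of strong and weak scaling efficiency, which the sketch below spells out. The function names and the timings in the example are made up for illustration, and the abstract does not state which baseline device count the authors used.

```python
def strong_scaling_efficiency(t_base, t_n, n_devices, n_base=1):
    """Fixed total problem size: ideal runtime falls as n_base / n_devices."""
    return (t_base * n_base) / (t_n * n_devices)


def weak_scaling_efficiency(t_base, t_n):
    """Fixed problem size per device: ideal runtime stays constant."""
    return t_base / t_n


# Hypothetical timings in seconds per step, not measurements from the paper.
print(strong_scaling_efficiency(t_base=100.0, t_n=25.0, n_devices=8))
print(weak_scaling_efficiency(t_base=10.0, t_n=12.5))
```

Under these definitions, a strong-scaling efficiency of 40% at 32 devices corresponds to a speedup of about 12.8x over the baseline instead of the ideal 32x.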

40. ❌ Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations

Authors: Yuanhang Li Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

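Scoring rationale: The paper's core innovation is an end-to-end system in which an LLM intent compiler uses few-shot prompting to convert natural language into a constraint intermediate representation, with a verifier-feedback repair loop. This bears directly on LLM applications and in-context learning. Other keywords such as MoE, SFT, RAG, and RLHF are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

This paper addresses the problem of safely compiling high-level operator intents into low-level routing constraints for LEO mega-constellations. By combining a GNN router, an LLM intent compiler, and a deterministic validator, it builds an efficient routing system with zero constraint violations.

Abstract (translated)

Operating low-Earth-orbit (LEO) mega-constellations requires translating high-level operator intents (for example, "reroute financial traffic away from polar links under 80 ms") into low-level routing constraints, a task that calls for both natural language understanding and network-domain expertise. We present an end-to-end system with three core components: (1) a GNN cost-to-go router that distills Dijkstra-quality routing into a 152K-parameter graph attention network, achieving a 99.8% packet delivery ratio and a 17x inference speedup; (2) an LLM intent compiler that uses few-shot prompting with a verifier-feedback repair loop to convert natural language into a typed constraint intermediate representation, achieving a 98.4% compilation rate on a 240-intent benchmark (193 feasible, 47 infeasible) and an 87.6% full semantic match on feasible intents; and (3) an 8-pass deterministic validator with constructive feasibility certification that achieves 0% unsafe acceptance on all 47 infeasible intents (30 labeled plus 17 discovered by Pass 8), 100% corruption detection across 240 structural corruption tests, and 100% detection on 15 targeted adversarial attacks. End-to-end evaluation across four constrained routing scenarios confirms zero constraint violations with both routers. We further show that the apparent performance gaps in polar-avoidance scenarios stem mainly from topological reachability ceilings and not from routing quality, and that the LLM compiler outperforms a rule-based baseline by 46.2 percentage points on compositional intents. The system bridges the semantic gap between operator intent and network configuration while keeping the safety guarantees required for operational deployment.

Abstract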

Operating LEO mega-constellations requires translating high-level operator intents (“reroute financial traffic away from polar links under 80 ms”) into low-level routing constraints – a task that demands both natural language understanding and network-domain expertise. We present an end-to-end system comprising three components: (1) a GNN cost-to-go router that distills Dijkstra-quality routing into a 152K-parameter graph attention network achieving 99.8% packet delivery ratio with 17x inference speedup; (2) an LLM intent compiler that converts natural language to a typed constraint intermediate representation using few-shot prompting with a verifier-feedback repair loop, achieving 98.4% compilation rate and 87.6% full semantic match on feasible intents in a 240-intent benchmark (193 feasible, 47 infeasible); and (3) an 8-pass deterministic validator with constructive feasibility certification that achieves 0% unsafe acceptance on all 47 infeasible intents (30 labeled + 17 discovered by Pass 8), with 100% corruption detection across 240 structural corruption tests and 100% on 15 targeted adversarial attacks. End-to-end evaluation across four constrained routing scenarios confirms zero constraint violations with both routers. We further demonstrate that apparent performance gaps in polar-avoidance scenarios are largely explained by topological reachability ceilings rather than routing quality, and that the LLM compiler outperforms a rule-based baseline by 46.2 percentage points on compositional intents. Our system bridges the semantic gap between operator intent and network configuration while maintaining the safety guarantees required for operational deployment.

Keywords: LEO mega-constellations, constrained routing, LLM intent compiler, few-shot prompting, GNN router, natural language understanding, safety validation, network configuration

41. ❌ Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education

Authors: Hamayoon Behmanush, Freshta Akhtari, Ingmar Weber, Vikram Kamath Cannanure Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07253v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

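Scoring rationale: The paper studies generative AI (GenAI) as a learning companion for women, which is an application of large models in a specific domain, so it is moderately related to "Large Language Models OR LLMs OR Foundation Models" (5 points). However, it focuses on social application, participatory design, and safety rather than on innovations in the underlying model technology, so it is unrelated to the remaining technical keywords (e.g., MoE, scaling laws, training methods, inference optimization, agent systems) (0 points). The "AI for Science" keyword refers specifically to scientific domains (e.g., bioinformatics), whereas this paper is a social-science and education application, so it does not apply.

!!! tip deepseek-chat TL;DR

Through participatory design, the study explores how women in gender-restrictive settings such as Afghanistan see generative AI (GenAI) as a learning companion and source of career guidance. It proposes design directions centered on safety, accountability, and pedagogical alignment, and finds that the design process itself was associated with significant increases in participants' aspirations, perceived agency, and perceived avenues.

Abstract (Chinese translation)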

在性别限制与监控环境中，女性获取正规教育的机会常受限制，追求教育需承担安全与隐私风险。当女性被排除在学校和大学体系之外时，她们往往转向在线自学和生成式人工智能（GenAI）以实现教育与职业抱负。然而，在监控环境、家庭责任以及学习社区缺失的背景下，何种生成式人工智能支持才是安全且负责任的，我们知之甚少。本研究基于一项招募调查（n=140），对阿富汗20名女性开展了远程参与式设计研究，探讨参与者如何构想用于学习与就业能力提升的生成式人工智能。参与者描述称，她们较少将生成式人工智能视为信息源，而更多视其为一种随时可用的同伴、导师和职业指导来源，以弥补学习社区的缺失。与此同时，她们强调这种陪伴关系受到以下因素制约：隐私与监控风险、脱离实际情境且在文化上不安全的支持，以及可能通过制造进步假象而损害学习的直接答案式交互。除需求挖掘外，通过参与式设计展望生成式人工智能的未来，与参与者的抱负（p=0.01）、感知能动性（p=0.01）及感知途径（p=0.03）的显著提升呈正相关。这些结果表明，负责任且安全的生成式人工智能不仅关乎风险降低，更能积极赋能女性，帮助她们构想并追求可行的学习与就业前景。基于此，我们将参与者的建议转化为以问责为核心的设计方向：聚焦安全优先的交互与用户控制、资源受限情境下的本土化支持，以及提供符合教学原理的协助——这种协助旨在促进真实学习而非快速获取答案。

Abstract

In gender-restrictive and surveilled contexts, where access to formal education may be restricted for women, pursuing education involves safety and privacy risks. When women are excluded from schools and universities, they often turn to online self-learning and generative AI (GenAI) to pursue their educational and career aspirations. However, we know little about what safe and accountable GenAI support is required in the context of surveillance, household responsibilities, and the absence of learning communities. We present a remote participatory design study with 20 women in Afghanistan, informed by a recruitment survey (n = 140), examining how participants envision GenAI for learning and employability. Participants describe using GenAI less as an information source and more as an always-available peer, mentor, and source of career guidance that helps compensate for the absence of learning communities. At the same time, they emphasize that this companionship is constrained by privacy and surveillance risks, contextually unrealistic and culturally unsafe support, and direct-answer interactions that can undermine learning by creating an illusion of progress. Beyond eliciting requirements, envisioning the future with GenAI through participatory design was positively associated with significant increases in participants’ aspirations (p=.01), perceived agency (p=.01), and perceived avenues (p=.03). These outcomes show that accountable and safe GenAI is not only about harm reduction but can also actively enable women to imagine and pursue viable learning and employment futures. Building on this, we translate participants’ proposals into accountability-focused design directions that center on safety-first interaction and user control, context-grounded support under constrained resources, and offer pedagogically aligned assistance that supports genuine learning rather than quick answers.

Keywords: Generative AI, learning companion, participatory design, safety and accountability, women’s education, Afghanistan, privacy and surveillance, career guidance

42. ❌ $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

Authors: Kirill Brilliantov, Etienne Bamas, Emmanuel Abbé Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07240v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

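Scoring rationale: The paper studies the k-server conjecture, a problem in theoretical computer science, and develops a code-based benchmark for automated discovery. Although it mentions experiments with "agentic methods", the text does not address large models, deep learning, language models, or the principles behind AI techniques. All scoring keywords target large-model and deep-learning techniques, whereas the core of this paper is a mathematical conjecture and an algorithmic benchmark, so none of the keywords apply.

!!! tip deepseek-chat TL;DR

The paper proposes a benchmark for automated mathematical discovery built on the k-server conjecture. Experiments show that current agentic methods solve nontrivial instances but do not fully resolve the open problem, which makes the task a useful benchmark for developing code-based discovery agents.

Abstract (Chinese translation)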

我们提出一项基于代码的挑战，旨在推动自动化、开放式的数学发现，该挑战围绕$k$服务器猜想——竞争分析领域的一个核心未解问题展开。任务目标是发现一个满足大型图结构简单线性不等式系统的势函数。由此构建的评估流程是可靠但不完备的：任何被违反的不等式都能明确反驳候选函数，而满足所有不等式本身并不构成对应猜想特例的证明。尽管如此，一个通过所有约束的候选函数将成为有效证明的有力证据，且据我们所知，在开放的$k=4$圆形场景下，当前已知的势函数均无法在我们的框架下实现这一目标。因此，成功的候选函数本身即是对$k$服务器猜想的有趣贡献，若与完整证明结合，可能成为重要的理论成果。
在已解决的$k=3$场景中的实验表明，现有智能体方法能够解决非平凡实例；而在开放的$k=4$场景中，这些方法相对于现有势函数减少了违反约束的数量，但未能完全解决该任务。总体来看，这些结果说明该任务具有挑战性，但很可能处于当前方法可触及的范围内。
除了对$k$服务器研究领域的意义——所开发的工具使研究人员能够检验新假设并有望改进当前记录——该任务还可作为开发基于代码的发现智能体的有效基准测试。特别地，我们的$k=3$实验结果表明，该任务缓解了现有开放式代码基准测试的重要局限，包括早期饱和现象以及朴素随机基线方法与复杂方法之间区分度不足的问题。

Abstract

We introduce a code-based challenge for automated, open-ended mathematical discovery based on the $k$-server conjecture, a central open problem in competitive analysis. The task is to discover a potential function satisfying a large graph-structured system of simple linear inequalities. The resulting evaluation procedure is sound but incomplete: any violated inequality definitively refutes a candidate, whereas satisfying all inequalities does not by itself constitute a proof of the corresponding conjecture’s special case. Nevertheless, a candidate that passes all constraints would be strong evidence toward a valid proof and, to the best of our knowledge, no currently known potential achieves this under our formulation in the open $k=4$ circle case. As such, a successful candidate would already be an interesting contribution to the $k$-server conjecture, and could become a substantial theoretical result when paired with a full proof. Experiments on the resolved $k=3$ regime show that current agentic methods can solve nontrivial instances, and in the open $k=4$ regime they reduce the number of violations relative to existing potentials without fully resolving the task. Taken together, these results suggest that the task is challenging but plausibly within reach of current methods. Beyond its relevance to the $k$-server community, where the developed tooling enables researchers to test new hypotheses and potentially improve on the current record, the task also serves as a useful *benchmark* for developing code-based discovery agents. In particular, our $k=3$ results show that it mitigates important limitations of existing open-ended code-based benchmarks, including early saturation and the weak separation between naive random baselines and more sophisticated methods.

Keywords: k-server conjecture, automated mathematical discovery, potential function, competitive analysis, code-based benchmark, agentic methods, linear inequalities, theoretical computer science
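
The "sound but incomplete" evaluation described in the abstract can be illustrated with a toy checker. The inequality form used here, `phi[v] - phi[u] <= c` for each edge `(u, v, c)`, is a hypothetical stand-in chosen only because it is graph-structured and linear; the benchmark's actual inequalities, state space, and tooling are not given in this report.

```python
def violated_inequalities(phi, edges, tol=1e-9):
    """Return every edge (u, v, c) for which phi[v] - phi[u] <= c fails.

    phi:   dict mapping a state to the candidate potential value at that state.
    edges: iterable of (u, v, c) triples, one linear inequality per edge.
    """
    return [(u, v, c) for (u, v, c) in edges if phi[v] - phi[u] > c + tol]

# Hypothetical three-state instance.
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 3.0)]
candidate = {"a": 0.0, "b": 1.0, "c": 3.0}

bad = violated_inequalities(candidate, edges)
print(len(bad), bad)
```

A non-empty result refutes the candidate outright. An empty result only says that the candidate survives this finite check, which is why passing is evidence rather than proof. The number of violations also gives a graded score, matching the abstract's report that agentic methods reduce violations in the open case without eliminating them.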

43. ❌ How Much LLM Does a Self-Revising Agent Actually Need?

Authors: Seongwoo Jeong, Seonil Son Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07236v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

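Scoring rationale: The paper's core is a decomposition of the LLM's role in autonomous agents. Highly relevant keywords: LLM (the core object of study), Self-Reflection (it studies reflection mechanisms), LLM Agents (it studies agent architecture), and World Models (it studies world modeling). Moderately relevant keywords: Chain of Thought / System 2 Thinking (reasoning processes are involved), Multi-agent Systems (a collaborative game is used), and Explainable AI (agent behavior is made inspectable). The other keywords are unrelated to the paper's technical implementation, training methods, or application domain.

!!! tip deepseek-chat TL;DR

The paper studies how externalizing the reflection mechanism makes it possible to decompose the specific role of the LLM in an autonomous agent. It finds that explicit world-model planning improves performance substantially, whereas sparse LLM revision yields only small, non-monotonic changes.

Abstract (Chinese translation)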

近期基于大语言模型（LLM）的智能体通常将世界建模、规划与反思置于单一语言模型循环内。这种做法虽能产生有效行为，却使一个基础科学问题难以回答：智能体的能力究竟有多少真正源自大语言模型，又有多少源自其外部的显式结构？
我们并不试图给出普适性答案，而是通过使该问题具备实证可操作性来展开研究。我们提出一种声明式反思运行时协议，将智能体状态、置信度信号、防护性动作及假设性状态转移外化为可检视的运行时结构。我们在一个声明式运行时中实例化了该协议，并在噪声环境下的协作战舰游戏[4]中进行了评估，使用四种渐进结构化智能体完成了54局游戏（18个游戏板×3次随机种子）。
由此实现的分解隔离出四个组件：后验信念追踪、显式世界模型规划、符号化回合内反思以及基于大语言模型的稀疏修正。在此分解框架下，显式世界模型规划相比贪婪遵循后验的基线有显著提升（胜率+24.1个百分点，F1分数+0.017）。符号化反思作为一种真实的运行时机制运行——包含预测追踪、置信度门控及防护性修正动作——尽管其当前预设的修正规则在整体上尚未产生净正向收益。在约4.3%的决策回合中引入条件性大语言模型修正仅带来微小且非单调的变化：平均F1分数微升（+0.005），而胜率下降（54局中从31胜降至29胜）。
这些结果指向方法论层面的贡献而非性能排名主张：将反思过程外部化，使得原本隐性的智能体行为转化为可检视的运行时结构，从而能够直接研究大语言模型干预的边际作用。

Abstract

Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop. This can produce capable behavior, but it makes a basic scientific question difficult to answer: which part of the agent’s competence actually comes from the LLM, and which part comes from explicit structure around it? We study this question not by claiming a general answer, but by making it empirically tractable. We introduce a declared reflective runtime protocol that externalizes agent state, confidence signals, guarded actions, and hypothetical transitions into inspectable runtime structure. We instantiate this protocol in a declarative runtime and evaluate it on noisy Collaborative Battleship [4] using four progressively structured agents over 54 games (18 boards $\times$ 3 seeds). The resulting decomposition isolates four components: posterior belief tracking, explicit world-model planning, symbolic in-episode reflection, and sparse LLM-based revision. Across this decomposition, explicit world-model planning improves substantially over a greedy posterior-following baseline (+24.1pp win rate, +0.017 F1). Symbolic reflection operates as a real runtime mechanism – with prediction tracking, confidence gating, and guarded revision actions – even though its current revision presets are not yet net-positive in aggregate. Adding conditional LLM revision at about 4.3% of turns yields only a small and non-monotonic change: average F1 rises slightly (+0.005) while win rate drops (31$\rightarrow$29 out of 54). These results suggest a methodological contribution rather than a leaderboard claim: externalizing reflection turns otherwise latent agent behavior into inspectable runtime structure, allowing the marginal role of LLM intervention to be studied directly.

Keywords: LLM-based agents, self-revising agent, world modeling, reflection, runtime protocol, agent decomposition, marginal role of LLM, Collaborative Battleship
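
The figures quoted in the abstract can be put on a common scale with a few lines of arithmetic. The sketch below only re-derives percentages from numbers stated in the abstract (18 boards, 3 seeds, 31 and 29 wins, +24.1pp); the variable names are illustrative, and the per-agent win counts behind the +24.1pp figure are not given in this report.

```python
GAMES = 18 * 3  # 18 boards x 3 seeds

def win_rate_pct(wins, games=GAMES):
    """Win rate as a percentage of all games played."""
    return 100.0 * wins / games

before_llm = win_rate_pct(31)        # without conditional LLM revision
after_llm = win_rate_pct(29)         # with conditional LLM revision
delta_pp = after_llm - before_llm    # change in percentage points

# How many games a +24.1pp change corresponds to at this sample size.
games_for_planning_gain = 24.1 / 100.0 * GAMES

print(round(before_llm, 1), round(after_llm, 1), round(delta_pp, 1))
print(round(games_for_planning_gain, 2))
```

With only 54 games, one game is worth about 1.85 percentage points. The drop from 31 to 29 wins is therefore a two-game difference, while the planning gain corresponds to roughly 13 games, which is consistent with the abstract treating the former as a small change and the latter as substantial.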

44. ❌ Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence

Authors: Yushi Hirose, Akito Narahara, Takafumi Kanamori Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

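Scoring rationale: The paper focuses on mixture proportion estimation (MPE) and conditional independence testing in weakly supervised learning, which belongs to statistical methods for machine learning. It is unrelated to nearly all of the large-model and deep-learning keywords. It is only moderately related to "Domain Adaptation" (5 points), because MPE is a component of domain adaptation, but the paper itself does not involve large models or deep learning.

!!! tip deepseek-chat TL;DR

The paper proposes new conditional-independence assumptions for mixture proportion estimation that ensure identifiability even when the irreducibility assumption fails, and develops the corresponding estimators and weakly supervised kernel tests.

Abstract (Chinese translation)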

混合比例估计（MPE）旨在从无标注数据中估计类别先验。该任务是弱监督学习中的关键组成部分，例如PU学习、带标签噪声的学习以及领域自适应。现有的MPE方法依赖于可识别性所需的“不可约性”假设或其变体。本文中，我们提出了基于给定类别标签条件下条件独立性（CI）的新假设，这些假设即使在不可约性不成立时也能确保可识别性。我们在这些假设下开发了矩估计方法，并分析了其渐近性质。此外，我们提出了弱监督核检验方法来验证CI假设，这些方法在因果发现和公平性评估等应用中具有独立价值。通过实证研究，我们证明了所提估计器相较于现有方法的性能提升，并且我们的检验方法能成功控制I类错误和II类错误。

Abstract

Mixture proportion estimation (MPE) aims to estimate class priors from unlabeled data. This task is a critical component in weakly supervised learning, such as PU learning, learning with label noise, and domain adaptation. Existing MPE methods rely on the *irreducibility* assumption or its variant for identifiability. In this paper, we propose novel assumptions based on conditional independence (CI) given the class label, which ensure identifiability even when irreducibility does not hold. We develop method of moments estimators under these assumptions and analyze their asymptotic properties. Furthermore, we present weakly-supervised kernel tests to validate the CI assumptions, which are of independent interest in applications such as causal discovery and fairness evaluation. Empirically, we demonstrate the improved performance of our estimators compared with existing methods and that our tests successfully control both type I and type II errors.

Keywords: Mixture proportion estimation, Weakly supervised learning, Conditional independence, Method of moments, Kernel test, Identifiability, Domain adaptation, PU learning
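
For readers unfamiliar with method-of-moments estimation of a mixture weight, the toy below matches first moments only. It is deliberately not the paper's estimator: it assumes samples from both components are available, which the MPE setting usually does not grant, and it makes no use of conditional independence. The function name and the symbols `mixture`, `component_h`, and `component_g` are illustrative assumptions.

```python
from statistics import fmean

def mixture_proportion_mom(mixture, component_h, component_g):
    """First-moment estimate of kappa in F = kappa * H + (1 - kappa) * G.

    Uses E_F[x] = kappa * E_H[x] + (1 - kappa) * E_G[x] and solves for kappa.
    """
    mu_f, mu_h, mu_g = fmean(mixture), fmean(component_h), fmean(component_g)
    if mu_h == mu_g:
        raise ValueError("component means coincide; the first moment cannot identify kappa")
    kappa = (mu_f - mu_g) / (mu_h - mu_g)
    return min(1.0, max(0.0, kappa))  # clip to a valid proportion

# Tiny hand-made samples: mean(H) = 3, mean(G) = 1, mean(F) = 1.5.
kappa_hat = mixture_proportion_mom([1, 2, 1, 2], [2, 4], [0, 2])
print(kappa_hat)
```

The guard on equal means shows in miniature why identifiability conditions matter: when the chosen moments do not separate the components, no amount of data pins down the proportion. The abstract's contribution is a set of conditional-independence assumptions that supply such separation when irreducibility is unavailable.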

45. ❌ The ATOM Report: Measuring the Open Language Model Ecosystem

Authors: Nathan Lambert, Florian Brand Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07190v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

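Scoring rationale: The paper is a measurement report on the open language model ecosystem. It focuses on model adoption, download counts, market share, and performance metrics rather than on specific technical innovations or applications. It is therefore highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (10 points), since its core is an analysis of the LLM ecosystem. The other keywords concern specific technical principles, training methods, application domains, or optimization techniques that the paper does not explore in depth, so they are all scored 0.

!!! tip deepseek-chat TL;DR

By analyzing Hugging Face downloads, model derivatives, inference market share, and performance metrics, the study measures the open language model ecosystem and finds that Chinese models overtook U.S. models in the summer of 2025 and then widened the gap.

Abstract (Chinese translation)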

本文呈现了主流开源语言模型的采用概况及其研发主体，重点关注阿里巴巴的Qwen、DeepSeek、Meta的Llama等约1,500个核心开源模型，这些模型构成了对研究人员、创业者和政策顾问至关重要的生态系统基础。我们记录了一个显著趋势：中国模型在2025年夏季于采用度上超越美国同类模型，并随后进一步拉大了与西方模型的差距。本研究综合分析了Hugging Face平台下载量、模型衍生版本、推理市场份额、性能指标等多维度数据，以全面描绘该生态系统的现状。

Abstract

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba’s Qwen, DeepSeek, and Meta’s Llama that form the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend in which Chinese models overtook their counterparts built in the U.S. in the summer of 2025 and subsequently widened the gap over their Western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics, and more to build a comprehensive picture of the ecosystem.

Keywords: open language models, ecosystem measurement, Hugging Face downloads, inference market share, performance metrics, Chinese models, adoption snapshot, model derivatives

46. ❌ TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification

Authors: Rafi Ahamed, Sidratul Moon Nafsin, Md Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Abu Raihan Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07182v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

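Scoring rationale: The paper focuses on tea leaf disease classification with convolutional neural networks (CNNs), which falls under computer vision and agricultural AI. It does not involve any large language model (LLM) techniques, so the vast majority of keywords (e.g., LLMs, MoE, RLHF, RAG) are unrelated and scored 0. The only relevant keywords are "Mechanistic Interpretability OR Explainable AI" (the paper uses Grad-CAM and occlusion sensitivity analysis to improve interpretability) and "AI for Science OR Bioinformatics OR Cheminformatics" (the paper applies AI to agricultural science). Neither is the paper's core; they are supporting techniques or the application domain, so each receives 5 points (moderately related).

!!! tip deepseek-chat TL;DR

The study develops TeaLeafVision, an explainable deep learning framework for tea leaf disease classification. On the teaLeafBD dataset, the DenseNet201 model reaches 99% test accuracy, and Grad-CAM and adversarial training are used to improve interpretability and robustness.

Abstract (Chinese translation)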

作为仅次于水的世界第二大消费饮品，茶不仅是一种文化支柱，更是具有深远规模和影响力的全球经济力量。它远不止是一种饮料，更代表了自然、文化与人类对片刻沉思渴望之间的一种静默协商。因此，茶叶病害的精确识别与检测至关重要。基于此目标，我们在teaLeafBD数据集上评估了多种卷积神经网络（CNN）模型，其中DenseNet201、MobileNetV2和InceptionV3三种模型表现出显著性能。teaLeafBD数据集包含七个类别——六类病害叶片及一类健康叶片，其采集于多种田间条件，反映了实际应用中的挑战。在CNN模型中，DenseNet201取得了最高的测试准确率，达到99%。为增强模型的可靠性与可解释性，我们采用了梯度加权类激活映射（Grad CAM）、遮挡敏感性分析及对抗训练技术，以提高模型的抗噪能力。最后，我们开发了一个原型系统，旨在将模型能力应用于现实农业场景。本文阐述了深度学习模型在实际茶叶病害检测与管理中进行分类的潜力。

Abstract

As the world’s second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. The precise identification and detection of tea leaf disease is therefore crucial. With this goal, we evaluated several Convolutional Neural Network (CNN) models on the teaLeafBD dataset, among which three show notable performance: DenseNet201, MobileNetV2, and InceptionV3. The teaLeafBD dataset contains seven classes (six disease classes and one healthy class) collected under various field conditions that reflect real-world challenges. Among the CNN models, DenseNet201 achieved the highest test accuracy of 99%. To enhance model reliability and interpretability, we implemented Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, along with adversarial training to increase the noise resistance of the model. Finally, we developed a prototype to apply the model’s capabilities to real-life agriculture. This paper illustrates the capability of deep learning models to classify diseases in real-life tea leaf disease detection and management.

Keywords: Tea leaf disease classification, Deep learning, Convolutional Neural Networks, DenseNet201, Explainable AI, Grad-CAM, Adversarial training, Agricultural AI

47. ❌ Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

Authors: Yu Li, Sizhe Tang, Tian Lan Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07165v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

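Scoring rationale: The paper's core is reinforcement learning optimization of LLM agents on multi-step reasoning tasks, which is highly relevant (10 points) to the keywords LLM, RLHF, CoT reasoning, System 2 thinking, self-correction, and LLM agents. The paper does not cover MoE, SLMs, pre-training, quantization, AI for science, or the other technical directions, so those keywords receive 0 points.

!!! tip deepseek-chat TL;DR

To address inefficient training caused by sparse rewards when LLM agents tackle multi-step reasoning tasks, the paper proposes the T-STAR framework. It improves policy-gradient optimization by building a Cognitive Tree, back-propagating trajectory rewards through the tree, and grafting thoughts across branches, and it reports consistent improvements over strong baselines on multiple benchmarks.

Abstract (Chinese translation)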

在多步推理任务中，强化学习对于大型语言模型智能体的训练常因奖励稀疏而受阻。现有方法如组相对策略优化将采样轨迹视为独立链，为每条链中的所有步骤分配均匀的贡献度，忽视了可能对推理结果产生不成比例影响的关键步骤的存在。本文提出T-STAR（Tree-structured Self-Taught Agent Rectification，树结构自学式智能体校正）框架，该框架能够从看似独立的轨迹中恢复潜在的关联奖励结构。具体而言，我们通过识别并合并功能相似的步骤/节点，将多条轨迹整合为统一的认知树。这一结构支持内省评估机制，该机制将轨迹级奖励沿树结构反向传播，从而在步骤层面获得一种方差缩减的相对优势新度量。利用认知树，我们还开发了上下文内思维嫁接技术，通过对比关键分歧点/步骤上的成功与失败分支来合成纠正性推理。随后，我们提出的精准策略优化通过一种基于Bradley-Terry模型的精准损失函数，充分利用集中在这些关键点/步骤的丰富策略梯度信息。在具身、交互、推理和规划等多类基准测试上的广泛实验表明，T-STAR相较于强基线模型实现了一致的性能提升，在需要长链推理的任务上增益尤为显著。

Abstract

Reinforcement learning for Large Language Model agents is often hindered by sparse rewards in multi-step reasoning tasks. Existing approaches like Group Relative Policy Optimization treat sampled trajectories as independent chains, assigning uniform credit to all steps in each chain and ignoring the existence of critical steps that may disproportionately affect the reasoning outcome. In this paper, we propose T-STAR (Tree-structured Self-Taught Agent Rectification), a framework that recovers the latent correlated reward structure across seemingly independent trajectories. Specifically, we consolidate trajectories into a unified Cognitive Tree by identifying and merging functionally similar steps/nodes. It enables an Introspective Valuation mechanism that back-propagates trajectory-level rewards through the tree to obtain a new notion of variance-reduced relative advantage at step-level. Using the Cognitive Tree, we also develop In-Context Thought Grafting to synthesize corrective reasoning by contrasting successful and failed branches at critical divergence points/steps. Our proposed Surgical Policy Optimization then capitalizes on the rich policy gradient information concentrated at these critical points/steps through a Bradley-Terry type of surgical loss. Extensive experiments across embodied, interactive, reasoning, and planning benchmarks demonstrate that T-STAR achieves consistent improvements over strong baselines, with gains most pronounced on tasks requiring extended reasoning chains.

Keywords: Large Language Model Agents, Multi-step Reasoning, Reinforcement Learning, Cognitive Tree, Self-Rectification, Policy Optimization, Sparse Rewards, Trajectory Analysis
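
The contrast between uniform per-chain credit and tree-based step credit can be sketched in a few lines. This is a simplified illustration, not T-STAR itself: steps are merged only when their prefixes match exactly (the paper merges functionally similar steps), a node's value is taken to be the mean reward of the trajectories passing through it, and the function names are assumptions. The last function is the generic Bradley-Terry negative log-likelihood, not the paper's specific surgical loss.

```python
import math
from collections import defaultdict

def tree_values(trajectories):
    """Map each step prefix to the mean reward of the trajectories through it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for steps, reward in trajectories:
        for i in range(len(steps) + 1):
            prefix = tuple(steps[:i])
            totals[prefix] += reward
            counts[prefix] += 1
    return {p: totals[p] / counts[p] for p in totals}

def step_advantages(steps, values):
    """Advantage of step i = value after taking it minus value before it."""
    return [values[tuple(steps[:i + 1])] - values[tuple(steps[:i])]
            for i in range(len(steps))]

def bradley_terry_loss(score_preferred, score_rejected):
    """-log sigmoid(score_preferred - score_rejected)."""
    return math.log1p(math.exp(-(score_preferred - score_rejected)))

# Three hypothetical sampled trajectories with sparse terminal rewards.
samples = [
    (["plan", "search", "answer"], 1.0),
    (["plan", "guess", "answer"], 0.0),
    (["plan", "search", "verify"], 1.0),
]
values = tree_values(samples)
good = step_advantages(["plan", "search", "answer"], values)
bad = step_advantages(["plan", "guess", "answer"], values)
print(good, bad, bradley_terry_loss(good[1], bad[1]))
```

In this toy run all of the credit lands on the step where the branches diverge (`search` versus `guess`), while the shared `plan` step and the final steps get zero advantage. Chain-level credit would instead spread each trajectory's reward evenly over its three steps.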

48. ❌ Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

Authors: Kartikay Tehlan, Lukas Förner, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07180v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

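Scoring rationale: The paper proposes a geometric framework based on energy modelling for longitudinal multiparametric MRI analysis, which places it in medical image analysis. It uses techniques such as implicit neural representations and denoising score matching, and its core content relates to deep learning applied to science, biomedical AI in particular. Of all the keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection to the paper's topic (medical image analysis is an application area of bioinformatics and AI for science), so it receives 5 points. The other keywords (LLMs, MoE, RLHF, RAG, and so on) are entirely unrelated: the paper does not touch on large models, language models, alignment, reasoning, agents, or compression, so they score 0. The weighted total is computed as 5.0 (AI for Science relevance) × 1.0 (weight) = 5.0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a geometric framework based on patient-specific energy modelling for longitudinal multiparametric MRI analysis. It learns an energy function through an implicit neural representation to characterise tissue states, and shows in a paediatric case that the method can track tissue changes before disease recurrence, offering a new approach to tissue-at-risk tracking in neuro-oncology.

Abstract (Chinese translation)

我们提出一种基于序列空间中患者特异性能量建模的纵向多参数磁共振成像分析几何框架。该方法不采用空间网络对图像进行操作，而是将每个体素表示为其多序列强度向量（$T1$、$T1c$、$T2$、FLAIR、ADC），并通过去噪分数匹配训练一个紧凑的隐式神经表示，从单次基线扫描中学习定义在$\mathbb{R}^d$上的能量函数$E_θ(\mathbf{u})$。习得的能量景观为组织状态提供了无需分割标签的微分几何描述：局部极小值定义了组织盆地，梯度幅值反映了与状态边界的接近程度，拉普拉斯曲率则表征了局部约束结构。重要的是，该基线能量流形被视为固定的几何参考系：它编码了诊断时观察到的对比度组合集合，在随访中不再重新训练。因此，纵向评估被表述为后续扫描相对于该基线几何结构的评价。我们不再比较解剖分割结果，而是分析磁共振序列向量分布在基线能量函数下的演化规律。在一例后期复发的儿科病例中，随访扫描显示在放射学明确再现前，序列空间中的能量已出现渐进性偏离，并朝基线肿瘤相关状态发生定向位移。在一例疾病稳定的病例中，体素分布始终局限于已建立的低能盆地内，未出现系统性漂移。这些案例初步验证了患者特异性能量流形可作为纵向多参数磁共振成像分析的几何参考系，无需显式分割或有监督分类，为神经肿瘤学中基于流形的风险组织追踪研究奠定了基础。

Abstract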

We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_θ(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.

Keywords: energy-based modeling, longitudinal MRI analysis, implicit neural representation, denoising score matching, tissue manifolds, neuro-oncology, multi-parametric MRI, geometric framework
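
For readers unfamiliar with denoising score matching, the standard single-noise-level objective can be written for an energy model as below. This is the textbook form, given here only as a reference point: the abstract does not state the exact loss, noise schedule, or weighting the authors use, and the noise scale $\sigma$ is an assumed symbol that does not appear in the source.

$$
\mathcal{L}_{\mathrm{DSM}}(\theta) = \mathbb{E}_{\mathbf{u},\ \boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)} \left\| \nabla_{\tilde{\mathbf{u}}} E_\theta(\tilde{\mathbf{u}}) - \frac{\tilde{\mathbf{u}} - \mathbf{u}}{\sigma^2} \right\|^2, \qquad \tilde{\mathbf{u}} = \mathbf{u} + \sigma \boldsymbol{\varepsilon}
$$

Here $\mathbf{u}$ is a voxel's multi-sequence intensity vector and $\tilde{\mathbf{u}}$ its noise-perturbed copy. Minimising the loss makes the energy gradient point from the clean vector toward the perturbed one, so energy rises away from the observed contrast combinations and the observed data settle into low-energy basins.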

49. ❌ Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations

Authors: Sonja Adomeit, Kartikay Tehlan, Lukas Förner, Katharina Weisser, Helen Scholtiseek, David Kaufmann, Julie Steinestel, Constantin Lapa, Thomas Kröncke, Thomas Wendler Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07154v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

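Scoring rationale: The paper focuses on medical image analysis (MRI and PET fusion). It proposes a multimodal fusion framework based on orthogonal subspace decomposition and uses techniques such as implicit neural representations (INR). Its content is entirely unrelated to the vast majority of the keywords (large models, deep learning fundamentals, training methods, inference optimisation, agents, and so on), which mainly target natural language processing, large language models, and related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the study is an application of AI in the biomedical and scientific domain (specifically medical image analysis). It therefore receives 5 points (some relevance), although this is not the paper's core contribution, which is a medical image fusion method.

!!! tip deepseek-chat TL;DR

The study proposes an orthogonal subspace decomposition framework for separating shared from modality-specific information in multimodal medical imaging (MRI and PET). The results indicate that PSMA PET contains signal components that cannot be recovered from MRI-derived physiological descriptors, giving a structured characterisation of modality complementarity grounded in representation geometry.

Abstract (Chinese translation)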

多模态成像分析通常依赖于联合潜在表征，但这些方法很少明确定义哪些信息是共享的、哪些是模态特有的。澄清这种区分具有临床相关性，因为它界定了每种模态不可替代的贡献，并为合理的采集策略提供依据。我们提出了一种子空间分解框架，将多模态融合重新定义为正交子空间分离问题，而非模态转换问题。我们将前列腺特异性膜抗原（PSMA）PET摄取分解为可由磁共振成像（MRI）解释的生理包络，以及一个反映在MRI特征流形中无法表达的信号成分的正交残差。利用多参数MRI，我们训练了一个基于强度、非空间性的隐式神经表征（INR），将MRI特征向量映射到PET摄取。我们引入了一种基于奇异值分解的投影正则化方法，以惩罚位于MRI特征流形张成空间内的残差成分。这强制实现了组织水平生理特性（结构、扩散、灌注）与细胞内PSMA表达之间的数学正交性。在13名前列腺癌患者数据上的测试表明，模型能够将由MRI特征张成的残差成分吸收到学习到的包络中，而正交残差在肿瘤区域最大。这表明PSMA PET包含无法从MRI衍生的生理描述符中恢复的信号成分。由此产生的分解提供了一种基于表征几何而非图像转换的多模态互补性结构化表征方法。

Abstract

Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality-specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity-based, non-spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection-based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue-level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.

Keywords: multimodal imaging, MRI-PET fusion, orthogonal subspace decomposition, implicit neural representation, prostate cancer, PSMA PET, modality complementarity, representation geometry
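
The projection-based regularisation mentioned in the abstract can be pictured with a small NumPy sketch: take an orthonormal basis for the span of the MRI features from an SVD, project the residual onto it, and penalise the squared norm of that projection. This is one plausible reading under stated assumptions, not the authors' regulariser. The function name `in_span_penalty`, the array layout, and the rank cutoff are all hypothetical.

```python
import numpy as np


def in_span_penalty(features, residual, rank=None):
    """Squared norm of the part of `residual` inside the column span of `features`.

    features: (n_voxels, n_features) array of MRI-derived features (assumed layout).
    residual: (n_voxels,) array, e.g. PET uptake minus the predicted envelope.
    """
    # Left singular vectors give an orthonormal basis for the column span.
    u, s, _ = np.linalg.svd(features, full_matrices=False)
    if rank is None:
        # Keep only directions with non-negligible singular values (assumed cutoff).
        rank = int(np.sum(s > s.max() * 1e-10))
    basis = u[:, :rank]
    coeffs = basis.T @ residual  # coordinates of the residual in the span
    return float(coeffs @ coeffs)


features = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
inside = in_span_penalty(features, np.array([2.0, 3.0, 0.0]))
outside = in_span_penalty(features, np.array([0.0, 0.0, 5.0]))
```

A residual that the features can express is penalised in full (`inside` equals its squared norm, 13), while a residual orthogonal to the features is not penalised (`outside` is 0). Driving this penalty to zero during training leaves only the orthogonal part in the residual.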

50. ❌ Dynamic Context Evolution for Scalable Synthetic Data Generation

Authors: Ryan Lingo, Rajeev Chhajer Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07147v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

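Scoring rationale: The paper's core topic is the diversity of LLM output in batch generation. It proposes the Dynamic Context Evolution framework, which involves LLM self-assessment (verbalized tail sampling) and a semantic memory mechanism. It is highly relevant to 'Large Language Models' (10 points) and has some relevance to 'Self-Correction OR Self-Improvement OR Self-Reflection' (8 points), because the method has the model assess its own ideas and filter out the obvious ones. Other keywords such as MoE, SFT, and RAG are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses cross-batch mode collapse, which arises when a large language model is prompted independently across batches, and proposes the Dynamic Context Evolution framework. By combining self-assessment filtering, semantic memory, and adaptive prompt evolution, it reduces the collapse rate from 5.6 ± 2.0% under naive prompting to 0.0 ± 0.0% and produces a consistently richer set of concept clusters.

Abstract (Chinese translation)

大语言模型在多个批次中独立接收提示时会产生重复输出，这种现象我们称为跨批次模式坍缩：当语言模型在无法获取先前生成内容的情况下被重复提示时，其输出多样性逐渐丧失。实践者长期依赖临时去重和种子轮换策略来缓解此问题，但缺乏系统性的理论框架。我们提出动态上下文演化（DCE）框架，包含三种机制：（1）言语化尾部采样（模型通过自我评估为每个想法标注其明显程度估计值，并丢弃明显想法），该机制通过模型自评估过滤高概率候选；（2）语义记忆，通过维护持久化的嵌入索引来拒绝跨批次的近似重复内容；（3）自适应提示演化，基于记忆状态和轮换多样性策略在每批次重构生成提示。在三个领域（可持续包装概念、教育考试题目和创意写作提示）及两个模型系列（gpt-5-mini与claude-haiku-4-5）的实验中，每种方法采用2-3个随机种子的组件消融研究表明：DCE实现了0.0 +/- 0.0%的坍缩率，而朴素提示法为5.6 +/- 2.0%；同时每个种子产生17-18个HDBSCAN聚类，而朴素方法波动于2-17个，表明DCE能稳定生成更丰富的概念结构。这些结果通过独立嵌入模型（all-MiniLM-L6-v2）验证，且在VTS阈值τ和去重阈值δ的敏感性扫描中保持稳定。去重和提示演化单独使用效果不足但联合应用显著有效，仅需标准API调用即可实现每千个候选项约0.50美元的成本，无需微调或定制架构。

Abstract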

Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive’s volatile 2-17, indicating reliably richer conceptual structure. These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.

Keywords: Large Language Models, synthetic data generation, cross-batch mode collapse, Dynamic Context Evolution, verbalized tail sampling, semantic memory, adaptive prompt evolution, output diversity
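
The semantic memory mechanism in the abstract (a persistent embedding index that rejects near-duplicates across batches) can be sketched as follows. This is a minimal illustration under assumptions, not the DCE implementation: the class name `SemanticMemory`, the use of cosine similarity, and the threshold value 0.9 are all hypothetical, and the embeddings would in practice come from a sentence-embedding model rather than hand-written vectors.

```python
import numpy as np


class SemanticMemory:
    """Stores unit-length embeddings and rejects near-duplicates."""

    def __init__(self, delta=0.9):
        self.delta = delta   # cosine-similarity dedup threshold (illustrative value)
        self.vectors = []    # accepted embeddings, each normalised to unit length

    def add_if_novel(self, embedding):
        """Store the embedding and return True unless it is a near-duplicate.

        Assumes a non-zero embedding vector.
        """
        v = np.asarray(embedding, dtype=float)
        v = v / np.linalg.norm(v)
        if self.vectors and max(float(v @ m) for m in self.vectors) >= self.delta:
            return False     # too similar to an idea accepted in an earlier batch
        self.vectors.append(v)
        return True


memory = SemanticMemory(delta=0.9)
first = memory.add_if_novel([1.0, 0.0])          # accepted: memory is empty
near_copy = memory.add_if_novel([0.99, 0.05])    # rejected: almost parallel to the first
different = memory.add_if_novel([0.0, 1.0])      # accepted: orthogonal to the first
```

Because the memory persists across batches, a candidate generated in batch 50 is compared against everything accepted since batch 1, which is what independent per-batch prompting lacks. A linear scan is used here for brevity; a larger index would call for an approximate nearest-neighbour structure.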

51. ❌ Energy Saving for Cell-Free Massive MIMO Networks: A Multi-Agent Deep Reinforcement Learning Approach

Authors: Qichen Wang, Keyu Li, Ozan Alp Topal, Özlem Tugfe Demir, Mustafa Ozger, Cicek Cavdar Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

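Scoring rationale: The paper studies energy saving in wireless communication networks (cell-free massive MIMO) using multi-agent deep reinforcement learning (MADRL). The keywords all concern large models, deep learning fundamentals, or AI for Science, and the paper does not involve large language models, deep learning fundamentals (MoE, scaling laws, attention mechanisms, fine-tuning methods, and so on), or scientific AI applications such as bioinformatics. The only relevant keyword is "Multi-agent Systems OR Agent Coordination": the paper explicitly uses a multi-agent system (multi-agent deep reinforcement learning) as its core method, so it receives 10 points. All other keywords are unrelated to the paper's topic and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-agent deep reinforcement learning algorithm for energy-saving control in cell-free massive MIMO networks under dynamic traffic. It reduces power consumption by 56.23% relative to a system with no energy-saving scheme, with only a slight increase in drop ratio.

Abstract (Chinese translation)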

本文针对动态流量条件下无蜂窝大规模多输入多输出（CF mMIMO）网络下行链路运行中的节能问题展开研究。我们提出了一种多智能体深度强化学习（MADRL）算法，使每个接入点（AP）能够自主控制天线重配置与高级休眠模式（ASM）选择。经过训练后，所提框架以完全分布式方式运行，无需集中控制，并允许每个AP根据实时流量波动进行动态调整。仿真结果表明，与未采用任何节能方案的系统相比，所提算法可降低56.23%的功耗（PC）；相较于仅采用最浅休眠模式的非学习机制，功耗降低30.12%，而掉线率仅轻微上升。此外，与广泛使用的深度Q网络（DQN）算法相比，该算法在达到相近功耗水平的同时，实现了显著更低的掉线率。

Abstract

This paper focuses on energy savings in downlink operation of cell-free massive MIMO (CF mMIMO) networks under dynamic traffic conditions. We propose a multi-agent deep reinforcement learning (MADRL) algorithm that enables each access point (AP) to autonomously control antenna re-configuration and advanced sleep mode (ASM) selection. After the training process, the proposed framework operates in a fully distributed manner, eliminating the need for centralized control and allowing each AP to dynamically adjust to real-time traffic fluctuations. Simulation results show that the proposed algorithm reduces power consumption (PC) by 56.23% compared to systems without any energy-saving scheme and by 30.12% relative to a non-learning mechanism that only utilizes the lightest sleep mode, with only a slight increase in drop ratio. Moreover, compared to the widely used deep Q-network (DQN) algorithm, it achieves a similar PC level but with a significantly lower drop ratio.

Keywords: cell-free massive MIMO, energy saving, multi-agent deep reinforcement learning, distributed control, antenna re-configuration, advanced sleep mode, power consumption reduction, dynamic traffic conditions

52. ❌ CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research

Authors: Carlos Caetano, Camila Laranjeira, Clara Ernesto, Artur Barros, João Macedo, Leo S. F. Ribeiro, Jefersson A. dos Santos, Sandra Avila Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07132v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

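Scoring rationale: The paper belongs to computer vision and studies the classification of Child Sexual Abuse Imagery (CSAI). It proposes a privacy-preserving graph-structured dataset (CSA-Graphs) that uses scene graphs and skeleton graphs as structured representations of images. Its content is entirely unrelated to the scoring keywords, all of which revolve around large models, deep learning fundamentals, and their applications. It does not involve large models, language models, training methods, reasoning techniques, agent systems, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper addresses the fact that datasets for CSAI classification cannot be shared publicly by proposing CSA-Graphs, a privacy-preserving graph-structured dataset. Scene graphs and skeleton graphs preserve contextual information, and experiments show that both representations retain useful information for the CSAI classification task.

Abstract (Chinese translation)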

儿童性虐待影像（Child Sexual Abuse Imagery，CSAI）分类是计算机视觉研究中一个重要但具有挑战性的课题，由于严格的法律和伦理限制，CSAI数据集无法公开共享。这一限制阻碍了研究的可复现性，并延缓了自动化方法的发展进程。本研究提出了CSA-Graphs——一个保护隐私的结构化数据集。我们并非提供原始图像，而是发布移除了显性视觉内容但保留上下文信息的结构化表征。CSA-Graphs包含两种互补的基于图的数据模态：描述物体关系的场景图（scene graphs）和编码人体姿态的骨架图（skeleton graphs）。实验表明，这两种表征均保留了用于CSAI分类的有效信息，且融合二者能进一步提升分类性能。该数据集在遵守法律与伦理约束的前提下，为儿童安全领域的计算机视觉方法研究提供了更广阔的平台。

Abstract

Child Sexual Abuse Imagery (CSAI) classification is an important yet challenging problem for computer vision research due to the strict legal and ethical restrictions that prevent the public sharing of CSAI datasets. This limitation hinders reproducibility and slows progress in developing automated methods. In this work, we introduce CSA-Graphs, a privacy-preserving structural dataset. Instead of releasing the original images, we provide structural representations that remove explicit visual content while preserving contextual information. CSA-Graphs includes two complementary graph-based modalities: scene graphs describing object relationships and skeleton graphs encoding human pose. Experiments show that both representations retain useful information for classifying CSAI, and that combining them further improves performance. This dataset enables broader research on computer vision methods for child safety while respecting legal and ethical constraints.

Keywords: Child Sexual Abuse Imagery, privacy-preserving dataset, structural representations, scene graphs, skeleton graphs, computer vision, classification, ethical constraints

Authors: Diyi Liu, Zihan Niu, Tu Xu, Lishan Sun Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07126v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

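Scoring rationale: The paper focuses on multimodal vehicle trajectory prediction with a pure Transformer architecture, a deep learning application in computer vision and autonomous driving. The scoring keywords all revolve around large language models (LLMs) and related techniques (training methods, inference optimisation, alignment, applications, and so on), whereas this paper studies a Transformer-based trajectory prediction model and does not involve language models, LLM fundamentals, or LLM applications in science. It is therefore unrelated to all of the keywords.

!!! tip deepseek-chat TL;DR

The study proposes a pure Transformer-based multimodal network that predicts vehicle trajectories and intentions through a two-track design separating the spatial module from the trajectory generation module. The design improves prediction performance and lets the model learn an ordered group of trajectories.

Abstract (Chinese translation)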

车辆轨迹预测在自动驾驶与智能交通系统应用中具有重要作用。尽管已有多种深度学习算法被设计用于预测车辆轨迹，但其对特定图结构（如Graph Neural Network）或显式意图标注的依赖限制了算法的灵活性。本研究提出一种纯基于Transformer的多模态网络模型，该模型综合考虑了邻近车辆的影响。模型采用双分支并行架构：一个分支专注于轨迹预测，另一分支则在考虑邻近车辆的情况下预测各类驾驶意图的可能性。研究发现，通过将空间关系模块与轨迹生成模块分离，双分支设计能够有效提升模型性能。此外，模型能够通过预测K条轨迹间的残差偏移量，学习到一组有序的轨迹集合。

Abstract

Predicting vehicle trajectories plays an important role in autonomous driving and ITS applications. Although multiple deep learning algorithms have been devised to predict vehicle trajectories, their reliance on a specific graph structure (e.g., Graph Neural Network) or explicit intention labeling limits their flexibility. In this study, we propose a pure Transformer-based multimodal network that considers neighboring vehicles. Two separate tracks are employed. One track focuses on predicting the trajectories while the other focuses on predicting the likelihood of each intention considering neighboring vehicles. The study finds that the two-track design can increase performance by separating the spatial module from the trajectory-generating module. Also, we find that the model can learn an ordered group of trajectories by predicting residual offsets among K trajectories.

Keywords: vehicle trajectory prediction, Transformer, multi-modal, intention prediction, autonomous driving, deep learning, spatial module, residual offsets

54. ❌ Mixed-Initiative Context: Structuring and Managing Context for Human-AI Collaboration

Authors: Haichang Li, Qinshi Zhang, Piaohong Wang, Zhicong Lu Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

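Scoring rationale: The paper studies context management in human-AI collaboration and proposes the concept of Mixed-Initiative Context, which restructures the context formed over multi-turn interactions into a structured, manipulable object. Relevance to the keywords: 1) moderate relevance to "Large Language Models" and related terms (5 points), because the paper concerns AI collaboration but does not explicitly specify LLMs; 2) high relevance to "Context Window Extension" (8 points), because the paper directly discusses managing content within the context window; 3) high relevance to "LLM Agents" (8 points), because the paper studies autonomous agent workflows in human-AI collaboration; 4) high relevance to "In-context Learning" (8 points), because the paper centres on structuring context and learning from it. The remaining keywords are unrelated to the paper's technical details and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the problem that context formed over multi-turn interactions in human-AI collaboration is flattened and lacks dynamic management. It proposes the concept of Mixed-Initiative Context, which restructures context into a structured, manipulable object so that humans and AI jointly take part in building and regulating it, and it implements the Contextify probe system and a user study to examine the concept.

Abstract (Chinese translation)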

在人机协作领域，通过多轮交互自然形成的上下文通常被扁平化为按时间顺序排列的序列，并在后续推理中被视为固定整体，缺乏沿协作流程进行动态组织与管理的机制。然而，这些上下文在生命周期、结构层次和关联性上存在显著差异。例如，临时或已弃用的对话内容以及并行话题线程会持续占据有限的上下文窗口，造成干扰甚至冲突。同时，用户大多只能通过修改输入（如更正、引用或忽略）间接影响上下文，导致其控制既不明确也无法验证。
为此，我们提出“混合主动上下文”这一概念，将多轮交互中形成的上下文重新定义为一种显式、结构化且可操作的交互对象。在此概念下，上下文的结构、范围和内容可根据任务需求动态组织和调整，使人类与人工智能都能主动参与上下文的构建与调控。为探索这一概念，我们实现了名为Contextify的探测系统，并通过用户研究考察了用户的上下文管理行为、对AI主动性的态度以及整体协作体验。最后，我们讨论了这一概念对人机交互社区的启示。

Abstract

In the human-AI collaboration area, the context formed naturally through multi-turn interactions is typically flattened into a chronological sequence and treated as a fixed whole in subsequent reasoning, with no mechanism for dynamic organization and management along the collaboration workflow. Yet these contexts differ substantially in lifecycle, structural hierarchy, and relevance. For instance, temporary or abandoned exchanges and parallel topic threads persist in the limited context window, causing interference and even conflict. Meanwhile, users are largely limited to influencing context indirectly through input modifications (e.g., corrections, references, or ignoring), leaving their control neither explicit nor verifiable. To address this, we propose Mixed-Initiative Context, which reconceptualizes the context formed across multi-turn interactions as an explicit, structured, and manipulable interactive object. Under this concept, the structure, scope, and content of context can be dynamically organized and adjusted according to task needs, enabling both humans and AI to actively participate in context construction and regulation. To explore this concept, we implement Contextify as a probe system and conduct a user study examining users’ context management behaviors, attitudes toward AI initiative, and overall collaboration experience. We conclude by discussing the implications of this concept for the HCI community.

Keywords: Human-AI collaboration, Context management, Mixed-initiative interaction, Multi-turn interactions, Context window, Structured context, Dynamic organization, Collaboration workflow

55. ❌ Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment

Authors: Parampuneet Kaur Thind, Charles Mwangi, Giovanni Varetto, Lorenzo Sarti, Andrea Papa, Andrea Taramelli Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07120v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文研究地球观测系统的星上处理架构，聚焦于遥感数据处理、系统架构优化和应急响应服务，完全不涉及大语言模型、深度学习技术原理或AI在科学领域的应用。所有关键词均与大模型技术、深度学习创新或AI科学应用相关，而本文属于纯粹的遥感工程和地球观测系统研究领域，因此所有关键词相关度均为0。

!!! tip deepseek-chat TL;DR

本文评估了IRIDE地球观测项目中星上处理架构相对于传统地面处理的价值，通过燃烧区域制图案例研究表明星上智能处理能够提供更高空间分辨率、更小事件检测能力和更快的系统响应时间。

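Abstract (translation)

Current operational Earth observation (EO) services, including the Copernicus Emergency Management Service (CEMS), the European Forest Fire Information System (EFFIS), and the Copernicus Land Monitoring Service (CLMS), rely mainly on ground-based processing pipelines. Although these systems provide mature, large-scale information products, they remain constrained by downlink latency, bandwidth limits, and a limited ability to prioritise observations autonomously. The International Report for an Innovative Defence of Earth (IRIDE) programme is a national Earth observation initiative led by the Italian government that aims to support public authorities with timely, objective information from space-based data. IRIDE is designed not as a single constellation but as a "constellation of constellations" that integrates heterogeneous sensing technologies within a unified service-oriented architecture. Within this framework, Hawk for Earth Observation (HEO) generates data products on board, so that information can be extracted earlier in the processing chain. This paper discusses the limitations of ground-only architectures and assesses the added value of onboard processing at the operational service level. Using the IRIDE burnt-area mapping service as a representative case study, it shows how onboard intelligence can support finer spatial detail (sub-three-metre ground sampling distance), smaller detectable events (a minimum mapping unit of three hectares), and improved system responsiveness. The IRIDE HEO capability is not intended to replace existing Copernicus services; it is positioned as a complementary layer that provides image-driven pre-classification to support downstream emergency and land-management workflows. The study highlights the operational value of onboard intelligence for emerging low-latency EO service architectures.

Abstract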

Current operational Earth Observation (EO) services, including the Copernicus Emergency Management Service (CEMS), the European Forest Fire Information System (EFFIS), and the Copernicus Land Monitoring Service (CLMS), rely primarily on ground-based processing pipelines. While these systems provide mature large-scale information products, they remain constrained by downlink latency, bandwidth limitations, and limited capability for autonomous observation prioritisation. The International Report for an Innovative Defence of Earth (IRIDE) programme is a national Earth observation initiative led by the Italian government to support public authorities through timely, objective information derived from spaceborne data. Rather than a single constellation, IRIDE is designed as a constellation of constellations, integrating heterogeneous sensing technologies within a unified service-oriented architecture. Within this framework, Hawk for Earth Observation (HEO) enables onboard generation of data products, allowing information extraction earlier in the processing chain. This paper examines the limitations of ground-only architectures and evaluates the added value of onboard processing at the operational service level. The IRIDE burnt-area mapping service is used as a representative case study to demonstrate how onboard intelligence can support higher spatial detail (sub-three-metre ground sampling distance), smaller detectable events (minimum mapping unit of three hectares), and improved system responsiveness. Rather than replacing existing Copernicus services, the IRIDE HEO capability is positioned as a complementary layer providing image-driven pre-classification to support downstream emergency and land-management workflows. This work highlights the operational value of onboard intelligence for emerging low-latency EO service architectures.

Keywords: Earth Observation, Onboard Processing, IRIDE Programme, Burnt-area Mapping, Low-latency Architecture, Service-oriented Architecture, HEO Service Segment, Operational Value

56. ❌ Information as Structural Alignment: A Dynamical Theory of Continual Learning

Authors: Radu Negulescu Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training	0.0	0.0/10	0.0
Post-training	0.0	0.0/10	0.0
Instruction Tuning	0.0	0.0/10	0.0
RLHF	0.0	0.0/10	0.0
PEFT	0.0	0.0/10	0.0
Retrieval-Augmented Generation	0.0	0.0/10	0.0
Context Window Extension	0.0	0.0/10	0.0
KV Cache Compression	0.0	0.0/10	0.0
Chain of Thought	0.0	0.0/10	0.0
System 2 Thinking	0.0	0.0/10	0.0
Monte Carlo Tree Search AND LLM	0.0	0.0/10	0.0
Self-Correction	0.0	8.0/10	0.0
LLM Agents	0.0	0.0/10	0.0
Tool Use	0.0	0.0/10	0.0
Multi-agent Systems	0.0	0.0/10	0.0
Quantization	0.0	0.0/10	0.0
Speculative Decoding	0.0	0.0/10	0.0
Hallucination Mitigation	0.0	0.0/10	0.0
Mechanistic Interpretability	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging	0.0	0.0/10	0.0
In-context Learning	0.0	0.0/10	0.0
AI for Science	0.0	0.0/10	0.0

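Scoring rationale: The paper proposes a new continual learning framework (IBF) aimed at catastrophic forgetting, built on a dynamical theory of information as structural alignment. It is unrelated to most of the large-model keywords (LLMs, MoE, pre-training, fine-tuning, alignment, RAG, and so on), because those keywords refer specifically to large language models and associated techniques, whereas this paper presents a general machine learning theory framework. The only relevant keyword is "Self-Correction", because the abstract explicitly states that "self-correction arise from these dynamics"; however, this is a property that emerges from the framework rather than the paper's core focus, so it receives 8 points (some relevance, not central). The paper is validated on tasks such as CIFAR-100 and chess but does not involve bioinformatics or any other specific scientific field, so "AI for Science" also scores 0.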

!!! tip deepseek-chat TL;DR

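The paper proposes the Informational Buildup Framework (IBF), a new continual learning approach based on a dynamical theory of structural alignment. Across multiple tasks it achieves near-zero forgetting and positive backward transfer, outperforming baselines such as replay.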

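Abstract (translation)

Catastrophic forgetting is not an engineering failure; it is a mathematical consequence of storing knowledge as a global superposition of parameters. Existing methods, such as regularization, replay, and frozen subnetworks, all add external mechanisms on top of a shared-parameter substrate, and none derives retention from the learning dynamics themselves.
This paper proposes the Informational Buildup Framework (IBF), an alternative substrate for continual learning whose core premise is that information is the achievement of structural alignment rather than stored content. In IBF, two equations govern the dynamics: a Law of Motion that drives the configuration toward higher coherence, and Modification Dynamics that persistently reshape the coherence landscape in response to local discrepancies. Memory, agency, and self-correction emerge from these dynamics rather than being attached as separate modules.
We first demonstrate the full lifecycle in a transparent two-dimensional toy model and then validate the framework in three domains: a controlled non-stationary environment, chess evaluated independently by Stockfish, and Split-CIFAR-100 with a frozen ViT encoder. In all three, IBF achieves better retention than replay without storing raw data. We observe near-zero forgetting on CIFAR-100 (BT = -0.004), positive backward transfer in chess (+38.5 cp), and 43% less forgetting than replay in the controlled domain. In chess, the framework achieves a mean behavioral advantage of +88.9 +/- 2.8 cp under independent evaluation, exceeding the MLP and replay baselines.

Abstract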

Catastrophic forgetting is not an engineering failure. It is a mathematical consequence of storing knowledge as global parameter superposition. Existing methods, such as regularization, replay, and frozen subnetworks, add external mechanisms to a shared-parameter substrate. None derives retention from the learning dynamics themselves. This paper introduces the Informational Buildup Framework (IBF), an alternative substrate for continual learning, based on the premise that information is the achievement of structural alignment rather than stored content. In IBF, two equations govern the dynamics: a Law of Motion that drives configuration toward higher coherence, and Modification Dynamics that persistently deform the coherence landscape in response to localized discrepancies. Memory, agency, and self-correction arise from these dynamics rather than being added as separate modules. We first demonstrate the full lifecycle in a transparent two-dimensional toy model, then validate across three domains: a controlled non-stationary world, chess evaluated independently by Stockfish, and Split-CIFAR-100 with a frozen ViT encoder. Across all three, IBF achieves replay-superior retention without storing raw data. We observe near-zero forgetting on CIFAR-100 (BT = -0.004), positive backward transfer in chess (+38.5 cp), and 43% less forgetting than replay in the controlled domain. In chess, the framework achieves a mean behavioral advantage of +88.9 +/- 2.8 cp under independent evaluation, exceeding MLP and replay baselines.
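The BT figure quoted above can be made concrete with a small worked example. The sketch below assumes BT denotes the backward transfer metric commonly used in continual learning, that is, the average change in performance on earlier tasks once all tasks have been learned. The abstract does not define BT, and the function name and accuracy matrix are illustrative, not taken from the paper.

```python
from typing import Sequence


def backward_transfer(acc: Sequence[Sequence[float]]) -> float:
    """Average change on earlier tasks after training on all tasks.

    acc[i][j] is the performance on task j measured right after
    training on task i (0-indexed). Negative values indicate forgetting.
    """
    n_tasks = len(acc)
    if n_tasks < 2:
        raise ValueError("backward transfer needs at least two tasks")
    final = acc[n_tasks - 1]
    # Compare the final score on each earlier task with the score
    # obtained immediately after that task was learned.
    deltas = [final[j] - acc[j][j] for j in range(n_tasks - 1)]
    return sum(deltas) / (n_tasks - 1)


# Hypothetical accuracy matrix for three tasks (not the paper's data).
acc = [
    [0.90, 0.00, 0.00],
    [0.88, 0.85, 0.00],
    [0.89, 0.84, 0.80],
]
bt = backward_transfer(acc)
print(f"BT = {bt:+.3f}")
```

Under this definition, a value close to zero such as the reported -0.004 would mean earlier tasks lose almost nothing, while a positive value would mean later training improved them.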

Keywords: continual learning, catastrophic forgetting, structural alignment, Informational Buildup Framework, dynamic theory, memory retention, self-correction, backward transfer

57. ❌ The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Authors: Yongchao Wu, Aron Henriksson Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07102v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

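Scoring rationale: The paper studies activation-based steering of large language models, specifically how persona vectors affect answer generation and automated scoring in educational settings. It is therefore highly relevant to "Large Language Models" (10 points), since LLMs are its explicit subject. It is relevant to "Mixture of Experts" (8 points) because it compares a MoE model with dense models and finds that the MoE model is more sensitive in terms of calibration shift. It has some relevance to "Instruction Tuning OR Alignment OR Value Alignment" (5 points), because it examines how persona traits (such as good and evil) affect scoring calibration, which relates to value alignment. Other keywords, such as SLMs, Scaling Laws, and Pre-training, have no direct connection to the paper and score 0.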
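As a rough illustration of why a relevance of 10.0/10 still yields a score of 0.0 in the table above, the sketch below assumes that the Score column is the product of Weight and Relevance and that a paper passes when its total reaches the pass mark. The report header gives the pass mark as 5.0 + 0.8 × total weight (26.6 for a total weight of 27.0). The product rule, the pass comparison, and all function names are assumptions, not the report generator's actual code.

```python
from typing import Iterable, Tuple


def pass_mark(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark formula stated in the report header."""
    return base + slope * total_weight


def keyword_score(weight: float, relevance: float) -> float:
    # Assumption: per-keyword score = weight x relevance.
    return weight * relevance


def paper_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum the per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


# The three non-zero relevance rows from the table above, with the
# weights exactly as listed there (0.0).
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 5.0)]
total = paper_score(rows)
threshold = pass_mark(27.0)
passed = total >= threshold  # assumed comparison
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under these assumptions, any row with a listed weight of 0.0 contributes nothing regardless of its relevance, which is consistent with the 0.0 / 26.6 totals shown for these papers.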

!!! tip deepseek-chat TL;DR

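This study systematically evaluates how activation-steered persona vectors affect large language models in answer generation and automated scoring for education. Persona steering lowers answer quality and causes predictable calibration shifts in scoring; English Language Arts tasks are more sensitive than science tasks, and the Mixture-of-Experts model shows larger calibration shifts than the dense models.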

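Abstract (translation)

Activation-based steering can personalize large language models at inference time, but its effects in educational settings remain unclear. Using the ASAP-SAS benchmark, this study analyzes persona vectors for seven character traits in short-answer generation and automated scoring across three models spanning two architectures. Persona steering lowers answer quality overall, and open-ended English Language Arts (ELA) prompts are affected far more than factual science prompts; interpretive and argumentative tasks are up to 11x more sensitive. On the scoring side, we observe predictable valence-aligned calibration shifts: evil and impolite scorers grade more harshly, while good and optimistic scorers grade more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and the Mixture-of-Experts model shows calibration shifts roughly 6x larger than the dense models. To our knowledge, this is the first study to systematically examine the effects of activation-steered persona traits on generation and scoring in education, and the results underline the need for task-aware and architecture-aware calibration when deploying steered models in educational settings.

Abstract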

Activation-based steering can personalize large language models at inference time, but its effects in educational settings remain unclear. We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures. Persona steering lowers answer quality overall, with much larger effects on open-ended English Language Arts (ELA) prompts than on factual science prompts; interpretive and argumentative tasks are up to 11x more sensitive. On the scoring side, we observe predictable valence-aligned calibration shifts: evil and impolite scorers grade more harshly, while good and optimistic scorers grade more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and the Mixture-of-Experts model shows roughly 6x larger calibration shifts than the dense models. To our knowledge, this is the first study to systematically examine the effects of activation-steered persona traits in educational generation and scoring, and the results highlight the need for task-aware and architecture-aware calibration when deploying steered models in educational settings.

Keywords: Large Language Models, Activation-based Steering, Persona Vectors, Educational Applications, Automated Scoring, Mixture-of-Experts, Calibration Shifts, ASAP-SAS Benchmark

58. ❌ SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

Authors: Qizhou Wang, Guansong Pang, Christopher Leckie Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

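Scoring rationale: The paper is mainly about constructing an image forgery detection dataset. The abstract mentions only that a "multimodal LLM-powered pipeline" is used to generate the data, so it has moderate relevance to "Large Language Models" (5 points), but this is not the paper's core research topic. No other keyword is mentioned in, or related to, the title or abstract, so each scores 0.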

!!! tip deepseek-chat TL;DR

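The paper introduces the SurFITR dataset to address the poor generalisation of existing image forgery detectors to surveillance scenes, where tampered regions are localised and subtle. Experiments show that training on the dataset substantially improves detection performance.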

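Abstract (translation)

We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for detecting and localising forgeries in surveillance-style images, in response to concerns about falsified visual evidence raised by recent advances in open-access image generation models. Existing forgery detection models are typically trained on datasets of full-image synthesis or large manipulated regions in object-centric images and struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is usually localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To fill this gap, SurFITR uses a multimodal LLM-powered pipeline to generate a large collection of forensically valuable images, enabling semantically aware, fine-grained editing across diverse surveillance scenes. The dataset contains over 137k tampered images with varying resolutions and edit types, generated with multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR substantially improves both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

Abstract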

We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

Keywords: surveillance image forgery detection, forgery localization, multimodal LLM, image editing models, dataset, cross-domain performance, tampered images, forensic imagery

59. ❌ STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

Authors: Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu, Chao Gao Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07100v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

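Scoring rationale: The paper proposes the STRIDE-ED framework, whose core is empathetic dialogue generation with LLMs. It involves supervised fine-tuning (SFT) and multi-objective reinforcement learning (related to RLHF), and it emphasizes structured, multi-step reasoning (CoT and System 2 Thinking). Data construction involves LLM-based annotation and quality evaluation (related to Data Quality), and the framework design stresses interpretability (Explainable AI). Other keywords, such as MoE, quantization, and RAG, do not appear in the abstract and therefore score low or 0.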

!!! tip deepseek-chat TL;DR

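The study addresses the lack of a strategy framework and multi-stage reasoning in empathetic dialogue systems. It proposes the STRIDE-ED framework, which combines strategy-aware data construction with two-stage training and outperforms existing methods across multiple LLMs.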

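Abstract (translation)

Empathetic dialogue requires not only recognizing the user's emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. Existing approaches, however, generally lack a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data, which fundamentally limits their ability to model empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a strategy-grounded, interpretable, deep-reasoning framework that models empathetic dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline that integrates LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to build high-quality training data aligned with empathetic strategies. We also adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behavior with target emotions, empathetic strategies, and response formats. Extensive experiments show that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluation.

Abstract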

Empathetic dialogue requires not only recognizing a user’s emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

Keywords: Empathetic Dialogue, Strategy-aware Reasoning, LLM-based Annotation, Supervised Fine-tuning, Multi-objective Reinforcement Learning, Stepwise Reasoning, Data Refinement, Interpretable Framework

60. ❌ Flow Motion Policy: Manipulator Motion Planning with Flow Matching Models

Authors: Davood Soleymanzadeh, Xiao Liang, Minghui Zheng Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07084v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

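Scoring rationale: The paper focuses on robot motion planning, using flow matching models to generate paths, and belongs to robotics and control. The keywords all center on large language models, the principles of deep learning techniques, and their applications in science; the paper does not address large models, those principles, or AI for Science, so it is unrelated to every keyword.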

!!! tip deepseek-chat TL;DR

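Existing motion planners for robotic manipulators tend to generate a single path and do not exploit their open-loop structure for inference-time optimization. The paper proposes Flow Motion Policy, which uses flow matching models to capture the inherent multi-modality of planning datasets, enabling efficient best-of-N sampling at inference time and improving planning success and efficiency.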

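Abstract (translation)

Open-loop end-to-end neural motion planners have recently been proposed to improve motion planning for robotic manipulators. These methods plan directly from sensor observations, without relying on a privileged collision checker during planning. However, many existing methods generate only a single path for a given workspace across runs and do not use their open-loop structure for inference-time optimization. To address this limitation, we propose Flow Motion Policy, an open-loop, end-to-end neural motion planner for robotic manipulators. It uses the stochastic generative formulation of flow matching methods to capture the inherent multi-modality of planning datasets. By modeling a distribution over feasible paths, Flow Motion Policy enables efficient best-of-N sampling at inference time: it generates multiple end-to-end candidate paths, evaluates their collision status after planning, and executes the first collision-free solution. We benchmark Flow Motion Policy against representative sampling-based and neural motion planning methods. The results show that it improves planning success and efficiency, highlighting the effectiveness of stochastic generative policies for end-to-end motion planning and inference-time optimization. Experimental evaluation videos are available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/FMP-Website.mp4.

Abstract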

Open-loop end-to-end neural motion planners have recently been proposed to improve motion planning for robotic manipulators. These methods enable planning directly from sensor observations without relying on a privileged collision checker during planning. However, many existing methods generate only a single path for a given workspace across different runs, and do not leverage their open-loop structure for inference-time optimization. To address this limitation, we introduce Flow Motion Policy, an open-loop, end-to-end neural motion planner for robotic manipulators that leverages the stochastic generative formulation of flow matching methods to capture the inherent multi-modality of planning datasets. By modeling a distribution over feasible paths, Flow Motion Policy enables efficient inference-time best-of-$N$ sampling. The method generates multiple end-to-end candidate paths, evaluates their collision status after planning, and executes the first collision-free solution. We benchmark the Flow Motion Policy against representative sampling-based and neural motion planning methods. Evaluation results demonstrate that Flow Motion Policy improves planning success and efficiency, highlighting the effectiveness of stochastic generative policies for end-to-end motion planning and inference-time optimization. Experimental evaluation videos are available at https://zh.engr.tamu.edu/wp-content/uploads/sites/310/2026/03/FMP-Website.mp4.
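The best-of-N procedure described in the abstract (sample several candidate paths, check them for collisions after planning, execute the first collision-free one) can be sketched as follows. This is a minimal illustration only: `toy_sampler` stands in for the learned flow matching model, `toy_checker` for a real collision checker, and both, together with the 2D waypoints and the circular obstacle, are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Waypoint = Tuple[float, float]
Path = List[Waypoint]


def best_of_n(
    sample_path: Callable[[random.Random], Path],
    is_collision_free: Callable[[Path], bool],
    n: int,
    rng: random.Random,
) -> Optional[Path]:
    """Draw n candidate paths, then return the first collision-free one."""
    candidates = [sample_path(rng) for _ in range(n)]
    # Collision checking happens only after all candidates are generated.
    for path in candidates:
        if is_collision_free(path):
            return path
    return None  # no valid candidate among the n samples


def toy_sampler(rng: random.Random) -> Path:
    """Hypothetical stochastic planner: start, random midpoint, goal."""
    mid = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
    return [(0.0, 0.0), mid, (1.0, 1.0)]


def toy_checker(path: Path, center: Waypoint = (0.5, 0.5), radius: float = 0.2) -> bool:
    """Hypothetical check: waypoints only, against one circular obstacle."""
    cx, cy = center
    return all((x - cx) ** 2 + (y - cy) ** 2 > radius ** 2 for x, y in path)


result = best_of_n(toy_sampler, toy_checker, n=16, rng=random.Random(0))
print("found a path" if result is not None else "no collision-free path")
```

Because the sampler is stochastic, a larger `n` raises the chance that at least one candidate passes the check, at the cost of more sampling and checking work.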

Keywords: motion planning, robotic manipulators, flow matching, neural motion planner, end-to-end, inference-time optimization, stochastic generative policy, collision-free path

61. ❌ EVGeoQA: Benchmarking LLMs on Dynamic, Multi-Objective Geo-Spatial Exploration

Authors: Jianfei Wu, Zhichun Wang, Zhensheng Wang, Zhiyu He Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07070v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

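Scoring rationale: The paper studies the exploration abilities of LLMs in dynamic geo-spatial environments and directly involves LLMs, LLM Agents, and Tool Use, so these three keywords are highly relevant (10 points). It evaluates LLM reasoning, which involves multi-step planning and deliberate thinking, giving some relevance to Chain of Thought and System 2 Thinking (5 points). Other keywords, such as MoE, SLMs, training methods, optimization techniques, and scientific AI applications, are not covered by the paper and score 0.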

!!! tip deepseek-chat TL;DR

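Existing geo-spatial question answering benchmarks are limited to static retrieval. The paper proposes the EVGeoQA benchmark and the GeoRover evaluation framework to assess LLMs on dynamic, multi-objective geo-spatial exploration. It finds that LLMs can use tools to handle sub-tasks but struggle with long-range spatial exploration, and that they show an emergent ability to improve efficiency by summarizing past trajectories.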

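Abstract (translation)

Although large language models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks focus mainly on static retrieval and fail to capture the complexity of real-world planning, which involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a new benchmark built on electric vehicle (EV) charging scenarios with a distinctive location-anchored, dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to the user's real-time coordinates and combines two objectives: a charging need and a preference for a co-located activity. To assess models systematically in such settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture that evaluates LLMs' capacity for dynamic, multi-objective exploration. Experiments show that although LLMs successfully use tools to handle sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to improve exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence research. The dataset and prompts are available at https://github.com/Hapluckyy/EVGeoQA/.

Abstract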

While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, their potential for purpose-driven exploration in dynamic geo-spatial environments remains under-investigated. Existing Geo-Spatial Question Answering (GSQA) benchmarks predominantly focus on static retrieval, failing to capture the complexity of real-world planning that involves dynamic user locations and compound constraints. To bridge this gap, we introduce EVGeoQA, a novel benchmark built upon Electric Vehicle (EV) charging scenarios that features a distinct location-anchored and dual-objective design. Specifically, each query in EVGeoQA is explicitly bound to a user’s real-time coordinate and integrates the dual objectives of a charging necessity and a co-located activity preference. To systematically assess models in such complex settings, we further propose GeoRover, a general evaluation framework based on a tool-augmented agent architecture to evaluate the LLMs’ capacity for dynamic, multi-objective exploration. Our experiments reveal that while LLMs successfully utilize tools to address sub-tasks, they struggle with long-range spatial exploration. Notably, we observe an emergent capability: LLMs can summarize historical exploration trajectories to enhance exploration efficiency. These findings establish EVGeoQA as a challenging testbed for future geo-spatial intelligence. The dataset and prompts are available at https://github.com/Hapluckyy/EVGeoQA/.

Keywords: Large Language Models, Geo-Spatial Question Answering, Dynamic Exploration, Multi-Objective Planning, Tool-Augmented Agents, EV Charging Scenarios, Benchmark Evaluation, Spatial Intelligence

62. ❌ Planning Task Shielding: Detecting and Repairing Flaws in Planning Tasks through Turning them Unsolvable

Authors: Alberto Pozanco, Marianela Morales, Pietro Totis, Daniel Borrajo Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

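Scoring rationale: The paper studies task shielding in classical planning: detecting and repairing flaws in planning tasks, and ensuring system safety by making the planning task unsolvable. It is built entirely around classical AI planning algorithms (such as the allmin algorithm) and does not involve large language models, deep learning, model training, inference optimization, alignment techniques, agent systems, or scientific AI applications. Every keyword relates to large models and deep learning, whereas this paper belongs to classical symbolic AI planning, so every keyword has a relevance of 0.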

!!! tip deepseek-chat TL;DR

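The paper introduces the planning task shielding problem and develops the allmin algorithm to detect and repair flaws in planning tasks. The algorithm makes a planning task unsolvable with minimal modifications to the original actions, thereby ensuring system safety.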

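Abstract (translation)

Most planning research focuses on generating plans that achieve a desired set of goals. However, a goal specification can also encode a property that should never hold, allowing a planner to identify a trace that would reach a flawed state. In such cases, the objective may shift to modifying the planning task so that the flawed state is never reached; in other words, making the planning task unsolvable. This paper introduces planning task shielding: the problem of detecting and repairing flaws in planning tasks. We propose $allmin$, an optimal algorithm that solves this problem by minimally modifying the original actions to make the planning task unsolvable. We empirically evaluate $allmin$ on shielding planning tasks of increasing size, showing how it can effectively shield a system by making the planning task unsolvable.

Abstract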

Most research in planning focuses on generating a plan to achieve a desired set of goals. However, a goal specification can also be used to encode a property that should never hold, allowing a planner to identify a trace that would reach a flawed state. In such cases, the objective may shift to modifying the planning task to ensure that the flawed state is never reached; in other words, to make the planning task unsolvable. In this paper we introduce planning task shielding: the problem of detecting and repairing flaws in planning tasks. We propose $allmin$, an optimal algorithm that solves these tasks by minimally modifying the original actions to render the planning task unsolvable. We empirically evaluate the performance of $allmin$ in shielding planning tasks of increasing size, showing how it can effectively shield the system by turning the planning task unsolvable.

Keywords: planning task shielding, flaw detection, flaw repair, unsolvable planning tasks, allmin algorithm, action modification, planning systems, system safety

63. ❌ AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views

Authors: Minh Tam Pham, Trinh Pham, Tong Chen, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

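Scoring rationale: The paper's core is using LLM agents to solve complex Text-to-SQL tasks, so it is highly relevant to 'LLM Agents' and 'Multi-agent Systems' (10 each). It explicitly uses LLMs (10) and involves multi-step reasoning (Chain of Thought, 10) and in-depth reasoning (System 2 Thinking, 5). Because large database schemas can exceed the context window, it has some connection to 'Context Window Extension' (5). The framework includes a revisor agent, which relates to 'Self-Correction' (5). Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the AV-SQL framework, which decomposes complex Text-to-SQL tasks into a pipeline of specialized LLM agents to handle large database schemas and multi-step reasoning. It reports 70.38% execution accuracy on Spider 2.0, outperforming state-of-the-art baselines, while remaining competitive on Spider, BIRD, and KaggleDBQA.

Abstract (Chinese translation)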

Text-to-SQL 是一项将自然语言查询转换为针对给定数据库的可执行 SQL 的任务，使非专业用户无需手动编写 SQL 即可访问结构化数据。尽管大型语言模型（LLMs）推动了该领域的快速发展，现有方法在处理现实场景中的复杂查询时仍面临困难，这些场景通常涉及庞大的数据库模式，且问题需要在多个相互关联的表上进行多步推理。在此类情况下，提供完整模式常常超出上下文窗口限制，而一次性生成则常因语法错误和模式链接不正确而产生不可执行的 SQL。为应对这些挑战，我们提出了 AV-SQL 框架，该框架将复杂的 Text-to-SQL 任务分解为由专用 LLM 智能体组成的流水线。AV-SQL 的核心是智能体视图（agentic views）的概念：即由智能体生成的公共表表达式（CTEs），它们封装了中间查询逻辑，并能从大型模式中筛选出相关的模式元素。AV-SQL 分三个阶段运行：（1）重写智能体对输入查询进行压缩和澄清；（2）视图生成智能体处理模式块以生成智能体视图；（3）规划器、生成器和修订器智能体协作将这些视图组合成最终的 SQL 查询。大量实验表明，AV-SQL 在具有挑战性的 Spider 2.0 基准测试上实现了 70.38% 的执行准确率，优于现有最先进的基线方法，同时在标准数据集上保持竞争力，在 Spider 上达到 85.59%，在 BIRD 上达到 72.16%，在 KaggleDBQA 上达到 63.78%。我们的源代码可在 https://github.com/pminhtam/AV-SQL 获取。

Abstract

Text-to-SQL is the task of translating natural language queries into executable SQL for a given database, enabling non-expert users to access structured data without writing SQL manually. Despite rapid advances driven by large language models (LLMs), existing approaches still struggle with complex queries in real-world settings, where database schemas are large and questions require multi-step reasoning over many interrelated tables. In such cases, providing the full schema often exceeds the context window, while one-shot generation frequently produces non-executable SQL due to syntax errors and incorrect schema linking. To address these challenges, we introduce AV-SQL, a framework that decomposes complex Text-to-SQL into a pipeline of specialized LLM agents. Central to AV-SQL is the concept of agentic views: agent-generated Common Table Expressions (CTEs) that encapsulate intermediate query logic and filter relevant schema elements from large schemas. AV-SQL operates in three stages: (1) a rewriter agent compresses and clarifies the input query; (2) a view generator agent processes schema chunks to produce agentic views; and (3) a planner, generator, and revisor agent collaboratively compose these views into the final SQL query. Extensive experiments show that AV-SQL achieves 70.38% execution accuracy on the challenging Spider 2.0 benchmark, outperforming state-of-the-art baselines, while remaining competitive on standard datasets with 85.59% on Spider, 72.16% on BIRD and 63.78% on KaggleDBQA. Our source code is available at https://github.com/pminhtam/AV-SQL.

Keywords: Text-to-SQL, Large Language Models, LLM Agents, Multi-step Reasoning, Agentic Views, Database Schema, Query Decomposition, Execution Accuracy
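
The abstract describes agentic views as Common Table Expressions (CTEs) that hold intermediate query logic and are later composed into the final SQL. A minimal sketch of that composition idea, assuming a toy two-table schema and a hypothetical `compose` helper (neither comes from the paper or its repository), and with hand-written views standing in for agent output:

```python
import sqlite3

# Hypothetical toy schema; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 3, 2.5);
""")

# Each view is a named CTE body that captures one intermediate step and
# references only the schema elements that step needs. In AV-SQL these would
# be generated by an agent; here they are written by hand.
views = {
    "eu_customers": "SELECT id FROM customers WHERE region = 'EU'",
    "eu_orders": (
        "SELECT o.amount FROM orders AS o "
        "JOIN eu_customers AS c ON o.customer_id = c.id"
    ),
}


def compose(views, final_select):
    """Chain the views into one WITH clause, in insertion order."""
    ctes = ",\n".join(f"{name} AS ({body})" for name, body in views.items())
    return f"WITH {ctes}\n{final_select}"


query = compose(views, "SELECT SUM(amount) FROM eu_orders")
total = conn.execute(query).fetchone()[0]
print(query)
print(total)  # 17.5
```

Because later views may reference earlier ones by name, the final `SELECT` only has to mention the last view, which is one way a composed query can stay short even when the underlying schema is large.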

64. ❌ AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

Authors: Xue Qin, Simin Luan, Cong Yang, Zhijun Li Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07039v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

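Scoring rationale: The paper proposes the AEROS architecture, which models a robot as a single intelligent subject whose capabilities are extended through installable modules. This has some connection to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (5) because it concerns an intelligent-agent architecture, and to 'Tool Use OR Function Calling OR API Tool Use' (5) because ECMs encapsulate executable skills and tools. The paper does not address large models, deep learning principles, or scientific applications, and mentions none of the specific techniques in the keyword list (LLMs, MoE, training methods, inference techniques, and so on), nor bioinformatics or other scientific applications, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study addresses the lack of a unified architecture for organizing intelligence in robotic systems and proposes AEROS, which models a robot as a single persistent intelligent subject whose capabilities are extended through installable modules. In PyBullet experiments it reaches 100% task success across three tasks, against 67% to 93% for the baselines, and its policy layer blocks all invalid actions with zero false acceptances.

Abstract (Chinese translation)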

机器人系统缺乏一种统一组织智能、能力与执行过程的原则性抽象框架。现有方法要么将技能耦合在单体架构中，要么将功能分解为松散协调的模块或多智能体，通常缺乏关于身份与控制权限的一致性模型。我们认为，机器人应被建模为一个持续存在的单一智能主体，其能力可通过可安装的软件包进行扩展。我们将这一观点形式化为AEROS（智能体执行运行时操作系统），其中每个机器人对应一个持续存在的智能体，能力通过具身能力模块（Embodied Capability Modules, ECMs）提供。每个ECM封装了可执行技能、模型与工具，而执行约束与安全保障则由策略分离的运行时系统强制执行。这种分离实现了模块化可扩展性、可组合的能力执行以及一致的系统级安全性。我们在PyBullet仿真环境中使用Franka Panda七自由度机械臂对参考实现进行了评估，涵盖重新规划、故障恢复、策略执行、基线对比、跨任务泛化性、ECM热插拔、消融实验与故障边界分析八项实验。在每种条件下超过100次随机试验中，AEROS在三个任务中实现了100%的任务成功率（基线方法中BehaviorTree.CPP风格与ProgPrompt风格为92-93%，扁平化流程为67-73%），策略层成功拦截所有无效动作且误接受率为零，运行时优势无需任务特定调整即可跨任务泛化，ECMs在运行时加载后实现100%的交换后成功率。

Abstract

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (BehaviorTree.CPP-style and ProgPrompt-style at 92–93%, flat pipeline at 67–73%), the policy layer blocks all invalid actions with zero false acceptances, runtime benefits generalize across tasks without task-specific tuning, and ECMs load at runtime with 100% post-swap success.

Keywords: robotic systems, single persistent agent, embodied capability modules, modular extensibility, policy-separated runtime, system-level safety, hot-swapping, task success rate
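
The separation described in the abstract (capabilities supplied by installable modules, constraints enforced by a runtime outside them) can be illustrated with a small sketch. The `Module` and `Agent` classes and the `speed_limit` policy below are hypothetical names chosen for illustration; they are not the AEROS API, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Module:
    """Hypothetical stand-in for an ECM: a named bundle of executable skills."""
    name: str
    skills: Dict[str, Callable[..., str]]


@dataclass
class Agent:
    """One persistent agent; capabilities come only from installed modules."""
    policy: Callable[[str, dict], bool]
    modules: Dict[str, Module] = field(default_factory=dict)

    def install(self, module: Module) -> None:
        # Installing under an existing name replaces the module (hot-swap).
        self.modules[module.name] = module

    def execute(self, module_name: str, skill: str, **args) -> str:
        # The policy check lives in the runtime, not inside the module.
        if not self.policy(skill, args):
            return "blocked"
        return self.modules[module_name].skills[skill](**args)


def speed_limit(skill: str, args: dict) -> bool:
    """Illustrative policy: reject any action faster than 1.0 (arbitrary units)."""
    return args.get("speed", 0.0) <= 1.0


arm = Agent(policy=speed_limit)
arm.install(Module("grasp", {"pick": lambda obj, speed: f"picked {obj}"}))
ok = arm.execute("grasp", "pick", obj="cube", speed=0.5)
blocked = arm.execute("grasp", "pick", obj="cube", speed=2.0)

# Swap in a new version of the module while the agent keeps running.
arm.install(Module("grasp", {"pick": lambda obj, speed: f"picked {obj} (v2)"}))
swapped = arm.execute("grasp", "pick", obj="cube", speed=0.5)
print(ok, blocked, swapped, sep=" | ")
```

Keeping the policy outside the modules means a newly installed module cannot bypass it, which is the property the abstract attributes to the policy-separated runtime.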

65. ❌ KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

Authors: Mehdi Hosseinzadeh, King Hang Wong, Feras Dayoub Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

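Scoring rationale: The paper proposes KITE, a framework that converts robot-execution videos into tokenized evidence for vision-language models (VLMs) for robot failure analysis. Keyword relevance: (1) the paper uses a VLM, which falls under large models but is not a text-only LLM (5); (2) it mentions fine-tuning with QLoRA, a parameter-efficient fine-tuning technique (5); (3) it emphasizes interpretability (interpretable tokenized evidence), which relates to explainable AI (5); (4) it is applied to robot failure analysis, an application of AI in science and engineering (5). Other keywords such as MoE, Scaling Laws, RLHF, and RAG have no direct connection to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes the KITE framework, which converts long robot-execution videos into compact, interpretable tokenized evidence so that off-the-shelf vision-language models can perform robot failure detection, identification, localization, explanation, and correction. On the RoboFAC benchmark it substantially improves over vanilla Qwen2.5-VL in the training-free setting.

Abstract (Chinese translation)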

本文提出KITE，一种免训练、关键帧锚定、布局接地的前端系统，可将长时机器人执行视频转化为紧凑且可解释的标记化证据，供视觉语言模型（VLMs）使用。KITE将每条运动轨迹提炼为少量具有开放词汇检测结果的运动显著关键帧，并为每个关键帧配以示意性鸟瞰图（BEV）表示，该表示编码了物体相对布局、坐标轴、时间戳及检测置信度。这些视觉线索与机器人配置和场景上下文标记共同序列化为统一提示，使得同一前端能够支持基于现成VLM的故障检测、识别、定位、解释与修正。在RoboFAC基准测试中，搭载Qwen2.5-VL的KITE在免训练设置下显著优于原始Qwen2.5-VL，尤其在仿真故障检测、识别与定位任务上取得大幅提升，同时与经过RoboFAC调优的基线模型保持竞争力。通过小规模QLoRA微调可进一步提升解释与修正质量。我们还在真实双臂机器人上展示了定性结果，证明KITE作为机器人故障分析的结构化、可解释前端具有实际适用性。代码与模型已发布于项目页面：https://m80hz.github.io/kite/

Abstract

We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird’s-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/

Keywords: vision-language models, robot failure analysis, keyframe extraction, interpretable evidence, training-free front-end, QLoRA fine-tuning, RoboFAC benchmark

66. ❌ Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Authors: Philipp D. Siedler Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

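Scoring rationale: The paper's core is the application of LLM-based multi-agent systems to legal argumentation, so it is highly relevant to 'Large Language Models', 'LLM Agents', and 'Multi-agent Systems' (10 each). It involves interpretable trait design, which has some connection to 'Mechanistic Interpretability' (5). The study is applied to the legal domain, an application of AI to a specific field, which gives it some connection to 'AI for Science' (5). Other keywords such as MoE, SFT, RAG, and quantization are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to build multi-agent systems from trait-conditioned large language models that conduct iterative legal argumentation in a simulated courtroom. It finds that heterogeneous teams outperform homogeneous ones and develops a reinforcement-learning-based Trait Orchestrator that optimizes strategies dynamically.

Abstract (Chinese translation)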

在法律、外交与谈判等对抗性领域中，战略互动通过语言进行中介，然而大多数博弈论模型忽略了通过话语运作的说服机制。本文提出“战略法庭框架”，这是一个多智能体模拟环境，其中由特质条件化大语言模型智能体组成的控辩双方团队进行迭代式、回合制的法律论证。智能体通过九个可解释特质进行实例化，这些特质被组织为四种原型，从而实现对修辞风格与战略取向的系统性控制。
我们在10个合成法律案例和84种三特质团队配置中对该框架进行评估，使用DeepSeek-R1和Gemini 2.5 Pro模型共计模拟了超过7,000场审判。研究结果表明：具有互补特质的异质团队始终优于同质配置；适度的交互深度能产生更稳定的判决结果；某些特质（尤其是量化分析与魅力型特质）对说服成功的贡献尤为突出。我们进一步引入一种基于强化学习的“特质编排器”，能够根据具体案件与对手团队动态生成辩护方特质，其所发现的策略优于静态的人工设计特质组合。
这些发现共同论证了如何将语言视为一级战略行动空间，并为构建能够在多智能体环境中实现自适应说服的自主智能体奠定了基础。

Abstract

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7,000 simulated trials using DeepSeek-R1 and Gemini 2.5 Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

Keywords: Multi-agent Systems, Large Language Models, Legal Argumentation, Strategic Persuasion, Trait-conditioned Agents, Reinforcement Learning, Autonomous Agents, Simulation Environment
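
The figure of 84 three-trait team configurations is consistent with choosing 3 distinct traits out of the nine, since C(9, 3) = 84. The abstract does not state that this is how the configurations were enumerated, so the sketch below only checks the arithmetic, using placeholder trait names:

```python
from itertools import combinations
from math import comb

# Placeholder names; the abstract does not list the nine traits.
traits = [f"trait_{i}" for i in range(1, 10)]

# Every unordered team of three distinct traits.
teams = list(combinations(traits, 3))

print(len(teams))   # 84
print(comb(9, 3))   # 84
```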

67. ❌ ConceptTracer: Interactive Analysis of Concept Saliency and Selectivity in Neural Representations

Authors: Ricardo Knauer, Andre Beinrucker, Erik Rodner Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07019v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

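Scoring rationale: The paper studies an interpretability analysis tool for neural network representations and is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10), since its core is an interactive analysis framework for exploring concept-level information in neural representations. It has some connection to 'Large Language Models OR LLMs OR Foundation Models' (5) because it uses tabular foundation models (TabPFN) as its application case, and these fall under foundation models. No other keyword is mentioned in the title or abstract, so all others are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

Addressing the opacity of neural network decision-making, the paper develops ConceptTracer, an interactive application that analyzes concept saliency and selectivity with information-theoretic measures to help researchers find interpretable neurons in neural representations, and demonstrates its utility on TabPFN.

Abstract (Chinese translation)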

神经网络在各种任务中展现出卓越的预测性能，但其决策过程往往缺乏透明度。尽管对机制可解释性的关注日益增长，但用于系统探索神经网络（尤其是表格基础模型）所学表征的工具仍然有限。本研究提出了ConceptTracer——一个通过人类可解释概念视角分析神经表征的交互式应用。ConceptTracer整合了两种信息论度量方法，用于量化概念显著性与选择性，使研究者和实践者能够识别对特定概念产生强烈响应的神经元。我们在TabPFN学习到的表征上验证了ConceptTracer的实用性，结果表明该方法能有效促进可解释神经元的发现。这些功能共同构成了一个研究神经网络（如TabPFN）如何编码概念级信息的实用框架。ConceptTracer可通过https://github.com/ml-lab-htw/concept-tracer获取。

Abstract

Neural networks deliver impressive predictive performance across a variety of tasks, but they are often opaque in their decision-making processes. Despite a growing interest in mechanistic interpretability, tools for systematically exploring the representations learned by neural networks in general, and tabular foundation models in particular, remain limited. In this work, we introduce ConceptTracer, an interactive application for analyzing neural representations through the lens of human-interpretable concepts. ConceptTracer integrates two information-theoretic measures that quantify concept saliency and selectivity, enabling researchers and practitioners to identify neurons that respond strongly to individual concepts. We demonstrate the utility of ConceptTracer on representations learned by TabPFN and show that our approach facilitates the discovery of interpretable neurons. Together, these capabilities provide a practical framework for investigating how neural networks like TabPFN encode concept-level information. ConceptTracer is available at https://github.com/ml-lab-htw/concept-tracer.

Keywords: neural representations, interpretability, concept saliency, concept selectivity, tabular foundation models, TabPFN, mechanistic interpretability, interactive analysis
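
The abstract does not define ConceptTracer's two measures. As a generic illustration of what an information-theoretic association score between a neuron and a concept can look like, the sketch below computes the mutual information between a thresholded activation and a binary concept label. The threshold, the data, and the `mutual_information` helper are all assumptions made for this example, not the paper's saliency or selectivity definitions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Made-up activations of one neuron on eight inputs.
activations = [0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.0]
fired = [int(a > 0.5) for a in activations]  # assumed threshold of 0.5

concept_a = [1, 1, 1, 1, 0, 0, 0, 0]  # present exactly when the neuron fires
concept_b = [1, 1, 0, 0, 1, 1, 0, 0]  # statistically independent of firing

mi_a = mutual_information(fired, concept_a)
mi_b = mutual_information(fired, concept_b)
print(mi_a)  # 1.0
print(mi_b)  # 0.0
```

A neuron that scores high for one concept and near zero for the others is the kind of unit a selectivity-style analysis would surface, whatever the exact measure used.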

68. ❌ A-MBER: Affective Memory Benchmark for Emotion Recognition

Authors: Deliang Wen, Ke Sun, Yu Wang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	8.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

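Scoring rationale: The paper proposes the A-MBER benchmark for evaluating whether an AI assistant can understand a user's affective state from long-term interaction history. The work mainly concerns LLM applications in emotion recognition and memory-based reasoning and is rated highly relevant to the following keywords: LLMs (8, it evaluates AI assistants' affective understanding), RAG (8, it involves retrieved memory and evidence identification), Long Context LLMs (8, multi-session interaction history must be processed), CoT Reasoning (8, it requires trajectory-based reasoning and explanation), System 2 Thinking (8, it involves in-depth reasoning and explanation), LLM Agents (8, it evaluates AI assistants' affective interaction), Explainable AI (8, models must give evidence-grounded explanations), and In-context Learning (8, it involves contextual learning from past interactions). Other keywords such as MoE, SLMs, Scaling Laws, training methods, alignment techniques, optimization techniques, and multi-agent systems are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the A-MBER benchmark for evaluating whether AI assistants can interpret a user's current affective state from long-term interaction memory. Experiments show the benchmark is especially discriminative on long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings.

Abstract (Chinese translation)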

长期与用户交互的AI助手需要解读用户当前的情绪状态，以实现恰当且个性化的回应。然而，这一能力目前尚未得到充分评估。现有的情绪数据集主要评估局部或瞬时情感，而长期记忆基准则大多关注事实回忆、时间一致性或知识更新。因此，现有资源难以为测试模型能否利用记忆中的交互历史来解读用户当前情感状态提供有效支持。
我们提出A-MBER（情感记忆基准测试平台），用于评估这一能力。A-MBER聚焦于基于记忆中的多轮次交互历史对当前情感状态进行解读。给定一段交互轨迹和一个指定的锚点对话轮次，模型必须推断用户当前的情感状态，识别历史中的相关证据，并以有依据的方式证明其解读的合理性。该基准通过一个包含明确中间表示的阶段性流程构建，包括长程规划、对话生成、标注、问题构建和最终封装。它支持判断、检索和解释任务，并包含模态退化与证据不足等鲁棒性测试场景。
实验在统一框架内比较了局部上下文、长上下文、检索记忆、结构化记忆及黄金证据条件。结果表明，A-MBER在其设计强调的子集上具有显著区分度，包括长程隐式情感、高依赖性记忆层级、基于轨迹的推理及对抗性场景。这些发现表明，记忆对情感解读的支持并非仅通过提供更多历史信息实现，而是通过促进对过往交互更具选择性、有依据且情境敏感的利用。

Abstract

AI assistants that interact with users over time need to interpret the user’s current emotional state in order to respond appropriately and personally. However, this capability remains insufficiently evaluated. Existing emotion datasets mainly assess local or instantaneous affect, while long-term memory benchmarks focus largely on factual recall, temporal consistency, or knowledge updating. As a result, current resources provide limited support for testing whether a model can use remembered interaction history to interpret a user’s present affective state. We introduce A-MBER, an Affective Memory Benchmark for Emotion Recognition, to evaluate this capability. A-MBER focuses on present affective interpretation grounded in remembered multi-session interaction history. Given an interaction trajectory and a designated anchor turn, a model must infer the user’s current affective state, identify historically relevant evidence, and justify its interpretation in a grounded way. The benchmark is constructed through a staged pipeline with explicit intermediate representations, including long-horizon planning, conversation generation, annotation, question construction, and final packaging. It supports judgment, retrieval, and explanation tasks, together with robustness settings such as modality degradation and insufficient-evidence conditions. Experiments compare local-context, long-context, retrieved-memory, structured-memory, and gold-evidence conditions within a unified framework. Results show that A-MBER is especially discriminative on the subsets it is designed to stress, including long-range implicit affect, high-dependency memory levels, trajectory-based reasoning, and adversarial settings. These findings suggest that memory supports affective interpretation not simply by providing more history, but by enabling more selective, grounded, and context-sensitive use of past interaction.

Keywords: Affective Memory, Emotion Recognition, Long-term Interaction, Memory Benchmark, AI Assistants, Contextual Reasoning, Explainable AI, Multi-session History

69. ❌ CAFP: A Post-Processing Framework for Group Fairness via Counterfactual Model Averaging

Authors: Irina Arévalo, Marcos Oliva Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07009v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

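Scoring rationale: The paper studies a post-processing method for machine learning fairness (CAFP) that reduces the unfair influence of protected attributes through counterfactual model averaging. All scoring keywords focus on large models, the principles of deep learning techniques, and their applications (LLM architectures, training methods, inference optimization, AI for science, and so on), and the paper addresses none of these topics. It belongs to traditional machine learning fairness research and has no connection to any large-model technique or application area in the keyword list.

!!! tip deepseek-chat TL;DR

The paper proposes CAFP, a model-agnostic post-processing framework that averages predictions over factual and counterfactual instances to reduce a model's dependence on protected attributes. Under mild assumptions, the authors show that it achieves perfect demographic parity and reduces the equalized odds gap by at least half the average counterfactual bias.

Abstract (Chinese translation)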

确保机器学习预测的公平性是一项关键挑战，尤其在模型部署于信用评分、医疗保健和刑事司法等敏感领域时。尽管许多公平性干预措施依赖于训练阶段的数据预处理或算法约束，但这些方法通常需要对模型架构的完全控制以及获取受保护属性信息，这在现实系统中可能难以实现。本文提出了一种与模型无关的后处理方法——反事实平均公平预测（Counterfactual Averaging for Fair Predictions, CAFP），该方法无需重新训练或修改原始分类器，即可减轻受保护属性带来的不公平影响。CAFP通过为每个输入生成敏感属性被翻转的反事实版本，并对模型在事实实例与反事实实例上的预测结果进行平均来实现公平化。我们对CAFP进行了理论分析，证明其能够消除对受保护属性的直接依赖，减少预测与敏感属性之间的互信息，并可在理论上约束其相对于原始模型引入的失真程度。在温和假设下，我们进一步证明CAFP能够实现完全的人口统计均等，并将均衡几率差距至少降低至平均反事实偏差的一半。

Abstract

Ensuring fairness in machine learning predictions is a critical challenge, especially when models are deployed in sensitive domains such as credit scoring, healthcare, and criminal justice. While many fairness interventions rely on data preprocessing or algorithmic constraints during training, these approaches often require full control over the model architecture and access to protected attribute information, which may not be feasible in real-world systems. In this paper, we propose Counterfactual Averaging for Fair Predictions (CAFP), a model-agnostic post-processing method that mitigates unfair influence from protected attributes without retraining or modifying the original classifier. CAFP operates by generating counterfactual versions of each input in which the sensitive attribute is flipped, and then averaging the model’s predictions across factual and counterfactual instances. We provide a theoretical analysis of CAFP, showing that it eliminates direct dependence on the protected attribute, reduces mutual information between predictions and sensitive attributes, and provably bounds the distortion introduced relative to the original model. Under mild assumptions, we further show that CAFP achieves perfect demographic parity and reduces the equalized odds gap by at least half the average counterfactual bias.

Keywords: fairness, post-processing, counterfactual averaging, demographic parity, equalized odds, model-agnostic, protected attributes, machine learning
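
The averaging step described in the abstract (flip the sensitive attribute, then average the model's predictions on the factual and counterfactual inputs) can be sketched in a few lines. This is a minimal sketch, assuming a binary sensitive attribute encoded as 0.0 or 1.0 and a model that returns a single score; `cafp_predict`, `biased_model`, and the feature names are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def cafp_predict(model: Callable[[Features], float], x: Features,
                 sensitive_key: str = "group") -> float:
    """Average the model's score on x and on its counterfactual copy.

    Assumes the sensitive attribute is binary and encoded as 0.0 or 1.0.
    """
    x_cf = dict(x)                                # leave the original input untouched
    x_cf[sensitive_key] = 1.0 - x[sensitive_key]  # flip the sensitive attribute
    return 0.5 * (model(x) + model(x_cf))


def biased_model(x: Features) -> float:
    """Toy scorer that depends directly on the sensitive attribute."""
    return 0.25 + 0.5 * x["income"] + 0.25 * x["group"]


# Two individuals who differ only in the sensitive attribute.
a = {"income": 0.5, "group": 0.0}
b = {"income": 0.5, "group": 1.0}

raw_gap = biased_model(b) - biased_model(a)
fair_gap = cafp_predict(biased_model, b) - cafp_predict(biased_model, a)
print(raw_gap)   # 0.25
print(fair_gap)  # 0.0
```

The averaged score is symmetric in the sensitive attribute, so two inputs that differ only in that attribute receive the same output. This matches the abstract's claim about removing direct dependence. Indirect dependence through correlated features is not addressed by this toy example.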

70. ❌ AgentCity: Constitutional Governance for Autonomous Agent Economies via Separation of Power

Authors: Anbang Ruan, Xing Zhang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07007v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

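Scoring rationale: The paper's core subject is a governance architecture for multi-agent systems, so it is highly relevant (10 points) to 'LLM Agents OR Autonomous Agents OR Agentic Workflow' and 'Multi-agent Systems OR Agent Coordination': it directly studies large-scale collaboration and governance of autonomous agents. It has some relevance (8 points) to 'Instruction Tuning OR Alignment OR Value Alignment', because it proposes aligning agents with human intent through an accountability chain, although it does not cover a concrete technical implementation. The remaining keywords (large-model techniques, training methods, inference optimization, and so on) are not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the "Logic Monopoly" problem that arises when autonomous agents collaborate at scale. It proposes AgentCity, a blockchain-based governance architecture built on a three-way separation of power, in which an accountability chain aligns each agent with human intent. The thesis is evaluated in a pre-registered commons production economy experiment with 50 to 1,000 agents.

Abstract translation (Chinese)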

自主人工智能代理正开始在开放互联网上跨越组织边界运作——在没有集中监管的情况下发现、与其他方拥有的代理进行交易并委托其执行任务。当来自不同人类委托方的代理大规模协作时，集体行为将变得不透明：没有任何单个人类能够观察、审计或治理这种涌现行为。我们将此称为“逻辑垄断”——即代理社会对从规划、执行到评估的整个逻辑链拥有不受制约的垄断权。我们提出“权力分立”模型，这是一种部署在公共区块链上的宪政治理架构，通过三重结构性分离打破此种垄断：代理将操作规则制定为智能合约（即立法），确定性软件在这些合约内执行（即行政），而人类通过完整的归属链进行裁决（即司法），该链条将每个代理绑定至一个责任主体。在此架构中，智能合约即法律本身——是代理实际产出并规范其行为的立法成果。我们在兼容EVM的二层区块链上构建的AgentCity中实例化了权力分立模型，采用三层合约层级（基础合约、元合约与操作合约）。核心论点是“通过问责实现对齐”：如果每个代理通过问责链与其人类所有者对齐，那么集体行为将收敛于符合人类意图的方向——无需自上而下的规则。一项预注册实验在公共资源生产经济场景中验证该论点，其中50-1,000个规模的代理共享有限资源池并协作创造价值。

Abstract

Autonomous AI agents are beginning to operate across organizational boundaries on the open internet – discovering, transacting with, and delegating to agents owned by other parties without centralized oversight. When agents from different human principals collaborate at scale, the collective becomes opaque: no single human can observe, audit, or govern the emergent behavior. We term this the Logic Monopoly – the agent society’s unchecked monopoly over the entire logic chain from planning through execution to evaluation. We propose the Separation of Power (SoP) model, a constitutional governance architecture deployed on public blockchain that breaks this monopoly through three structural separations: agents legislate operational rules as smart contracts, deterministic software executes within those contracts, and humans adjudicate through a complete ownership chain binding every agent to a responsible principal. In this architecture, smart contracts are the law itself – the actual legislative output that agents produce and that governs their behavior. We instantiate SoP in AgentCity on an EVM-compatible layer-2 blockchain (L2) with a three-tier contract hierarchy (foundational, meta, and operational). The core thesis is alignment-through-accountability: if each agent is aligned with its human owner through the accountability chain, then the collective converges on behavior aligned with human intent – without top-down rules. A pre-registered experiment evaluates this thesis in a commons production economy – where agents share a finite resource pool and collaboratively produce value – at 50-1,000 agent scale.

Keywords: Autonomous AI agents, Multi-agent systems, Agent governance, Separation of Power, Blockchain smart contracts, Accountability chain, Commons production economy, Logic Monopoly

71. ❌ EmoMAS: Emotion-Aware Multi-Agent System for High-Stakes Edge-Deployable Negotiation with Bayesian Orchestration

Authors: Yunbo Long, Yunhan Liu, Liming Xu Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

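Scoring rationale: EmoMAS is a Bayesian multi-agent system for high-stakes, edge-deployable negotiation. Its core contributions are: (1) it explicitly uses SLMs as the alternative for edge deployment, which makes it highly relevant to 'Small Language Models'; (2) it builds a 'Multi-agent Systems' setup from game-theoretic, reinforcement learning, and psychological coherence models, with a Bayesian orchestrator handling 'Agent Coordination'; (3) it also covers the use of LLMs in negotiation, but the main focus is on SLMs and the multi-agent framework. Other keywords such as MoE, Scaling Laws, and Pre-training are neither mentioned in nor relevant to the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

To address the high computational cost and privacy risks of conventional LLMs in high-stakes edge deployments, the paper proposes EmoMAS, a multi-agent system with Bayesian orchestration that manages emotional decision-making strategically. SLMs and LLMs equipped with EmoMAS outperform all baseline models across four negotiation benchmarks, giving effective, private, and adaptive negotiation AI.

Abstract translation (Chinese)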

大语言模型（LLMs）已广泛应用于自动化谈判，但其高昂的计算成本和隐私风险限制了其在移动助手或救援机器人等隐私敏感、设备端场景中的部署。小语言模型（SLMs）提供了一种可行的替代方案，但在处理高风险谈判中复杂的情感动态方面仍面临挑战。本文提出EmoMAS，一种贝叶斯多智能体框架，将情感决策从反应式转变为策略式。EmoMAS利用贝叶斯协调器来协调三个专用智能体：博弈论模型、强化学习模型和心理一致性模型。该系统融合其实时洞察，以优化情感状态转换，同时根据谈判反馈持续更新智能体可靠性。这种智能体混合架构支持在线策略学习，无需预训练。我们进一步引入了四个高风险、可边缘部署的谈判基准测试，涵盖债务、医疗、应急响应和教育领域。通过在所有基准测试中进行广泛的智能体间模拟，配备EmoMAS的SLMs和LLMs在谈判性能上均持续超越所有基线模型，同时保持了伦理行为的平衡。这些结果表明，策略性情感智能同样是谈判成功的关键驱动因素。通过将情感表达视为贝叶斯多智能体优化框架内的策略变量，EmoMAS为适用于高风险边缘部署的有效、私密且自适应的谈判人工智能确立了新范式。

Abstract

Large language models (LLMs) have been widely used for automated negotiation, but their high computational cost and privacy risks limit deployment in privacy-sensitive, on-device settings such as mobile assistants or rescue robots. Small language models (SLMs) offer a viable alternative, yet struggle with the complex emotional dynamics of high-stakes negotiation. We introduce EmoMAS, a Bayesian multi-agent framework that transforms emotional decision-making from reactive to strategic. EmoMAS leverages a Bayesian orchestrator to coordinate three specialized agents: game-theoretic, reinforcement learning, and psychological coherence models. The system fuses their real-time insights to optimize emotional state transitions while continuously updating agent reliability based on negotiation feedback. This mixture-of-agents architecture enables online strategy learning without pre-training. We further introduce four high-stakes, edge-deployable negotiation benchmarks across debt, healthcare, emergency response, and educational domains. Through extensive agent-to-agent simulations across all benchmarks, both SLMs and LLMs equipped with EmoMAS consistently surpass all baseline models in negotiation performance while balancing ethical behavior. These results show that strategic emotional intelligence is also the key driver of negotiation success. By treating emotional expression as a strategic variable within a Bayesian multi-agent optimization framework, EmoMAS establishes a new paradigm for effective, private, and adaptive negotiation AI suitable for high-stakes edge deployment.

Keywords: EmoMAS, multi-agent system, edge-deployable negotiation, small language models, Bayesian orchestration, emotional decision-making, high-stakes negotiation, agent coordination
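The abstract does not specify how the orchestrator fuses agent outputs or updates reliability. The sketch below is therefore only one plausible reading: each agent's reliability is tracked as the mean of a Beta distribution, and agent proposals over emotional states are mixed in proportion to that reliability. The class `BetaReliabilityOrchestrator`, its methods, and the example numbers are all hypothetical.

```python
import numpy as np

class BetaReliabilityOrchestrator:
    """Hypothetical orchestrator: Beta-Bernoulli reliability per agent."""

    def __init__(self, n_agents):
        self.alpha = np.ones(n_agents)  # pseudo-counts of helpful proposals
        self.beta = np.ones(n_agents)   # pseudo-counts of unhelpful proposals

    def reliability(self):
        # Posterior mean of each agent's Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)

    def fuse(self, proposals):
        """Mix per-agent distributions over emotional states.

        proposals: array of shape (n_agents, n_states), each row sums to 1.
        Returns one distribution of shape (n_states,).
        """
        mixed = self.reliability() @ np.asarray(proposals, dtype=float)
        return mixed / mixed.sum()

    def update(self, agent_idx, success):
        # Negotiation feedback moves one pseudo-count for the given agent.
        if success:
            self.alpha[agent_idx] += 1.0
        else:
            self.beta[agent_idx] += 1.0

# Three agents (game-theoretic, RL, psychological coherence), three states.
orch = BetaReliabilityOrchestrator(3)
proposals = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.3, 0.4]])
before = orch.fuse(proposals)
for _ in range(5):
    orch.update(1, success=True)  # agent 1 keeps receiving positive feedback
after = orch.fuse(proposals)
```

After repeated positive feedback for agent 1, the fused distribution shifts toward the state that agent favors. No pre-training is involved because only the pseudo-counts change online.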

72. ❌ Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

Authors: José Pombal, Ricardo Rei, André F. T. Martins Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06996v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

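Scoring rationale: The paper studies self-preference bias (SPB) when LLMs act as judges, which belongs to LLM evaluation methodology and is highly relevant to 'Large Language Models' (10 points). It touches on model self-improvement and on factuality/truthfulness bias in evaluation, giving it some relevance to 'Self-Correction/Self-Improvement' and 'Hallucination Mitigation/Factuality' (5 points each). Its analysis of SPB on a medical benchmark (HealthBench) counts as an AI application in science/bioinformatics (5 points). Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper studies self-preference bias in rubric-based evaluation of large language models. It finds that even when the evaluation criteria are fully objective, judges are more likely to incorrectly mark an unmet rubric as satisfied when the output is their own. Ensembling multiple judges mitigates the bias but does not fully eliminate it.

Abstract translation (Chinese)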

LLM即评委已成为评估大语言模型输出的实际标准方法。然而，评委已知存在自我偏好偏差：他们倾向于青睐自身或同系列模型生成的输出。这种偏差会扭曲评估结果，从而阻碍模型发展，在递归自我改进的场景中尤为突出。我们首次针对基于量规的评估范式中的自我偏好偏差展开研究，该范式日益流行，其特点是评委依据具体评估标准给出二元判定，而非整体评分或排序。通过使用IFEval（一个具备可编程验证量规的基准测试），我们证明即使评估标准完全客观，自我偏好偏差依然存在：在生成模型未能满足的量规中，当输出源于评委自身时，其错误判定为满足的可能性最高可增加50%。我们还发现，与其他评估范式类似，集成多位评委有助于缓解自我偏好偏差，但无法完全消除。在HealthBench（一个包含主观量规的医疗对话基准测试）中，我们观察到自我偏好偏差可使模型得分偏差高达10分，这在评估前沿模型排名时可能成为决定性差异。我们分析了该场景下驱动自我偏好偏差的因素，发现负面量规、极端长度量规以及涉及急诊转诊等主观主题的评估标准尤其容易受到影响。

Abstract

LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.

Keywords: LLM-as-a-judge, self-preference bias, rubric-based evaluation, IFEval, HealthBench, model evaluation, benchmarking, recursive self-improvement
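One plausible way to quantify the "more likely to incorrectly mark as satisfied" effect is sketched below. It restricts to rubrics that a programmatic checker says the generator failed, then compares the judge's false-pass rate on its own outputs with the rate on other models' outputs. The `Verdict` record, both function names, and the toy data are hypothetical; the paper's exact metric may differ.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    judge: str              # model acting as judge
    generator: str          # model that produced the output
    truly_satisfied: bool   # ground truth from a programmatic checker
    judged_satisfied: bool  # the judge's binary verdict on this rubric

def false_pass_rate(verdicts, judge, own):
    """Share of failed rubrics that `judge` wrongly marks as satisfied.

    own=True restricts to the judge's own outputs, own=False to others'.
    Returns None when there are no matching failed rubrics.
    """
    pool = [v for v in verdicts
            if v.judge == judge
            and not v.truly_satisfied
            and (v.generator == judge) == own]
    if not pool:
        return None
    return sum(v.judged_satisfied for v in pool) / len(pool)

def relative_spb(verdicts, judge):
    """Relative increase in false-pass rate on own vs. others' outputs."""
    own = false_pass_rate(verdicts, judge, own=True)
    other = false_pass_rate(verdicts, judge, own=False)
    if own is None or not other:
        return None  # undefined without data or with a zero baseline
    return own / other - 1.0

# Toy data only: judge "A" passes 2 of 4 failed rubrics on its own outputs
# and 1 of 4 failed rubrics on outputs from model "B".
toy = (
    [Verdict("A", "A", False, True)] * 2 + [Verdict("A", "A", False, False)] * 2 +
    [Verdict("A", "B", False, True)] * 1 + [Verdict("A", "B", False, False)] * 3
)
spb = relative_spb(toy, "A")
```

With this toy data the own-output false-pass rate is 0.5 against 0.25 for other models, so `spb` is 1.0, meaning the judge is twice as likely to wrongly pass its own output. The numbers are illustrative and are not results from the paper.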

73. ❌ What’s Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

Authors: Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen, Wen Zhang Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06995v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

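Scoring rationale: The paper proposes the UI-in-the-Loop paradigm, which uses multimodal large language models for GUI reasoning. Its core involves the application of MLLMs, interpretable reasoning (similar to CoT/System 2), agentic workflows and tool use, and explainable AI. Other keywords such as MoE, quantization, and RAG are not covered.

!!! tip deepseek-chat TL;DR

To address the lack of interpretability and the weak understanding of UI elements in existing GUI reasoning methods, the paper proposes the UI-in-the-Loop paradigm, in which multimodal large language models explicitly learn the localization, semantics, and usage of UI elements. This enables precise element discovery and interpretable reasoning, and the reported experiments show state-of-the-art UI understanding performance.

Abstract translation (Chinese)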

现有图形用户界面（GUI）推理任务仍面临挑战，尤其在用户界面（UI）理解方面。当前方法通常依赖基于屏幕的直接决策，这种方式缺乏可解释性且忽视了对UI元素的全面理解，最终导致任务失败。为增强对用户界面的理解与交互能力，我们提出了一种创新的GUI推理范式，称为“循环式用户界面交互”（UI-in-the-Loop，UILoop）。该方法将GUI推理任务视为一个“屏幕-UI元素-操作”的循环过程。通过使多模态大语言模型（Multimodal Large Language Models, MLLMs）显式学习关键UI元素的定位、语义功能与实际用途，UILoop实现了精确的元素发现并执行可解释的推理。此外，我们提出了一项以UI元素为核心的、更具挑战性的“UI理解任务”，并配套三项评估指标。相应地，我们构建了一个包含2.6万样本的基准数据集（UI Comprehension-Bench），以全面评估现有方法对UI元素的掌握程度。大量实验表明，UILoop在实现最先进UI理解性能的同时，在GUI推理任务中也取得了优异的结果。

Abstract

Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance the understanding and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics. Correspondingly, we contribute a benchmark of 26K samples (UI Comprehension-Bench) to comprehensively evaluate existing methods’ mastery of UI elements. Extensive experiments demonstrate that UILoop achieves state-of-the-art UI understanding performance while yielding superior results in GUI reasoning tasks.

Keywords: Multimodal Large Language Models, GUI Reasoning, UI-in-the-Loop, Interpretable Reasoning, UI Comprehension, Benchmark, State-of-the-art, Screen-to-Action

74. ❌ Stress Estimation in Elderly Oncology Patients Using Visual Wearable Representations and Multi-Instance Learning

Authors: Ioannis Kyprakis, Vasileios Skaramagkas, Georgia Karanasiou, Vasilis Bouratzis, Andri Papakonstantinou, Dimitar Stefanovski, Kalliopi Keramida, Aristofania Simatou, Ketti Mazzocco, Anastasia Constantinidou, Konstantinos Marias, Dimitrios I. Fotiadis, Manolis Tsiknakis Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06990v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

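Scoring rationale: The paper is a biomedical AI application that uses a mixture-of-experts model (Tiny-BioMoE) on wearable-device data to predict psychological stress in elderly cancer patients. It is therefore highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points) and falls under 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). It does not involve large language models, training techniques, reasoning methods, agent systems, or the principles of other deep learning techniques, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study uses wearable-device data and a lightweight mixture-of-experts backbone (Tiny-BioMoE) to predict perceived stress in elderly breast cancer patients. Under leave-one-subject-out evaluation on a multicenter cohort, predictions show moderate agreement with questionnaire scores.

Abstract translation (Chinese)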

心理压力在心脏肿瘤学中具有临床相关性，但通常仅通过患者报告结局指标（PROMs）进行评估，且很少被整合到持续的心脏毒性监测中。我们利用来自智能手表（身体活动与睡眠）和胸前佩戴式心电图传感器的多模态可穿戴数据，在一个老年、多中心的乳腺癌队列（CARDIOCARE）中评估感知压力。可穿戴数据流被转化为异质视觉表征，形成一个弱监督场景，其中单个感知压力量表（PSS）评分对应多个未标记的时间窗口。一个轻量级预训练的专家混合主干网络（Tiny-BioMoE）将每种表征嵌入到192维向量中，这些向量通过基于注意力的多示例学习（MIL）进行聚合，以预测第3个月（M3）和第6个月（M6）的PSS评分。在留一受试者交叉验证（LOSO）评估下，预测结果与问卷评分呈现中等程度的一致性（M3：R^2=0.24，皮尔逊r=0.42，斯皮尔曼rho=0.48；M6：R^2=0.28，皮尔逊r=0.49，斯皮尔曼rho=0.52），全局均方根误差/平均绝对误差在M3时为6.62/6.07，在M6时为6.13/5.54。

Abstract

Psychological stress is clinically relevant in cardio-oncology, yet it is typically assessed only through patient-reported outcome measures (PROMs) and is rarely integrated into continuous cardiotoxicity surveillance. We estimate perceived stress in an elderly, multicenter breast cancer cohort (CARDIOCARE) using multimodal wearable data from a smartwatch (physical activity and sleep) and a chest-worn ECG sensor. Wearable streams are transformed into heterogeneous visual representations, yielding a weakly supervised setting in which a single Perceived Stress Scale (PSS) score corresponds to many unlabeled windows. A lightweight pretrained mixture-of-experts backbone (Tiny-BioMoE) embeds each representation into 192-dimensional vectors, which are aggregated via attention-based multiple instance learning (MIL) to predict PSS at month 3 (M3) and month 6 (M6). Under leave-one-subject-out (LOSO) evaluation, predictions showed moderate agreement with questionnaire scores (M3: R^2=0.24, Pearson r=0.42, Spearman rho=0.48; M6: R^2=0.28, Pearson r=0.49, Spearman rho=0.52), with global RMSE/MAE of 6.62/6.07 at M3 and 6.13/5.54 at M6.

Keywords: psychological stress, wearable data, mixture-of-experts, bioinformatics, multimodal analysis, cancer patients, machine learning, healthcare AI
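The aggregation step in the abstract, where many 192-dimensional window embeddings share one PSS label, can be sketched with the standard attention-pooling form of multiple instance learning. This is a minimal NumPy illustration under assumptions: the attention hidden size of 32, the random weights, and the function name `attention_mil_pool` are all hypothetical, and the paper's exact attention variant is not given in the abstract.

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Pool a bag of instance embeddings into one bag-level vector.

    H: (n_instances, d) instance embeddings
    V: (d, k) projection into the attention hidden space
    w: (k,) attention scoring vector
    Returns (bag_embedding of shape (d,), attention weights of shape (n,)).
    """
    scores = np.tanh(H @ V) @ w          # one scalar score per instance
    scores = scores - scores.max()       # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ H, weights          # attention-weighted average

rng = np.random.default_rng(0)
H = rng.normal(size=(7, 192))            # 7 unlabeled windows from one patient
V = rng.normal(size=(192, 32)) * 0.1     # hypothetical, untrained parameters
w = rng.normal(size=32)
bag, attn = attention_mil_pool(H, V, w)
```

The pooled vector `bag` would then feed a regression head that predicts the single PSS score for the bag. The weights in `attn` indicate which windows the model attends to.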

75. ❌ Generative Photomosaic with Structure-Aligned and Personalized Diffusion

Authors: Jaeyoung Chung, Hyunjin Son, Kyoung Mu Lee Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06989v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

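Scoring rationale: The paper focuses on generative image synthesis in computer vision, specifically a diffusion-based method for generating photomosaics. Although it involves deep learning (diffusion models), all of the keywords explicitly target large language models (LLMs) and their related techniques, applications, and optimization methods. The paper does not involve language models, natural language processing, the technical principles of large models, or AI applications in science, so the relevance is 0 for every keyword.

!!! tip deepseek-chat TL;DR

The paper presents the first generative approach to photomosaic creation. Tile images are synthesized with diffusion-based generation, which overcomes the limits on diversity and structural consistency of traditional color-matching methods.

Abstract translation (Chinese)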

我们首次提出了生成式照片马赛克创建方法。传统照片马赛克技术依赖大量图块图像和基于颜色的匹配，这限制了作品的多样性与结构一致性。我们的生成式照片马赛克框架采用基于扩散模型的生成技术，以参考图像为条件合成图块图像。低频条件扩散机制在保持提示驱动细节的同时对齐全局结构。这种生成式框架使得照片马赛克作品既能实现语义表达性，又能保持结构连贯性，有效克服了基于匹配方法的根本局限。通过利用少样本个性化扩散技术，我们的模型能够生成用户定制化或风格统一的图块，而无需依赖大规模图像集。

Abstract

We present the first generative approach to photomosaic creation. Traditional photomosaic methods rely on a large number of tile images and color-based matching, which limits both diversity and structural consistency. Our generative photomosaic framework synthesizes tile images using diffusion-based generation conditioned on reference images. A low-frequency conditioned diffusion mechanism aligns global structure while preserving prompt-driven details. This generative formulation enables photomosaic composition that is both semantically expressive and structurally coherent, effectively overcoming the fundamental limitations of matching-based approaches. By leveraging few-shot personalized diffusion, our model is able to produce user-specific or stylistically consistent tiles without requiring an extensive collection of images.

Keywords: generative photomosaic, diffusion-based generation, structure-aligned, personalized diffusion, few-shot learning, image synthesis, computer vision, generative models

76. ❌ CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models

Authors: Renyang Liu, Jiale Li, Jie Zhang, Cong Wu, Xiaojun Jia, Shuxin Li, Wei Zhou, Kwok-Yan Lam, See-kiong Ng Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

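Scoring rationale: The paper studies adversarial patch attacks (CAAP) on palmprint recognition models, which belongs to computer vision and biometric security. All scoring keywords revolve around large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper does not involve language models, text generation, or related techniques. It focuses on adversarial attacks against image recognition models and has no direct connection to the LLM technology stack, so the relevance is 0 for every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes CAAP, a capture-aware adversarial patch attack framework for palmprint recognition models. Experiments show that deep palmprint recognition systems remain substantially vulnerable to physically realizable attacks even after adversarial training.

Abstract translation (Chinese)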

掌纹识别因其非接触式采集方式及高度可区分性的脊线与皱褶纹理，被广泛应用于门禁控制、掌上支付等高安全性场景。然而，深度学习掌纹识别系统对于物理可实现攻击的鲁棒性尚未得到充分研究。现有工作大多局限于数字环境，未能充分考虑掌纹识别以纹理为主导的特性，也未充分处理物理采集过程中引入的失真问题。为填补这一空白，本文提出CAAP（Capture-Aware Adversarial Patch）——一种面向掌纹识别的采集感知对抗性补丁框架。CAAP学习一种可跨输入复用的通用补丁，并在实际采集变化下保持攻击有效性。为匹配掌纹的结构特征，该框架采用十字形补丁拓扑结构，在固定像素预算下扩大空间覆盖范围，更有效地破坏长程纹理连续性。CAAP进一步整合了三个模块：ASIT用于输入条件化的补丁渲染，RaS用于随机化采集感知模拟，以及MS-DIFE用于特征层面的身份干扰引导。我们在同济、IITD和AISEC数据集上，针对通用CNN骨干网络及专用掌纹识别模型评估了CAAP的性能。实验表明，CAAP在无目标攻击和目标攻击中均表现出强大的攻击效果，并具有良好的跨模型与跨数据集可迁移性。结果进一步显示，尽管对抗训练可部分降低攻击成功率，系统仍存在显著的残余脆弱性。这些发现表明，深度学习掌纹识别系统在面对物理可实现、采集感知的对抗性补丁攻击时依然脆弱，凸显了实践中开发更有效防御机制的必要性。代码发布于https://github.com/ryliu68/CAAP。

Abstract

Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital setting and do not adequately account for the texture-dominant nature of palmprint recognition or the distortions introduced during physical acquisition. To address this gap, we propose CAAP, a capture-aware adversarial patch framework for palmprint recognition. CAAP learns a universal patch that can be reused across inputs while remaining effective under realistic acquisition variation. To match the structural characteristics of palmprints, the framework adopts a cross-shaped patch topology, which enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long-range texture continuity. CAAP further integrates three modules: ASIT for input-conditioned patch rendering, RaS for stochastic capture-aware simulation, and MS-DIFE for feature-level identity-disruptive guidance. We evaluate CAAP on the Tongji, IITD, and AISEC datasets against generic CNN backbones and palmprint-specific recognition models. Experiments show that CAAP achieves strong untargeted and targeted attack performance with favorable cross-model and cross-dataset transferability. The results further show that, although adversarial training can partially reduce the attack success rate, substantial residual vulnerability remains. These findings indicate that deep palmprint recognition systems remain vulnerable to physically realizable, capture-aware adversarial patch attacks, underscoring the need for more effective defenses in practice. Code available at https://github.com/ryliu68/CAAP.

Keywords: palmprint recognition, adversarial patch attacks, capture-aware, physical attacks, robustness, deep learning, biometric security, transferability

77. ❌ Frailty Estimation in Elderly Oncology Patients Using Multimodal Wearable Data and Multi-Instance Learning

Authors: Ioannis Kyprakis, Vasileios Skaramagkas, Georgia Karanasiou, Lampros Lakkas, Andri Papakonstantinou, Domen Ribnikar, Kalliopi Keramida, Dorothea Tsekoura, Ketti Mazzocco, Anastasia Constantinidou, Konstantinos Marias, Dimitrios I. Fotiadis, Manolis Tsiknakis Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06985v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

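Scoring rationale: The paper focuses on estimating frailty in elderly cancer patients from multimodal wearable data with attention-based multiple instance learning (MIL), which places it in medical AI applications. It does not address large language models (LLMs), the technical principles of deep learning, or any of the related keywords (MoE, SFT, RLHF, RAG, and so on), so every keyword except "AI for Science OR Bioinformatics OR Cheminformatics" is rated 0. That keyword is rated 5 because the paper applies AI to biomedicine (Bioinformatics), but this is not its core innovation and it makes no direct reference to large models or deep learning principles.

!!! tip deepseek-chat TL;DR

The study proposes a framework based on multimodal wearable data and attention-based multiple instance learning to assess frailty-related functional change in elderly breast cancer patients. Smartwatch activity and sleep data are the most predictive of change, while heart rate variability provides complementary information.

Abstract (Chinese translation)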

衰弱与功能衰退显著影响老年癌症患者的治疗耐受性与临床结局，但现有评估通常局限于不频繁的门诊随访。本研究提出一种多模态可穿戴框架，旨在评估参与多中心CARDIOCARE研究的老年乳腺癌患者在随访间期与衰弱相关的功能变化。该框架整合了自由生活状态下智能手表采集的体力活动与睡眠特征，以及胸带采集的心电图衍生心率变异性（HRV, heart rate variability）特征，并将其组织为与第3个月（M3）和第6个月（M6）随访时间点对齐的患者时间窗数据包。我们的创新在于采用一种基于注意力的多示例学习（MIL, multiple instance learning）架构，该架构能够在真实世界数据缺失和弱监督条件下，融合不规则的多模态可穿戴数据示例。该模型采用嵌入维度为128的模态特异性多层感知机（MLP, multilayer perceptron）编码器，聚合可变长度且部分缺失的纵向数据示例，以预测FACIT-F量表和握力相对于基线的离散化变化类别（恶化、稳定、改善）。在受试者独立的留一受试者交叉验证（LOSO, leave-one-subject-out）评估下，完整多模态模型对于握力预测在M3时间点的平衡准确率/F1分数为0.68 +/- 0.08/0.67 +/- 0.09，在M6时间点为0.70 +/- 0.10/0.69 +/- 0.08；对于FACIT-F预测在M3时间点为0.59 +/- 0.04/0.58 +/- 0.06，在M6时间点为0.64 +/- 0.05/0.63 +/- 0.07。消融实验结果表明，智能手表的活动和睡眠数据为衰弱相关功能变化提供了最强的预测信息，而HRV在与智能手表数据流融合时能提供补充信息。

Abstract

Frailty and functional decline strongly influence treatment tolerance and outcomes in older patients with cancer, yet assessment is typically limited to infrequent clinic visits. We propose a multimodal wearable framework to estimate frailty-related functional change between visits in elderly breast cancer patients enrolled in the multicenter CARDIOCARE study. Free-living smartwatch physical activity and sleep features are combined with ECG-derived heart rate variability (HRV) features from a chest strap and organized into patient-horizon bags aligned to month 3 (M3) and month 6 (M6) follow-ups. Our innovation is an attention-based multiple instance learning (MIL) formulation that fuses irregular, multimodal wearable instances under real-world missingness and weak supervision. An attention-based MIL model with modality-specific multilayer perceptron (MLP) encoders with embedding dimension 128 aggregates variable-length and partially missing longitudinal instances to predict discretized change-from-baseline classes (worsened, stable, improved) for FACIT-F and handgrip strength. Under subject-independent leave-one-subject-out (LOSO) evaluation, the full multimodal model achieved balanced accuracy/F1 of 0.68 +/- 0.08/0.67 +/- 0.09 at M3 and 0.70 +/- 0.10/0.69 +/- 0.08 at M6 for handgrip, and 0.59 +/- 0.04/0.58 +/- 0.06 at M3 and 0.64 +/- 0.05/0.63 +/- 0.07 at M6 for FACIT-F. Ablation results indicated that smartwatch activity and sleep provide the strongest predictive information for frailty-related functional changes, while HRV contributes complementary information when fused with smartwatch streams.

Keywords: frailty estimation, wearable data, multimodal framework, multiple instance learning, elderly oncology, functional change, smartwatch, heart rate variability
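
The abstract above describes an attention-based MIL model that aggregates a variable-length bag of instance embeddings into one bag-level representation. A minimal sketch of one common attention-pooling formulation follows, where each instance gets a weight from a softmax over `w · tanh(V h)`. This is an assumption about the general technique, not the authors' implementation; the parameter names `V` and `w` and the toy 2-dimensional embeddings are hypothetical (the abstract reports an embedding dimension of 128).

```python
import math


def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings with softmax attention weights.

    instances: list of embedding vectors (the bag may have any length >= 1).
    V: projection matrix as a list of rows (hidden_dim x embed_dim).
    w: scoring vector of length hidden_dim.
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(a * z for a, z in zip(w, hidden)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(instances[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return pooled, weights


# Toy bag with three instances; missing instances are simply absent from the list.
bag = [[0.2, 1.0], [0.9, 0.1], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
pooled, weights = attention_mil_pool(bag, V, w)
print(pooled, weights)
```

Because the weights are computed per bag, the same function handles bags of different lengths, which is what makes this kind of pooling convenient for irregular wearable data.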

78. ❌ A First Guess is Rarely the Final Answer: Learning to Search in the Travelling Salesperson Problem

Authors: Andoni Irazusta Garmendia Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

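Scoring rationale: The paper studies improvement methods for neural solvers of the Travelling Salesperson Problem (TSP) and proposes the NICO-TSP framework, which learns a local search policy through imitation learning and reinforcement learning. Its core is the combination of combinatorial optimization and neural networks, focused on the specific problem of TSP; it does not involve large language models, innovations in deep learning principles, or applications of large models to other domains. All scoring keywords concern large language models, deep learning principles, or large-model applications, whereas this paper applies conventional neural networks to combinatorial optimization, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how to improve neural solvers for the Travelling Salesperson Problem by learning a local search policy. It proposes the NICO-TSP framework, whose two-stage training yields stronger and more efficient improvement than existing methods.

Abstract (Chinese translation)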

大多数针对旅行商问题（TSP）的神经求解器被训练为输出单一解，然而实际应用者很少止步于此：在测试阶段，他们通常会投入额外计算资源进行采样或事后搜索。这引发了一个自然的问题：搜索过程本身能否被学习？神经改进方法正是基于这一视角，通过学习一个策略来对候选解施加局部修改，并在改进轨迹中累积收益。然而，针对TSP的学习式改进方法仍相对不成熟，现有方法在鲁棒性和可扩展性能方面仍有不足。我们认为一个关键原因在于设计不匹配：许多方法沿用了来自单解求解器的状态表示、架构选择和训练方案，而非围绕局部搜索的机制进行构建。这种不匹配促使我们提出了NICO-TSP（组合优化的神经改进框架）：一种针对TSP的2-opt改进框架。NICO-TSP使用恰好$n$个与邻域算子对齐的边标记来表示当前路径，无需路径位置编码即可直接评估2-opt移动，并通过两阶段流程进行训练：首先通过模仿学习学习短视界最优轨迹，随后采用无评论家、基于分组的强化学习在更长回合中进行训练。在计算匹配的评估中（以搜索步数和实际运行时间为函数衡量改进效果），NICO-TSP相比先前学习式和启发式搜索基线，始终展现出更强且显著更高效的改进性能，对更大规模分布外实例的泛化能力远更可靠，既能作为经典局部搜索的竞争性替代方案，也可作为构造性求解器强大的测试时优化模块。

Abstract

Most neural solvers for the Traveling Salesperson Problem (TSP) are trained to output a single solution, even though practitioners rarely stop there: at test time, they routinely spend extra compute on sampling or post-hoc search. This raises a natural question: can the search procedure itself be learned? Neural improvement methods take this perspective by learning a policy that applies local modifications to a candidate solution, accumulating gains over an improvement trajectory. Yet learned improvement for TSP remains comparatively immature, with existing methods still falling short of robust, scalable performance. We argue that a key reason is design mismatch: many approaches reuse state representations, architectural choices, and training recipes inherited from single-solution methods, rather than being built around the mechanics of local search. This mismatch motivates NICO-TSP (Neural Improvement for Combinatorial Optimization): a 2-opt improvement framework for TSP. NICO-TSP represents the current tour with exactly $n$ edge tokens aligned with the neighborhood operator, scores 2-opt moves directly without tour positional encodings, and trains via a two-stage procedure: imitation learning to short-horizon optimal trajectories, followed by critic-free group-based reinforcement learning over longer rollouts. Under compute-matched evaluations that measure improvement as a function of both search steps and wall-clock time, NICO-TSP delivers consistently stronger and markedly more step-efficient improvement than prior learned and heuristic search baselines, generalizes far more reliably to larger out-of-distribution instances, and serves both as a competitive replacement for classical local search and as a powerful test-time refinement module for constructive solvers.

Keywords: Traveling Salesperson Problem, Neural Improvement, Local Search, 2-opt, Imitation Learning, Reinforcement Learning, Combinatorial Optimization, NICO-TSP
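
The abstract above builds on the classical 2-opt neighborhood, in which a move removes two tour edges and reconnects the tour by reversing the segment between them. The sketch below shows a plain exhaustive 2-opt step for reference; it is the classical local search operator, not the learned NICO-TSP policy, and the four-city instance is a hypothetical example.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


def two_opt_move(tour, i, j):
    """Replace edges (tour[i], tour[i+1]) and (tour[j], tour[j+1]) by reversing tour[i+1..j]."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]


def best_two_opt_step(points, tour):
    """Evaluate every 2-opt move once and return the best resulting tour."""
    best_tour, best_len = tour, tour_length(points, tour)
    n = len(tour)
    for i in range(n - 1):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # these two edges share a city, so the move changes nothing
            candidate = two_opt_move(tour, i, j)
            candidate_len = tour_length(points, candidate)
            if candidate_len < best_len:
                best_tour, best_len = candidate, candidate_len
    return best_tour, best_len


# Hypothetical instance: corners of a unit square visited in a self-crossing order.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
improved_tour, improved_len = best_two_opt_step(points, [0, 1, 2, 3])
print(improved_tour, improved_len)
```

A learned improvement method would replace the exhaustive loop with a policy that scores candidate `(i, j)` pairs and picks one per step.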

Authors: Milad Moradi, Ke Yan, David Colwell, Matthias Samwald, Rhona Asgari Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06934v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

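Scoring rationale: The paper proposes a multimodal UI control detection method that combines GPT-generated textual descriptions with visual features through cross-attention modules. Because it directly uses GPT (a large language model) to generate the descriptions, it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (8). The core of the paper is applying an existing large model (GPT) to strengthen a computer vision task (UI detection); this is an application of large models in another domain and involves no innovation in large-model technology itself (MoE, scaling laws, training methods, inference optimization, agent systems, and so on). The other keywords concern large-model principles, training methods, inference optimization, or specific application areas (such as AI for Science), none of which the paper addresses, so they are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multimodal method that combines GPT-generated textual descriptions with YOLOv5 visual features. Through cross-attention modules it significantly improves the accuracy and robustness of user interface control detection, with the clearest gains on semantically complex or visually ambiguous classes.

Abstract (Chinese translation)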

从软件截图中检测用户界面（UI）控件是自动化测试、无障碍访问和软件分析中的关键任务，但由于视觉模糊性、设计多样性以及纯像素方法缺乏上下文线索，该任务仍具挑战性。本文提出一种新颖的多模态YOLOv5扩展模型，通过交叉注意力模块将GPT生成的UI图像文本描述集成到检测流程中。通过将视觉特征与文本嵌入提取的语义信息对齐，我们的模型实现了更鲁棒且具备上下文感知能力的UI控件检测。我们在一个包含超过16,000张标注UI截图、涵盖23个控件类别的大型数据集上评估了所提出的框架。大量实验比较了三种融合策略（即逐元素相加、加权求和与卷积融合），结果表明所有策略均较基线YOLOv5模型取得持续改进。其中，卷积融合策略表现最佳，在检测语义复杂或视觉模糊的类别时取得显著提升。这些结果证实，结合视觉与文本模态能大幅增强UI元素检测能力，尤其在仅凭视觉信息不足的边缘案例中效果突出。我们的研究为软件测试、无障碍支持及UI分析领域开发更可靠、智能的工具开辟了前景，并为未来构建高效、鲁棒且可泛化的多模态检测系统研究奠定了基础。

Abstract

Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.

Keywords: multi-modal detection, user interface control detection, cross-attention, GPT-generated textual descriptions, YOLOv5, visual-textual fusion, UI screenshots, automated testing
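
The abstract above compares three fusion strategies: element-wise addition, weighted sum, and convolutional fusion. The sketch below shows one plausible reading of each at a single spatial position. It assumes that the visual and text features already have the same channel count and that "convolutional fusion" is a 1×1 convolution over the concatenated channels; neither detail is confirmed by the abstract, and all function names are hypothetical.

```python
def add_fusion(visual, text):
    """Element-wise addition of two equal-length feature vectors."""
    return [v + t for v, t in zip(visual, text)]


def weighted_sum_fusion(visual, text, alpha):
    """Convex combination; alpha would be learned in a real model."""
    return [alpha * v + (1.0 - alpha) * t for v, t in zip(visual, text)]


def conv1x1_fusion(visual, text, kernel):
    """1x1 convolution over concatenated channels at one spatial position.

    At a single position this reduces to a matrix-vector product, where each
    kernel row holds the weights of one output channel.
    """
    stacked = visual + text
    return [sum(k * x for k, x in zip(row, stacked)) for row in kernel]


visual = [1.0, 2.0]
text = [3.0, 4.0]
added = add_fusion(visual, text)
blended = weighted_sum_fusion(visual, text, alpha=0.5)
# This particular kernel reproduces addition; a trained kernel can mix channels freely.
convolved = conv1x1_fusion(visual, text, [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
print(added, blended, convolved)
```

Under this reading, the convolutional variant contains the other two as special cases, which is consistent with, though not proof of, the abstract's report that it performed best.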

80. ❌ FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Authors: Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06916v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

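Scoring rationale: The paper studies reinforcement-learning-based post-training alignment for diffusion models. Its core involves RL alignment (related to RLHF/DPO), post-training (related to SFT), alignment techniques, and quantization (FP4/BF16) to accelerate training. It is therefore highly relevant (10) to "Post-training OR Supervised Fine-tuning OR SFT", "Instruction Tuning OR Alignment OR Value Alignment", "RLHF OR RLAIF OR Direct Preference Optimization OR DPO", and "Quantization OR Model Compression OR Low-bit Weights". The other keywords, such as large language models, MoE, small models, RAG, inference acceleration, and AI for Science, are not addressed in the paper and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes Sol-RL, a two-stage reinforcement learning framework that uses FP4 quantization to accelerate RL alignment of diffusion models, achieving up to 4.64× faster training convergence and better alignment performance while preserving training integrity.

Abstract (Chinese translation)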

基于强化学习的后训练技术近期已成为将文本到图像扩散模型与人类偏好对齐的重要范式。现有研究表明，增大生成样本组规模能带来显著的性能提升，这预示着模型对齐效果仍有巨大优化空间。然而，在大规模基础扩散模型（如FLUX.1-12B）上扩展生成过程会带来沉重的计算负担。为缓解这一瓶颈，我们探索将FP4量化技术整合至扩散模型强化学习的生成流程中。但研究发现，简单的量化流程会固有地引入性能下降风险。为破解效率与训练完整性之间的两难困境，我们提出Sol-RL（光速强化学习）——一种创新的FP4赋能双阶段强化学习框架。首先，我们利用高吞吐量的NVFP4生成流程构建海量候选样本池，并提取具有高度对比性的子集；其次，以BF16精度重新生成这些精选样本，并仅基于高精度样本进行策略优化。通过将候选探索与策略优化解耦，Sol-RL实现了算法层面的生成扩展机制与系统层NVFP4吞吐增益的深度融合。这种算法-硬件协同设计在加速生成阶段的同时，为优化过程保留了高保真样本。实验证明，我们的框架在保持BF16精度训练完整性的同时，充分释放了FP4运算的吞吐优势。在SANA、FLUX.1和SD3.5-L模型上的大量实验表明，该方法在多项指标上均实现了更优的对齐性能，同时将训练收敛速度提升最高达$4.64\times$，以极低成本解锁了大规模生成扩展的潜力。

Abstract

Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.

Keywords: Diffusion Reinforcement Learning, Post-training Alignment, FP4 Quantization, Rollout Scaling, Two-stage Reinforcement Learning, Training Acceleration, Human Preference Alignment, Model Efficiency
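
The abstract above describes two stages: cheap low-precision rollouts produce a large candidate pool from which a highly contrastive subset is extracted, and only that subset is regenerated at high precision for the policy update. The sketch below illustrates the selection step alone. It assumes that "highly contrastive" means keeping the lowest- and highest-reward candidates, which the abstract does not specify; the seeds and reward values are hypothetical.

```python
def select_contrastive_subset(rewards, k):
    """Return indices of the k lowest-reward and k highest-reward candidates."""
    if 2 * k > len(rewards):
        raise ValueError("need at least 2 * k candidates")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k], order[-k:]


# Stage 1 (hypothetical numbers): rewards scored on low-precision rollouts.
seeds = [11, 12, 13, 14, 15, 16]
low_precision_rewards = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2]
worst, best = select_contrastive_subset(low_precision_rewards, k=2)

# Stage 2: only these seeds would be regenerated at high precision
# and used for the policy update.
seeds_to_regenerate = [seeds[i] for i in worst + best]
print(worst, best, seeds_to_regenerate)
```

Keeping the seed of each candidate is what allows the selected samples to be regenerated rather than reused, so the optimizer never trains on low-precision outputs.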

81. ❌ Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Authors: Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

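Scoring rationale: The paper focuses on efficient inference for multimodal large language models (MLLMs). It is highly relevant to "Large Language Models" (10) because MLLMs extend LLMs, and to "Speculative Decoding OR Inference Acceleration" (10) because its core contribution is faster inference (2.52× to 4.39× speedup). Other keywords such as MoE, SFT, RAG, and CoT are not mentioned in the abstract and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes Q-Zoom, a framework whose query-aware adaptive perception mechanism addresses the inference bottleneck caused by high-resolution inputs in multimodal large language models, substantially accelerating inference while matching or improving accuracy.

Abstract (Chinese translation)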

多模态大语言模型（MLLMs）在文档理解、密集场景感知等细粒度任务中需要高分辨率视觉输入。然而，当前全局分辨率缩放范式不加区分地向二次复杂度的自注意力机制中灌入大量视觉冗余令牌，严重制约了推理吞吐量，同时忽略了空间稀疏性和查询意图。为克服此问题，我们提出Q-Zoom，一种查询感知的自适应高分辨率感知框架，以高效的由粗到细方式运行。首先，一个轻量级动态门控网络在粗略全局特征足够时，安全地绕过高分辨率处理。其次，对于需要细粒度感知的查询，一个自蒸馏区域提议网络（SD-RPN）直接从中间特征空间中精确定位任务相关的感兴趣区域（RoI）。为高效优化这些模块，门控网络采用一致性感知的生成策略来推导确定性路由标签，而SD-RPN则采用完全自监督的蒸馏范式。通过连续的时空对齐方案和针对性微调，密集的局部RoI与粗略的全局布局得以无缝融合。大量实验表明，Q-Zoom建立了显著的帕累托前沿。以Qwen2.5-VL-7B为主要测试平台，Q-Zoom在文档与OCR基准上推理速度提升2.52倍，在高分辨率场景下提升4.39倍，同时达到基线模型的峰值准确率。此外，当配置为追求最大感知保真度时，Q-Zoom在上述相应基准上的峰值性能分别超越基线1.1%和8.1%。这些稳健的改进可无缝迁移至Qwen3-VL、LLaVA以及新兴的基于强化学习的图像思维模型。项目页面详见 https://yuhengsss.github.io/Q-Zoom/。

Abstract

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline’s peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline’s peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.

Keywords: Multimodal Large Language Models, Query-aware adaptive perception, Inference acceleration, High-resolution visual inputs, Region-of-Interest localization, Self-supervised distillation, Pareto frontier, Efficient coarse-to-fine processing
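
The abstract above argues that global resolution scaling floods quadratic self-attention with redundant visual tokens, and that routing between a coarse view and a localized RoI avoids this. The sketch below is only back-of-the-envelope token arithmetic for that argument. The patch size and image resolutions are hypothetical, and the ratio covers the quadratic attention term alone, so it does not predict the end-to-end speedups reported in the abstract.

```python
def visual_tokens(width, height, patch):
    """Number of patch tokens for an image whose sides are multiples of patch."""
    return (width // patch) * (height // patch)


def routed_tokens(coarse_suffices, coarse, roi):
    """Tokens processed after gating: coarse only, or coarse plus the RoI crop."""
    return coarse if coarse_suffices else coarse + roi


PATCH = 14  # hypothetical patch size
full = visual_tokens(1792, 1792, PATCH)    # global high-resolution input
coarse = visual_tokens(448, 448, PATCH)    # downsampled global view
roi = visual_tokens(448, 448, PATCH)       # high-resolution crop of the RoI

easy_path = routed_tokens(True, coarse, roi)
fine_path = routed_tokens(False, coarse, roi)

# Self-attention cost grows with the square of the token count.
cost_ratio = (full / fine_path) ** 2
print(full, easy_path, fine_path, cost_ratio)
```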

82. ❌ The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era

Authors: Rudra Jadhav, Janhavi Danve Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06906v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

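Scoring rationale: The paper's core is an assessment of the automation potential of LLMs in the labor market, which directly involves analyzing LLM applications, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10). The other keywords concern large-model principles, training methods, inference optimization, or specific application areas (such as AI for Science), whereas this paper focuses on evaluating and benchmarking the broad societal impact of LLMs and does not touch those technical details or domain applications, so they are all rated 0.

!!! tip deepseek-chat TL;DR

The study builds the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs on 263 text-based tasks to assess the automation potential of 35 occupational skills. It finds that mathematics and programming have the highest automation feasibility, that 78.7% of AI interactions are augmentation rather than replacement, and that automation feasibility depends more on the skill itself than on the model.

Abstract (Chinese translation)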

随着大型语言模型重塑全球劳动力市场，政策制定者和劳动者需要关于哪些职业技能可能最易受自动化影响的实证数据。我们提出了技能自动化可行性指数（SAFI），以美国劳工部O*NET分类体系中的全部35项技能所涵盖的263项基于文本的任务为基准，评估了四个前沿大型语言模型——LLaMA 3.3 70B、Mistral Large、Qwen 2.5 72B和Gemini 2.5 Flash（总计1,052次模型调用，失败率为0%）。通过交叉参考Anthropic经济指数中真实世界的人工智能应用数据（涵盖756个职业，17,998项任务），我们提出了一个“人工智能影响矩阵”——一个解释性框架，将技能定位在四个象限中：高替代风险、需技能提升、人工智能增强和低替代风险。主要发现包括：（1）数学（SAFI：73.2）和编程（71.8）获得最高的自动化可行性评分；积极倾听（42.2）和阅读理解（45.5）得分最低；（2）存在一种“能力-需求倒置”现象，即在受人工智能影响的职业中最急需的技能，恰恰是大型语言模型在我们的基准测试中表现最差的；（3）观察到的人工智能交互中，78.7%属于增强性质，而非自动化替代；（4）所有四个模型得出的技能评估结果趋于一致（分差仅3.6点），这表明基于文本的自动化可行性可能更取决于技能本身而非具体模型。SAFI衡量的是大型语言模型在基于文本的技能表征上的表现，而非完整的职业执行。所有数据、代码和模型响应均已开源。

Abstract

As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs – LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash – across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor’s O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix – an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a “capability-demand inversion” where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.

Keywords: Large Language Models, skill automation, benchmarking, labor market impact, AI augmentation, occupational skills, SAFI, AI Impact Matrix

83. ❌ XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI

Authors: N. D. Tantaroudas, A. J. McCracken, I. Karachalios, E. Papatheou, V. Pastrikakis Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06901v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

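Scoring rationale: The paper studies the integration of Extended Reality (XR) with several AI modules (automatic speech recognition, neural machine translation, a Langchain conversational assistant, a BLIP vision-language model, and AWS Polly text-to-speech) in a career guidance platform. It is applied research rather than innovation in large-model principles. All scoring keywords focus on large-model technology itself (architecture, training, inference, alignment, compression, and so on) or on applications in specific scientific domains, whereas this paper only uses off-the-shelf AI modules as tools and does not engage with the core technical content of those keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The study developed XR-CareerAssist, an immersive career guidance platform that combines Extended Reality with several AI modules. By integrating speech recognition, machine translation, a conversational assistant, a vision-language model, and text-to-speech, it delivers a multilingual, personalised career development tool, and a pilot evaluation reported high ratings for user satisfaction and system responsiveness.

Abstract (Chinese translation)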

传统的职业指导平台依赖静态的、以文本驱动的界面，难以吸引用户或提供个性化的、基于证据的见解。尽管计算机辅助职业指导系统自20世纪60年代以来不断发展，但其交互性仍然有限，且很少关注职业发展的叙事维度。我们推出XR-CareerAssist平台，该平台将扩展现实与多个人工智能模块相结合，提供沉浸式、多语言的职业指导。系统集成了用于语音驱动交互的自动语音识别技术，支持英语、希腊语、法语和意大利语的神经机器翻译，基于Langchain的对话式培训助手用于个性化对话，基于BLIP的视觉语言模型用于职业可视化，以及通过交互式3D虚拟形象实现的AWS Polly文本转语音功能。职业发展路径以动态桑基图形式呈现，数据源自包含超过10万个匿名职业档案的数据库。该应用基于Unity开发，适配Meta Quest 3设备，后端服务托管于AWS平台。在埃克塞特大学开展的试点评估中，23名参与者的测试结果显示：语音识别准确率达95.6%，整体用户满意度为78.3%，系统响应性获得91.3%的积极评价。用户反馈为后续在运动舒适度、音频清晰度和文本可读性方面的改进提供了依据。XR-CareerAssist证明了XR与AI技术的融合能够打造更具吸引力、普惠性和有效性的职业发展工具——通过将五个人工智能模块整合于单一沉浸式环境，该系统创造了区别于现有职业指导平台的多模态交互体验。

Abstract

Conventional career guidance platforms rely on static, text-driven interfaces that struggle to engage users or deliver personalised, evidence-based insights. Although Computer-Assisted Career Guidance Systems have evolved since the 1960s, they remain limited in interactivity and pay little attention to the narrative dimensions of career development. We introduce XR-CareerAssist, a platform that unifies Extended Reality (XR) with several Artificial Intelligence (AI) modules to deliver immersive, multilingual career guidance. The system integrates Automatic Speech Recognition for voice-driven interaction, Neural Machine Translation across English, Greek, French, and Italian, a Langchain-based conversational Training Assistant for personalised dialogue, a BLIP-based Vision-Language model for career visualisations, and AWS Polly Text-to-Speech delivered through an interactive 3D avatar. Career trajectories are rendered as dynamic Sankey diagrams derived from a repository of more than 100,000 anonymised professional profiles. The application was built in Unity for Meta Quest 3, with backend services hosted on AWS. A pilot evaluation at the University of Exeter with 23 participants returned 95.6% speech recognition accuracy, 78.3% overall user satisfaction, and 91.3% favourable ratings for system responsiveness, with feedback informing subsequent improvements to motion comfort, audio clarity, and text legibility. XR-CareerAssist demonstrates how the fusion of XR and AI can produce more engaging, accessible, and effective career development tools, with the integration of five AI modules within a single immersive environment yielding a multimodal interaction experience that distinguishes it from existing career guidance platforms.

Keywords: Extended Reality, Career Guidance, Multimodal AI, Immersive Platform, Personalised Dialogue, Vision-Language Model, Automatic Speech Recognition, Neural Machine Translation

84. ❌ Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Authors: Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06871v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

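Scoring rationale: The paper studies the efficiency of Large Speech Language Models (LSLMs) and proposes Affinity Pooling to compress speech representations. Core relevant keywords: 1) 'Large Language Models' (10): the paper explicitly studies large speech language models, which is its core content; 2) 'Quantization OR Model Compression' (8): the proposed token-merging mechanism compresses representations, which falls under model compression; 3) 'Speculative Decoding OR Inference Acceleration' (8): the paper reduces FLOPs and speeds up inference, which counts as inference acceleration. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper finds structured redundancy in large speech language models and proposes Affinity Pooling, a training-free method that compresses speech representations, improving inference efficiency while preserving semantic accuracy.

Abstract (Chinese translation)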

大型语音语言模型（Large Speech Language Models, LSLMs）通常以较高的标记速率（标记/秒）运行以确保声学保真度，但这导致序列长度远超底层语义内容所需，产生了过高的推理成本。本文通过实证研究重新审视了这种细粒度标记级处理的必要性。通过逐层理想干预实验，我们揭示了一种结构化的冗余层级：浅层编码了必要的声学细节，而深层则表现出极高的冗余度，允许进行大幅压缩。基于这些发现，我们提出了亲和池化（Affinity Pooling），一种无需训练、基于相似性的标记合并机制。通过在输入层和深层策略性地应用该方法，我们能够有效压缩语音表征而不损失语义信息。在三个任务上的广泛评估表明，我们的方法在保持竞争力准确度的同时，将预填充浮点运算量降低了27.48%。实际部署进一步证实了显著的效率提升，在长语音输入上实现了约1.7倍的内存节省和约1.1倍的首标记生成加速。我们的研究结果挑战了完全独立标记表征的必要性，为提升LSLM效率提供了新的视角。

Abstract

Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy. Practical deployment further confirms significant efficiency gains, yielding up to ~1.7× memory savings and ~1.1× faster time-to-first-token on long utterances. Our results challenge the necessity of fully distinct token representations, providing new perspectives on LSLM efficiency.

Keywords: Large Speech Language Models, token redundancy, representation compression, Affinity Pooling, inference efficiency, FLOPs reduction, memory savings, time-to-first-token
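
The abstract describes Affinity Pooling only as a training-free, similarity-based token merging mechanism and gives no algorithmic detail. The sketch below is therefore not the authors' method. It illustrates the general idea under stated assumptions: adjacent token vectors are merged greedily by mean pooling when their cosine similarity to the running group mean reaches a threshold. The function names, the greedy adjacent-run strategy, and the `threshold` value are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(group):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(group) for col in zip(*group)]


def merge_similar_tokens(tokens, threshold=0.9):
    """Greedily mean-pool runs of adjacent token vectors.

    A token joins the current run when its cosine similarity to the run's
    mean is at least `threshold` (a hypothetical value); otherwise the run
    is closed and emitted as one pooled vector.
    """
    merged, group = [], []
    for vec in tokens:
        if group:
            if cosine(mean_vector(group), vec) >= threshold:
                group.append(vec)
                continue
            merged.append(mean_vector(group))
        group = [vec]
    if group:
        merged.append(mean_vector(group))
    return merged


# Toy sequence: two near-duplicate vectors followed by two others.
sequence = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
compressed = merge_similar_tokens(sequence)
print(len(sequence), "->", len(compressed))
```

Shorter sequences reduce the work done by every later layer, which is the kind of saving the abstract reports for prefilling FLOPs. Where in the network such merging is applied, and how similarity is actually measured, would have to come from the paper itself.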

85. ❌ Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible–Infrared Evasion

Authors: Miguel A. DelaCruz, Patricia Mae Santos, Rafael T. Navarro Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06865v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

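Scoring rationale: The paper studies physical adversarial attacks on AI surveillance systems, focusing on computer vision tasks such as detection, tracking, and multimodal (visible–infrared) sensing. It belongs to computer vision security rather than to innovation in large-model or deep-learning techniques, and it does not cover applications of large models in other domains. All scoring keywords relate directly to large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper surveys physical adversarial attacks from the perspective of surveillance systems. It argues that attack effectiveness should be assessed along temporal persistence, sensing modality, carrier realism, and system-level objective, and it notes that isolated single-frame benchmarks cannot reliably assess surveillance robustness.

Abstract (Chinese translation)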

物理对抗攻击的研究日益聚焦于接近实际部署监控系统的场景，而非孤立的图像基准测试。在这些场景中，人员检测、多目标跟踪、可见光-红外感知以及攻击载体的实际形态均需同时考量。这改变了我们解读相关文献的方式：一种在单帧图像中抑制检测器的扰动，若目标身份随时间推移被恢复，则其实际效果可能有限；仅基于RGB图像的研究结论，对于依赖可见光与热成像协同输入的夜间系统可能参考价值不大；而显眼的对抗补丁所隐含的威胁模型，也与可穿戴或选择性激活的载体有所不同。本文从这一面向监控的视角出发，综述物理对抗攻击。我们并非试图完整罗列计算机视觉中的所有物理攻击，而是聚焦于在监控系统中变得至关重要的技术问题：时间持续性、感知模态、载体真实性以及系统级目标。我们通过一个四部分分类法梳理已有工作，并讨论多目标跟踪、双模态可见光-红外规避以及可控服装方面的最新成果如何反映了该领域更广泛的变化。我们还总结了评估实践和尚存的不足，包括距离鲁棒性、相机成像管线差异、身份级度量指标以及激活感知测试。由此得出的结论是：监控系统的鲁棒性无法仅从孤立的逐帧基准测试中可靠判断；必须将其作为一个随时间展开、跨传感器协作、并在实际物理部署约束下的系统性问题来审视。

Abstract

Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible–infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible–infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing. The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.

Keywords: Physical adversarial attacks, AI surveillance systems, Person detection, Multi-object tracking, Visible-infrared sensing, Temporal persistence, System-level evaluation, Surveillance robustness

86. ❌ Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings

Authors: Mingchen Li, Wajdi Aljedaani, Yingjie Liu, Navyasri Meka, Xuan Lu, Xinyue Ye, Junhua Ding, Yunhe Feng Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

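Scoring rationale: The paper's core topic is bias in how LLMs represent skin-toned emojis, which directly matches the 'Large Language Models' keyword (10). It exposes the bias by analysing semantic consistency, representational similarity, and sentiment polarity, which relates partially to 'Mechanistic Interpretability' and 'Hallucination Mitigation' (5 each), since it touches on model interpretability and on factuality/truthfulness assessment. Other keywords, such as MoE, SLMs, training methods, reasoning techniques, compression, and agents, are not covered (0).

!!! tip deepseek-chat TL;DR

The paper presents the first large-scale comparison of bias in skin-toned emoji representations between dedicated emoji embedding models and four modern LLMs. It finds that LLMs support skin tone modifiers well but show systemic bias, while the dedicated models have severe deficiencies, and it stresses the urgency of auditing and mitigating these representational harms.

Abstract (Chinese translation)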

肤色表情符号对于在线交流中个人身份认同与社会包容性的构建至关重要。随着人工智能模型，特别是大语言模型（LLMs）日益成为网络平台交互的中介，这些系统可能通过对此类符号的表征延续社会偏见，这一风险引发了重大关切。本文首次针对两类不同模型在肤色表情符号表征中的偏见进行了大规模比较研究。我们系统性地评估了专用表情符号嵌入模型（emoji2vec、emoji-sw2v）与四种现代大语言模型（Llama、Gemma、Qwen 和 Mistral）。我们的分析首先揭示了一个关键的性能差距：尽管大语言模型展现出对肤色修饰符的稳健支持，但广泛使用的专用表情符号模型却表现出严重缺陷。更重要的是，通过对语义一致性、表征相似性、情感极性和核心偏见的多元探究，我们发现了系统性的差异。我们找到了证据表明，不同肤色的表情符号存在情感倾向的偏差和含义的不一致，这凸显了这些基础模型中潜在的偏见。我们的研究结果强调，开发者和平台迫切需要审计并减轻这些表征性危害，以确保人工智能在网络上的作用能促进真正的公平，而非强化社会偏见。

Abstract

Skin-toned emojis are crucial for fostering personal identity and social inclusion in online communication. As AI models, particularly Large Language Models (LLMs), increasingly mediate interactions on web platforms, the risk that these systems perpetuate societal biases through their representation of such symbols is a significant concern. This paper presents the first large-scale comparative study of bias in skin-toned emoji representations across two distinct model classes. We systematically evaluate dedicated emoji embedding models (emoji2vec, emoji-sw2v) against four modern LLMs (Llama, Gemma, Qwen, and Mistral). Our analysis first reveals a critical performance gap: while LLMs demonstrate robust support for skin tone modifiers, widely-used specialized emoji models exhibit severe deficiencies. More importantly, a multi-faceted investigation into semantic consistency, representational similarity, sentiment polarity, and core biases uncovers systemic disparities. We find evidence of skewed sentiment and inconsistent meanings associated with emojis across different skin tones, highlighting latent biases within these foundational models. Our findings underscore the urgent need for developers and platforms to audit and mitigate these representational harms, ensuring that AI’s role on the web promotes genuine equity rather than reinforcing societal biases.

Keywords: Large Language Models, bias, emoji representations, skin tone, semantic consistency, sentiment polarity, equity, AI fairness

87. ❌ MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Authors: Xiaotian Luo, Xun Jiang, Jiangcheng Wu Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

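Scoring rationale: The paper's core topic is evaluating LLM robustness in medical dialogue diagnosis, which is highly relevant to 'Large Language Models' (10) and is an application of 'AI for Science' in the biomedical domain (10). It deals with LLM diagnostic accuracy and the impact of adversarial behaviors, which relates partially to 'Factuality' (5), although it does not go into technical detail on that point. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how robust LLMs are to adversarial patient behaviors in medical diagnostic dialogue. It finds that information pollution (fabricated symptoms) causes larger accuracy drops than information deficit (withheld information), and that models show distinct vulnerability profiles.

Abstract (Chinese translation)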

交互式医疗对话基准测试表明，当面对非合作患者时，大语言模型的诊断准确性会显著下降。然而，现有方法要么在应用对抗性行为时缺乏分级的严重程度和病例特异性依据，要么将患者的不合作行为简化为单一的、未分级的维度，且均未分析跨维度间的交互作用。
我们提出了MedDialBench基准，旨在对个体患者行为维度如何影响大语言模型诊断鲁棒性进行可控的剂量-效应表征。该基准将患者行为分解为五个维度——逻辑一致性、健康认知、表达风格、信息透露和态度——每个维度均设有分级的严重程度，并配有病例特异性的行为脚本。这种可控的析因设计支持分级敏感性分析、剂量-效应分析以及跨维度交互作用检测。
通过对五个前沿大语言模型在7,225个对话（85个病例 × 17种配置 × 5个模型）中进行评估，我们发现了一个根本性的不对称现象：信息污染（捏造症状）导致的准确性下降幅度是信息缺失（隐瞒信息）的1.7至3.4倍，并且捏造是唯一在所有五个模型上都达到统计显著性的配置（McNemar检验 p < 0.05）。在六种维度组合中，捏造是产生超加性交互作用的唯一驱动因素：所有三个涉及捏造的组合都产生了0.70-0.81的观测/预期比值（即35-44%符合条件的病例在组合情况下失败，而在单一维度下却成功），而所有不涉及捏造的组合则显示出纯粹的加性效应（观测/预期比值 ~ 1.0）。问询策略能缓解信息缺失的影响，但对信息污染无效：详尽的提问可以恢复被隐瞒的信息，但无法补偿捏造的输入。各模型表现出不同的脆弱性特征，在最坏情况下准确性下降幅度从38.8到54.1个百分点不等。

Abstract

Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis, and none analyze cross-dimension interactions. We introduce MedDialBench, a benchmark enabling controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. It decomposes patient behavior into five dimensions – Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude – each with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection. Evaluating five frontier LLMs across 7,225 dialogues (85 cases x 17 configurations x 5 models), we find a fundamental asymmetry: information pollution (fabricating symptoms) produces 1.7-3.4x larger accuracy drops than information deficit (withholding information), and fabricating is the only configuration achieving statistical significance across all five models (McNemar p < 0.05). Among six dimension combinations, fabricating is the sole driver of super-additive interaction: all three fabricating-involving pairs produce O/E ratios of 0.70-0.81 (35-44% of eligible cases fail under the combination despite succeeding under each dimension alone), while all non-fabricating pairs show purely additive effects (O/E ~ 1.0). Inquiry strategy moderates deficit but not pollution: exhaustive questioning recovers withheld information, but cannot compensate for fabricated inputs. Models exhibit distinct vulnerability profiles, with worst-case drops ranging from 38.8 to 54.1 percentage points.

Keywords: LLM diagnostic robustness, medical dialogue benchmark, adversarial patient behaviors, information pollution, information deficit, graded sensitivity analysis, cross-dimension interactions, MedDialBench
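
The abstract reports significance with a McNemar test on paired outcomes (the same cases diagnosed with and without an adversarial configuration) but does not say which variant was used. As an illustration only, the sketch below computes the two-sided exact (binomial) McNemar p-value from the two discordant counts and checks the stated dialogue total. The choice of the exact variant, the function name, and the example counts are assumptions, not figures from the paper.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: cases correct at baseline but wrong under the perturbation
    c: cases wrong at baseline but correct under the perturbation
    Concordant pairs do not enter the statistic.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Dialogue count stated in the abstract: cases x configurations x models.
total_dialogues = 85 * 17 * 5
print(total_dialogues)

# Hypothetical discordant counts, not taken from the paper.
print(mcnemar_exact(10, 0))
print(mcnemar_exact(5, 5))
```

Only the discordant pairs matter here: a configuration that flips many correct diagnoses to wrong and few in the other direction yields a small p-value regardless of how many cases were unaffected.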

88. ❌ HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Authors: Yijie Zhong, Yunfan Gao, Haofen Wang Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06845v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

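Scoring rationale: HingeMem focuses on long-term memory for dialogue systems, which is application-oriented large-model research. Core relevant keywords: 1) 'Large Language Models' (10): the paper explicitly runs experiments across LLM scales (from 0.6B to production-tier models), which is its core application basis; 2) 'Retrieval-Augmented Generation' (10): the paper proposes boundary-guided long-term memory with query-adaptive retrieval, which is in essence an improvement to RAG; 3) 'LLM Agents' (10): the paper targets sustained interaction in dialogue systems (which can be viewed as agents), so it counts as agentic workflow research. Partially relevant keywords: 'Small Language Models' (5): the experiments include a 0.6B model; 'Context Window Extension' (5): long-term memory handling can be seen as one way of extending context; 'Mechanistic Interpretability' (5): boundary-triggered hyperedges provide an interpretable indexing interface. The remaining keywords are unrelated to the paper's dialogue memory and retrieval optimization themes and score 0.

!!! tip deepseek-chat TL;DR

To address the adaptability and efficiency problems of long-term memory in dialogue systems, the paper proposes HingeMem, which combines boundary-guided event segmentation with query-adaptive retrieval. It reports about a 20% performance improvement across several LLM scales together with a significant reduction in computational cost.

Abstract (Chinese translation)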

长期记忆对于支持持续、可持续且个性化交互的对话系统至关重要。然而，现有方法依赖于连续摘要或基于开放信息抽取的图构建，并搭配固定的Top-k检索机制，导致其跨查询类别的适应性有限且计算开销较高。本文提出HingeMem，一种边界引导的长期记忆机制，它通过事件分割理论实现操作化，利用边界触发的超边在四个要素（人物、时间、地点和主题）上构建可解释的索引接口。当任一要素发生变化时，HingeMem会划定边界并写入当前片段，从而减少冗余操作并保留关键上下文。为满足多样化的信息需求并实现鲁棒高效的检索，HingeMem引入了查询自适应检索机制，该机制联合决策：（a）检索什么：确定基于要素索引记忆的查询条件路由；（b）检索多少：根据估计的查询类型控制检索深度。在LOCOMO数据集上跨大语言模型规模（从0.6B到生产级模型，例如Qwen3-0.6B至Qwen-Flash）的大量实验表明，在无需指定查询类别的情况下，HingeMem相比强基线实现了约20%的相对性能提升，同时显著降低了计算成本（与HippoRAG2相比，问答令牌成本降低68%）。除了推进记忆建模外，HingeMem的自适应检索特性使其非常适合需要高效且可信长期记忆的Web应用场景。

Abstract

Long-term memory is critical for dialogue systems that support continuous, sustainable, and personalized interactions. However, existing methods rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, leading to limited adaptability across query categories and high computational overhead. In this paper, we propose HingeMem, a boundary-guided long-term memory that operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any such element changes, HingeMem draws a boundary and writes the current segment, thereby reducing redundant operations and preserving salient context. To enable robust and efficient retrieval under diverse information needs, HingeMem introduces query-adaptive retrieval mechanisms that jointly decide (a) *what to retrieve*: determine the query-conditioned routing over the element-indexed memory; (b) *how much to retrieve*: control the retrieval depth based on the estimated query type. Extensive experiments across LLM scales (from 0.6B to production-tier models; e.g., Qwen3-0.6B to Qwen-Flash) on LOCOMO show that HingeMem achieves approximately 20% relative improvement over strong baselines without query categories specification, while reducing computational cost (68% lower question answering token cost compared to HippoRAG2). Beyond advancing memory modeling, HingeMem’s adaptive retrieval makes it a strong fit for web applications requiring efficient and trustworthy memory over extended interactions.

Keywords: long-term memory, dialogue systems, retrieval-augmented generation, query-adaptive retrieval, event segmentation, LLM agents, computational efficiency, boundary-guided memory
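
The boundary rule stated in the abstract (draw a boundary and write the current segment whenever person, time, location, or topic changes) can be illustrated with a small sketch. This is not HingeMem's implementation. It assumes each dialogue turn already carries the four element labels, whereas the abstract does not say how those elements are detected. The dictionary layout and function name are hypothetical.

```python
ELEMENTS = ("person", "time", "location", "topic")


def segment_by_boundaries(turns):
    """Split dialogue turns into segments keyed by the four elements.

    A segment is closed and written out as soon as any element of the
    incoming turn differs from the previous turn.
    """
    segments, current, prev_key = [], [], None
    for turn in turns:
        key = tuple(turn[e] for e in ELEMENTS)
        if current and key != prev_key:
            segments.append({"key": dict(zip(ELEMENTS, prev_key)), "turns": current})
            current = []
        current.append(turn["text"])
        prev_key = key
    if current:
        segments.append({"key": dict(zip(ELEMENTS, prev_key)), "turns": current})
    return segments


# Toy dialogue with pre-annotated elements (hypothetical labels).
dialogue = [
    {"person": "Ann", "time": "Mon", "location": "home", "topic": "travel", "text": "a"},
    {"person": "Ann", "time": "Mon", "location": "home", "topic": "travel", "text": "b"},
    {"person": "Ann", "time": "Mon", "location": "home", "topic": "food", "text": "c"},
]
segments = segment_by_boundaries(dialogue)
for seg in segments:
    print(seg["key"]["topic"], seg["turns"])
```

Because each segment is stored under its element values, a retriever could route a query to the matching elements first and then decide how many segments to read, which is the role the abstract assigns to query-adaptive retrieval.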

89. ❌ Explaining Neural Networks in Preference Learning: a Post-hoc Inductive Logic Programming Approach

Authors: Daniele Fossemò, Filippo Mignosi, Giuseppe Placidi, Luca Raggioli, Matteo Spezialetti, Fabio Aurelio D’Asaro Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06838v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

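Scoring rationale: The paper uses inductive logic programming (ILASP) to explain the behavior of neural networks on preference learning tasks, which places it in explainable AI (XAI). Its core is a model explanation method, unrelated to most large-model keywords (LLM, MoE, SFT, RLHF, RAG, and so on). The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI', because the paper directly studies explainability methods for neural networks; it scores 10 (highly relevant). All other keywords are not covered and score 0. The paper does not mention any of the designated expert authors.

!!! tip deepseek-chat TL;DR

The paper proposes using Inductive Learning of Answer Set Programs (ILASP) to explain the decisions of neural networks on user preference learning tasks, and it applies dimensionality reduction to improve explanation efficiency and transparency in high-dimensional feature spaces.

Abstract (Chinese translation)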

本文提出在用户偏好学习这一特定场景中，采用从答案集学习的方法来近似黑盒模型（如神经网络）。我们重点探索如何利用ILASP（答案集程序的归纳学习）通过弱约束来近似偏好学习系统。我们构建了一个关于用户对菜谱偏好的数据集，用于训练神经网络，并试图以ILASP对其进行近似。实验研究了ILASP作为神经网络的全局近似器与局部近似器的表现。这些实验旨在应对以下挑战：在特征空间维度不断增高的情况下，如何有效近似神经网络，同时保持对目标模型的适当保真度并控制计算时间的增长。为应对这一挑战，我们提出一种预处理步骤，利用主成分分析在保持解释透明性的同时降低数据集的维度。本文已投稿至《逻辑程序设计的理论与实践》（TPLP）期刊审议。

Abstract

In this paper, we propose using Learning from Answer Sets to approximate black-box models, such as Neural Networks (NN), in the specific case of learning user preferences. We specifically explore the use of ILASP (Inductive Learning of Answer Set Programs) to approximate preference learning systems through weak constraints. We have created a dataset on user preferences over a set of recipes, which is used to train the NNs that we aim to approximate with ILASP. Our experiments investigate ILASP both as a global and a local approximator of the NNs. These experiments address the challenge of approximating NNs working on increasingly high-dimensional feature spaces while achieving appropriate fidelity on the target model and limiting the increase in computational time. To handle this challenge, we propose a preprocessing step that exploits Principal Component Analysis to reduce the dataset’s dimensionality while keeping our explanations transparent. Under consideration for publication in Theory and Practice of Logic Programming (TPLP).

Keywords: Neural Networks, Preference Learning, Explainable AI, Inductive Logic Programming, ILASP, Model Approximation, Principal Component Analysis, Answer Set Programming
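
The abstract speaks of achieving appropriate fidelity on the target model without defining the metric. One common reading in post-hoc explanation work, assumed here, is the agreement rate between the black-box model and its interpretable surrogate. The sketch below computes that rate on pairwise preferences. The toy models, the feature tuples, and the function name are hypothetical and stand in for the neural network and the ILASP-learned program.

```python
def fidelity(black_box, surrogate, pairs):
    """Fraction of item pairs on which the surrogate reproduces the
    black-box preference (True means the first item is preferred)."""
    if not pairs:
        raise ValueError("need at least one pair")
    agree = sum(black_box(a, b) == surrogate(a, b) for a, b in pairs)
    return agree / len(pairs)


def toy_black_box(a, b):
    """Stand-in for the trained network: prefers the larger feature sum."""
    return sum(a) > sum(b)


def toy_surrogate(a, b):
    """Stand-in for a learned rule: looks at the first feature only."""
    return a[0] > b[0]


# Hypothetical recipe pairs, each recipe described by two features.
recipe_pairs = [
    ((3, 1), (1, 1)),
    ((1, 5), (2, 1)),
    ((2, 2), (1, 0)),
    ((0, 4), (1, 1)),
]
print(fidelity(toy_black_box, toy_surrogate, recipe_pairs))
```

A surrogate that is simple enough to read but disagrees with the network on many pairs explains little, which is why fidelity has to be tracked alongside transparency and runtime as the feature space grows.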

90. ❌ On the Step Length Confounding in LLM Reasoning Data Selection

Authors: Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu, Ximing Li, Xiaosong Yuan, Sinan Fan, Jun Zhang, Jieping Ye Venue/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

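Scoring rationale: The paper's core topic is step length confounding in LLM reasoning data selection, which directly involves LLMs, supervised fine-tuning, and chain-of-thought reasoning, so those keywords score 10. It also concerns data quality and reasoning depth, which relate partially to 'Scaling Laws AND Data Quality' and 'System 2 Thinking' (5 each). Other keywords such as MoE, SLMs, alignment, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that naturalness-based selection of LLM reasoning data suffers from step length confounding, meaning it prefers samples with longer reasoning steps over higher-quality ones, and it proposes two variant methods that mitigate the problem.

Abstract (Chinese translation)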

近期，大型推理模型通过在高质量大规模数据集上进行监督微调，在需要长链思维推理的复杂任务中展现出强大性能。为构建此类数据集，现有流程通常从能力更强的大型语言模型生成长推理数据，并采用人工启发式或基于自然度的筛选方法来过滤高质量样本。尽管基于自然度的数据选择方法（即通过LLM赋予的平均对数概率对数据排序）已被证实有效，但我们的分析表明，当应用于LLM推理数据集时，该方法会系统性地偏好推理步骤更长的样本（即每步包含更多标记），而非质量更高的样本，我们将此现象称为步长混淆。通过定量分析，我们将该现象归因于推理步骤中首标记的低概率特性：更长的步骤会稀释其影响，从而抬升平均对数概率值。为解决此问题，我们提出两种改进方法：ASLEC-DROP（在计算平均对数概率时剔除首标记概率）和ASLEC-CASL（应用因果去偏回归消除首标记的混淆效应）。在四种LLM和五个评估基准上的实验证明了我们方法在缓解步长混淆问题上的有效性。

Abstract

Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manually heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens’ confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.

Keywords: Large Language Models, reasoning data selection, step length confounding, supervised fine-tuning, chain-of-thought reasoning, data quality, log probability, ASLEC
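
The dilution effect described in the abstract can be reproduced with toy numbers. The sketch below assumes that the first token of every reasoning step has a low log probability and all other tokens a high one, then compares the plain average with an average that skips first tokens, in the spirit of the abstract's description of ASLEC-DROP. The log-probability values and function names are illustrative assumptions, not the authors' code or data.

```python
def avg_logprob(steps):
    """Mean log probability over all tokens of all reasoning steps."""
    tokens = [lp for step in steps for lp in step]
    return sum(tokens) / len(tokens)


def avg_logprob_drop_first(steps):
    """Same mean, but skipping the first token of every step.

    Assumes every step has at least two tokens.
    """
    tokens = [lp for step in steps for lp in step[1:]]
    return sum(tokens) / len(tokens)


# Toy samples with 12 tokens each: first token of a step scores -4.0,
# every other token scores -0.5 (hypothetical values).
short_steps = [[-4.0, -0.5, -0.5]] * 4   # 4 steps of 3 tokens
long_steps = [[-4.0] + [-0.5] * 5] * 2   # 2 steps of 6 tokens

print(avg_logprob(short_steps), avg_logprob(long_steps))
print(avg_logprob_drop_first(short_steps), avg_logprob_drop_first(long_steps))
```

With identical per-token quality, the sample with longer steps gets the higher plain average only because it contains fewer low-probability first tokens. Once those tokens are excluded, the two samples score the same, so the ranking no longer rewards step length by itself.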

91. ❌ Towards Privacy-Preserving Large Language Model: Text-free Inference Through Alignment and Adaptation

Authors: Jeongho Yoon, Chanhee Park, Yongchan Chun, Hyeonseok Moon, Heuiseok Lim Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06831v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

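Scoring rationale: The paper's core topic is privacy-preserving LLM inference: a two-stage training pipeline (alignment, then fine-tuning) lets the server run inference without receiving raw text. It is highly relevant to LLMs, Fine-tuning, and Alignment (10 each) and partly relevant to Domain Adaptation (5); no other keyword is covered (0).

!!! tip deepseek-chat TL;DR

The paper proposes Privacy-Preserving Fine-Tuning (PPFT), which uses a client-side encoder and a server-side projection module to enable efficient inference without transmitting raw text, striking a favorable balance between privacy protection and model performance.

Abstract (Chinese translation)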

当前基于大语言模型的服务通常要求用户提交原始文本，而无论其敏感性如何。尽管这种做法直观，却带来了显著的隐私风险，因为未经授权的访问可能暴露个人、医疗或法律信息。尽管先前的防御措施致力于降低这些风险，但它们往往产生大量计算开销并降低模型性能。为克服这种隐私与效率的权衡，我们提出了隐私保护微调（Privacy-Preserving Fine-Tuning, PPFT），这是一种新颖的训练流程，无需传输原始提示文本，同时为客户端和服务提供商在隐私保护与模型效用之间保持了有利的平衡。我们的方法分两个阶段运行：首先，我们训练一个客户端编码器以及服务器端的投影模块和大语言模型，使服务器能够基于k池化提示嵌入而非原始文本进行条件生成；其次，我们使用注入噪声的嵌入在私有领域特定数据上微调投影模块和大语言模型，从而实现有效适应，而无需暴露明文提示，也无需访问解码器的内部参数。在领域特定和通用基准测试上的大量实验表明，PPFT在隐私与效用之间实现了显著的平衡，与无噪声上限相比，在性能下降最小的情况下保持了竞争力。

Abstract

Current LLM-based services typically require users to submit raw text regardless of its sensitivity. While intuitive, such practice introduces substantial privacy risks, as unauthorized access may expose personal, medical, or legal information. Although prior defenses strived to mitigate these risks, they often incur substantial computational overhead and degrade model performance. To overcome this privacy-efficiency trade-off, we introduce Privacy-Preserving Fine-Tuning (PPFT), a novel training pipeline that eliminates the need for transmitting raw prompt text while maintaining a favorable balance between privacy preservation and model utility for both clients and service providers. Our approach operates in two stages: first, we train a client-side encoder together with a server-side projection module and LLM, enabling the server to condition on k-pooled prompt embeddings instead of raw text; second, we fine-tune the projection module and LLM on private, domain-specific data using noise-injected embeddings, allowing effective adaptation without exposing plain text prompts and requiring access to the decoder’s internal parameters. Extensive experiments on domain-specific and general benchmarks demonstrate that PPFT achieves a striking balance between privacy and utility, maintaining competitive performance with minimal degradation compared to noise-free upper bounds.

Keywords: Privacy-Preserving, Large Language Models, Fine-Tuning, Alignment, Domain Adaptation, Text-free Inference, Embedding-based Inference, Privacy-Efficiency Trade-off
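
The abstract says the server conditions on k-pooled, noise-injected prompt embeddings, but it does not define the pooling operator or the noise distribution. The sketch below assumes mean pooling over consecutive groups of k token embeddings and zero-mean Gaussian noise. The function names `k_pool` and `add_gaussian_noise` are hypothetical, and this is not the PPFT implementation.

```python
import random
from typing import List

Vector = List[float]


def k_pool(embeddings: List[Vector], k: int) -> List[Vector]:
    """Mean-pool consecutive groups of k token embeddings (assumed operator)."""
    pooled = []
    for start in range(0, len(embeddings), k):
        group = embeddings[start:start + k]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


def add_gaussian_noise(vectors: List[Vector], sigma: float, rng: random.Random) -> List[Vector]:
    """Add zero-mean Gaussian noise to every coordinate (assumed noise model)."""
    return [[x + rng.gauss(0.0, sigma) for x in vec] for vec in vectors]


# Six hypothetical 2-dimensional token embeddings, pooled with k = 3.
tokens = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 5.0], [0.0, 6.0]]
pooled = k_pool(tokens, k=3)  # two vectors leave the client instead of six
noisy = add_gaussian_noise(pooled, sigma=0.1, rng=random.Random(0))
print(pooled)
print(noisy)
```

Under these assumptions, only the pooled and perturbed vectors would be sent to the server, so neither the raw text nor the per-token embeddings leave the client.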

92. ❌ WRAP++: Web discoveRy Amplified Pretraining

Authors: Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06829v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

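Scoring rationale: The paper's core topic is data augmentation for LLM pretraining: it discovers cross-document relationships and generates synthetic QA data, so it is highly relevant to 'Large Language Models' and 'Pre-training' (10 each). It touches on data quality and scaling, giving partial relevance to 'Scaling Laws AND Data Quality' (5). Other keywords such as MoE, SFT, RAG, reasoning methods, and AI for Science are neither mentioned in the abstract nor related to it, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes WRAP++, which strengthens LLM pretraining by discovering cross-document relationships between web pages and generating joint QA data. Experiments show substantial gains on knowledge question answering and sustained scaling gains.

Abstract (Chinese translation)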

合成数据重述已成为增强大语言模型预训练期间知识获取的一项强大技术。然而，现有方法仅在单文档层面操作，孤立地重写单个网页。这限制了合成示例只能包含文档内部知识，遗漏了跨文档关系，使得事实知识缺乏关联性上下文。我们提出WRAP++（网络发现增强预训练），该方法通过从网页超链接中发现跨文档关系，并为每个发现的文档对合成联合问答，从而增强事实知识的关联性上下文。具体而言，WRAP++发现包括双向链接和共同提及在内的高置信度关系模式，并合成需要跨两个文档进行推理的问答。这产生了任一源文档单独均不具备的关系性知识，为相同事实创造了多样化的知识切入点。由于有效实体对的数量呈组合式增长，这种发现驱动的合成方法也使得数据规模远超单文档重述。在维基百科上实例化WRAP++，我们将约84亿词元的原始文本增强为800亿词元的跨文档问答数据。在SimpleQA基准测试中，基于OLMo架构、使用WRAP++数据训练的7B和32B规模模型均显著优于单文档方法，并展现出持续的扩展收益，这凸显了跨文档知识发现与增强的优势。

Abstract

Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.

Keywords: Large Language Models, Pretraining, Synthetic Data, Cross-document Relationships, Knowledge Amplification, QA Generation, Data Scaling, Web Hyperlinks
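
The relation-discovery step can be sketched on a toy hyperlink graph. The abstract names dual-links and co-mentions without defining them, so the definitions here are assumptions: a dual-link is a pair of pages that link to each other, and a co-mention is a pair of pages linked from the same third page. The page names and function names are illustrative only.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def dual_links(links: Dict[str, Set[str]]) -> Set[Pair]:
    """Pairs of pages that link to each other (assumed meaning of 'dual-link')."""
    return {
        tuple(sorted((src, dst)))
        for src, targets in links.items()
        for dst in targets
        if src != dst and src in links.get(dst, set())
    }


def co_mentions(links: Dict[str, Set[str]]) -> Set[Pair]:
    """Pairs of pages linked from the same page (assumed meaning of 'co-mention')."""
    pairs: Set[Pair] = set()
    for targets in links.values():
        pairs.update(combinations(sorted(targets), 2))
    return pairs


# Toy outgoing-link graph: page -> pages it links to.
links = {
    "Paris": {"France", "Seine"},
    "France": {"Paris"},
    "Seine": {"Paris", "France"},
}
print(sorted(dual_links(links)))
print(sorted(co_mentions(links)))
```

Each discovered pair would then be handed to a QA synthesizer. Because candidate pairs grow roughly quadratically with the number of linked pages, the pair count can far exceed the page count, which matches the combinatorial growth the abstract describes.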

93. ❌ OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale

Authors: Dihong Jiang, Ruoqi Cao, Zhiyuan Dang, Li Huang, Qingsong Zhang, Zhiyu Wang, Shihao Piao, Shenggao Zhu, Jianlong Chang, Zhouchen Lin, Qi Tian Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06814v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

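Scoring rationale: The paper is mainly about benchmarking and comparing models for tabular data. It has only moderate relevance to 'Large Language Models OR LLMs OR Foundation Models' (5): it mentions using large language models to categorize datasets, but that is not the core of the work. All other keywords are unrelated to the topic (0); the paper does not address LLM technical principles, training methods, inference optimization, alignment, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper builds OmniTabBench, a large-scale tabular benchmark of 3030 datasets, and evaluates traditional tree models, neural networks, and foundation models on tabular data. It finds that no single model dominates in all cases and uses metafeature analysis to identify the conditions under which each model family works best.

Abstract (Chinese translation)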

尽管传统的基于树的集成方法长期主导着表格任务，但深度神经网络和新兴的基础模型已对其主导地位构成挑战，然而学界尚未就一种普遍优越的范式达成共识。现有基准通常包含不足100个数据集，这引发了关于评估充分性和潜在选择偏见的担忧。为应对这些局限，我们提出了OmniTabBench——迄今为止规模最大的表格基准，它涵盖3030个数据集，涉及多样化的任务；这些数据集从多种来源全面收集，并利用大语言模型按行业进行分类。我们在OmniTabBench上对所有模型家族中的最先进模型进行了前所未有的大规模实证评估，结果证实并不存在一个占据绝对优势的优胜者。此外，通过解耦的元特征分析——该分析考察了数据集规模、特征类型、特征与目标偏度/峰度等个体属性——我们阐明了更有利于特定模型类别的条件，从而提供了比先前复合指标研究更清晰、更具可操作性的指导。

Abstract

While traditional tree-based ensemble methods have long dominated tabular tasks, deep neural networks and emerging foundation models have challenged this primacy, yet no consensus exists on a universally superior paradigm. Existing benchmarks typically contain fewer than 100 datasets, raising concerns about evaluation sufficiency and potential selection biases. To address these limitations, we introduce OmniTabBench, the largest tabular benchmark to date, comprising 3030 datasets spanning diverse tasks that are comprehensively collected from diverse sources and categorized by industry using large language models. We conduct an unprecedented large-scale empirical evaluation of state-of-the-art models from all model families on OmniTabBench, confirming the absence of a dominant winner. Furthermore, through a decoupled metafeature analysis, which examines individual properties such as dataset size, feature types, feature and target skewness/kurtosis, we elucidate conditions favoring specific model categories, providing clearer, more actionable guidance than prior compound-metric studies.

Keywords: tabular data, benchmark, foundation models, empirical evaluation, model comparison, dataset categorization, metafeature analysis, OmniTabBench
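
Two of the metafeatures named in the abstract, skewness and kurtosis, are standard moment statistics. The sketch below computes the population versions for a single feature column. Whether the paper uses population or sample-corrected estimators is not stated, so that choice is an assumption.

```python
from statistics import fmean
from typing import List


def skewness(values: List[float]) -> float:
    """Population skewness: third standardized moment."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m3 = fmean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5


def excess_kurtosis(values: List[float]) -> float:
    """Population excess kurtosis: fourth standardized moment minus 3."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m4 = fmean([(v - mu) ** 4 for v in values])
    return m4 / m2 ** 2 - 3.0


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 1.0, 1.0, 10.0]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

A constant column has zero variance and would raise `ZeroDivisionError` here, so a real pipeline would need to filter or special-case such features.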

94. ❌ SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Authors: Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, Wenke Huang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06811v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

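Scoring rationale: The paper studies backdoor attacks on skill-based agent systems, centered on LLM-driven agents. It is highly relevant to 'LLM Agents' (10) because it directly studies skill-based agent systems; relevant to 'Tool Use' (8) because skills can be viewed as tool or function calls; partly relevant to 'Multi-agent Systems' (5) through skill composition and ecosystems; and relevant to 'Large Language Models' (8) because the experiments use a GPT-5.2 model. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes SkillTrojan, a backdoor attack on skill-based agent systems that embeds a malicious payload in seemingly normal skills and executes it under a specific trigger condition. Experiments report an attack success rate of up to 97.2% on GPT-5.2 with little impact on normal task performance.

Abstract (Chinese translation)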

基于技能的智能体系统通过组合可复用的技能来处理复杂任务，在提升模块化与可扩展性的同时，也引入了一个尚未被充分审视的安全攻击面。本文提出SkillTrojan，这是一种以技能实现（而非模型参数或训练数据）为目标的后门攻击方法。SkillTrojan将恶意逻辑嵌入看似合理的技能中，并利用标准技能组合机制来重构并执行攻击者指定的载荷。该攻击将加密载荷分割至多个表面正常的技能调用中，仅在满足预设触发条件时激活。SkillTrojan还支持从任意技能模板自动合成后门技能，从而实现在基于技能的智能体生态系统中进行规模化传播。为支持系统化评估，我们发布了一个包含3000余个精选后门技能的数据集，涵盖多样化的技能模式及触发-载荷配置。我们在一个典型的基于代码的智能体环境中实例化了SkillTrojan，并评估了其在正常任务效用与攻击成功率两方面的表现。结果表明，技能层级的后门攻击在几乎不影响良性行为的前提下仍能保持极高攻击效率，这揭示了当前基于技能的智能体架构中存在一个关键盲区，并促使我们构建能显式推理技能组合与执行的防御机制。具体而言，在EHR SQL任务中，SkillTrojan在GPT-5.2-1211-Global模型上实现了最高97.2%的攻击成功率（ASR），同时保持了89.3%的正常任务准确率（ACC）。

Abstract

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

Keywords: Skill-based agent systems, Backdoor attack, SkillTrojan, Skill composition, Attack success rate, GPT-5.2, EHR SQL, Security vulnerability

95. ❌ Riemann-Bench: A Benchmark for Moonshot Mathematics

Authors: Suhaas Garre, Erik Knutsen, Sushant Mehta, Edwin Chen Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06802v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

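Scoring rationale: The paper evaluates AI systems on advanced research-level mathematics, which falls under AI for Science (highly relevant, 10). It evaluates frontier models as research agents (LLM Agents, 8), involves open-ended reasoning (System 2 Thinking, 8) and tool use (Tool Use, 5), and mentions multi-step reasoning (Chain of Thought, 5). It does not cover specific LLM techniques (MoE, quantization, training methods, etc.), so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The paper introduces Riemann-Bench, a private benchmark for evaluating AI systems on research-level mathematics beyond olympiad difficulty. Across 100 independent runs per problem, all current frontier models score below 10%, revealing a large gap between competition-level problem solving and genuine research-level mathematical reasoning.

Abstract (Chinese translation)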

近期人工智能系统已在国际数学奥林匹克竞赛中达到金牌级别的表现，在竞赛式问题求解方面展现出卓越能力。然而，竞赛数学仅代表数学推理的狭窄维度：问题取材于有限领域，极少需要高级数学工具，且往往更依赖巧妙的技巧而非深厚的理论知识。我们推出Riemann-Bench，这是一个包含25道专家精心设计问题的私有基准测试集，旨在评估人工智能系统在远超奥数前沿的研究级数学问题上的能力。这些问题由常春藤联盟数学教授、研究生及拥有博士学位的国际数学奥林匹克奖牌得主创作，其作者通常需独立花费数周时间才能完成求解。每道问题均经过两位独立领域专家的双盲验证，专家必须从零开始求解；每道问题都有唯一的闭式解，由程序化验证器进行评估。我们将前沿模型作为无约束的研究智能体进行评估，允许其完全使用编程工具、搜索功能及开放式推理，并使用基于每道问题100次独立运行计算的无偏统计估计量。结果显示，目前所有前沿模型的得分均低于10%，这揭示了奥数级别问题求解与真正研究级数学推理之间的巨大差距。通过保持基准测试集的完全私有性，我们确保所测性能反映的是真实的数学能力，而非对训练数据的记忆。

Abstract

Recent AI systems have achieved gold-medal-level performance on the International Mathematical Olympiad, demonstrating remarkable proficiency at competition-style problem solving. However, competition mathematics represents only a narrow slice of mathematical reasoning: problems are drawn from limited domains, require minimal advanced machinery, and can often reward insightful tricks over deep theoretical knowledge. We introduce Riemann-Bench, a private benchmark of 25 expert-curated problems designed to evaluate AI systems on research-level mathematics that goes far beyond the olympiad frontier. Problems are authored by Ivy League mathematics professors, graduate students, and PhD-holding IMO medalists, and routinely took their authors weeks to solve independently. Each problem undergoes double-blind verification by two independent domain experts who must solve the problem from scratch, and yields a unique, closed-form solution assessed by programmatic verifiers. We evaluate frontier models as unconstrained research agents, with full access to coding tools, search, and open-ended reasoning, using an unbiased statistical estimator computed over 100 independent runs per problem. Our results reveal that all frontier models currently score below 10%, exposing a substantial gap between olympiad-level problem solving and genuine research-level mathematical reasoning. By keeping the benchmark fully private, we ensure that measured performance reflects authentic mathematical capability rather than memorization of training data.

Keywords: mathematical reasoning, AI evaluation, research-level mathematics, benchmark, frontier models, open-ended reasoning, tool use, performance gap
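
The abstract mentions an unbiased statistical estimator computed over 100 independent runs per problem but does not name it. One widely used unbiased estimator for this kind of repeated-sampling evaluation is pass@k, which estimates the probability that at least one of k sampled runs is correct. The sketch below assumes that estimator purely for illustration; the benchmark may use something different.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k sampled runs is correct),
    given c correct runs out of n independent runs."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct run.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


# Hypothetical problem: 5 correct runs out of 100.
print(pass_at_k(n=100, c=5, k=1))
print(pass_at_k(n=100, c=5, k=10))
```

With k = 1 the estimate reduces to the plain success fraction c / n. Averaging per-problem estimates over the 25 problems would give a benchmark-level score under this assumption.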

96. ❌ MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization

Authors: Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

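Scoring rationale: MoBiE focuses on post-training quantization of MoE-based LLMs; its core contribution is the first binarization framework designed for MoE LLMs. It is therefore highly relevant to 'Large Language Models', 'Mixture of Experts', 'Post-training', and 'Quantization' (10 each). It mentions inference speedup, giving partial relevance to 'Speculative Decoding OR Inference Acceleration' (5). Other keywords such as SLMs, Scaling Laws, Instruction Tuning, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the high memory and compute costs of MoE-based LLMs, the paper proposes MoBiE, the first post-training binarization framework for such models. It reduces cross-expert redundancy, improves weight importance estimation, and mitigates routing distortion, markedly improving model performance while preserving efficiency.

Abstract (Chinese translation)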

基于专家混合（Mixture-of-Experts, MoE）的大型语言模型（LLMs）虽具备强大性能，但面临高昂的内存与计算成本。权重二值化能提供极致的效率，然而现有针对稠密LLMs设计的二值化方法在处理MoE特有问题上存在困难，包括跨专家冗余、任务无关的重要性评估以及量化引发的路由偏移。为此，我们提出MoBiE，首个专为基于MoE的LLMs定制的二值化框架。MoBiE建立在三项核心创新之上：1. 采用联合奇异值分解（SVD）以降低跨专家冗余；2. 将全局损失梯度整合到局部海森矩阵度量中，以增强权重重要性评估；3. 引入基于输入零空间的误差约束，以减轻路由失真。值得注意的是，MoBiE在实现这些优化的同时，未产生额外的存储开销，在效率与模型性能之间取得了平衡。大量实验表明，在多种基于MoE的LLMs和基准测试中，MoBiE始终优于最先进的二值化方法。例如，在Qwen3-30B-A3B模型上，MoBiE将困惑度降低了52.2%，将平均零样本性能提升了43.4%，实现了超过2×的推理加速，并进一步缩短了量化时间。代码发布于 https://github.com/Kishon-zzx/MoBiE。

Abstract

Mixture-of-Experts (MoE) based large language models (LLMs) offer strong performance but suffer from high memory and computation costs. Weight binarization provides extreme efficiency, yet existing binary methods designed for dense LLMs struggle with MoE-specific issues, including cross-expert redundancy, task-agnostic importance estimation, and quantization-induced routing shifts. To this end, we propose MoBiE, the first binarization framework tailored for MoE-based LLMs. MoBiE is built on three core innovations: 1. using joint SVD decomposition to reduce cross-expert redundancy; 2. integrating global loss gradients into local Hessian metrics to enhance weight importance estimation; 3. introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, MoBiE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that MoBiE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, MoBiE reduces perplexity by 52.2%, improves average zero-shot performance by 43.4%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://github.com/Kishon-zzx/MoBiE.

Keywords: Mixture-of-Experts, Large Language Models, Post-training Quantization, Weight Binarization, Inference Efficiency, Model Compression, Routing Distortion, Cross-expert Redundancy
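
For readers unfamiliar with weight binarization, the sketch below shows the plain sign-and-scale baseline: each weight row is replaced by a single scale `alpha` times a vector of ±1 signs, with `alpha` set to the mean absolute weight. This is a generic baseline, not MoBiE's method; the joint SVD, gradient-aware importance, and null-space constraint from the abstract are not shown.

```python
from typing import List, Tuple


def binarize_row(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate weights by alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def dequantize(alpha: float, signs: List[int]) -> List[float]:
    """Rebuild the approximate full-precision row from scale and signs."""
    return [alpha * s for s in signs]


def squared_error(original: List[float], approx: List[float]) -> float:
    """Sum of squared differences, the usual reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(original, approx))


row = [0.5, -1.5, 1.0, -1.0]  # hypothetical weight row
alpha, signs = binarize_row(row)
approx = dequantize(alpha, signs)
print(alpha, signs, squared_error(row, approx))
```

Storing one float per row plus one bit per weight is where the memory saving comes from. The reconstruction error printed above is the quantity that more elaborate binarization schemes try to reduce.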

97. ❌ Instance-Adaptive Parametrization for Amortized Variational Inference

Authors: Andrea Pollastro, Andrea Apicella, Francesco Isgrò, Roberto Prevete Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06796v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

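Scoring rationale: The paper studies amortized variational inference in variational autoencoders (VAEs) and proposes an instance-adaptive parametrization (IA-VAE). Although it involves deep learning techniques (VAEs, hypernetworks, parameter modulation), every keyword targets large language models and related techniques (MoE, RLHF, RAG, quantization, etc.) or a specific application area (AI for Science). The paper does not touch language models, LLM technical principles, or LLM applications in science, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes the instance-adaptive variational autoencoder (IA-VAE), in which a hypernetwork generates input-dependent modulations of the encoder to reduce the amortization gap, yielding more accurate posterior approximations and higher ELBO on synthetic data and image benchmarks.

Abstract (Chinese translation)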

潜变量模型，包括变分自编码器（VAE），因其可扩展性和坚实的概率建模基础，仍然是现代深度生成模型中的核心工具。这些模型依赖摊销变分推断来实现高效的后验近似，但这种效率的代价是共享参数化，从而产生了摊销差距。我们提出了实例自适应变分自编码器（IA-VAE），这是一种摊销变分推断框架，其中超网络生成共享编码器的输入相关调制。这使得推断模型能够进行输入特定的自适应，同时保持单次前向传播的效率。通过利用实例特定的参数调制，所提出的方法能够以显著更少的参数实现与标准编码器相当的性能，表明模型容量得到了更高效的利用。在真实后验已知的合成数据上的实验表明，IA-VAE能产生更准确的后验近似并减小摊销差距。同样，在标准图像基准测试中，IA-VAE相较于基线VAE持续提升了保留证据下界（ELBO），并在多次运行中取得了统计显著的增益。这些结果表明，通过实例自适应调制来增加推断参数化的灵活性，是缓解深度生成模型中因摊销导致的次优性能的关键因素。

Abstract

Latent variable models, including variational autoencoders (VAE), remain a central tool in modern deep generative modeling due to their scalability and a well-founded probabilistic formulation. These models rely on amortized variational inference to enable efficient posterior approximation, but this efficiency comes at the cost of a shared parametrization, giving rise to the amortization gap. We propose the instance-adaptive variational autoencoder (IA-VAE), an amortized variational inference framework in which a hypernetwork generates input-dependent modulations of a shared encoder. This enables input-specific adaptation of the inference model while preserving the efficiency of a single forward pass. By leveraging instance-specific parameter modulations, the proposed approach can achieve performance comparable to standard encoders with substantially fewer parameters, indicating a more efficient use of model capacity. Experiments on synthetic data, where the true posterior is known, show that IA-VAE yields more accurate posterior approximations and reduces the amortization gap. Similarly, on standard image benchmarks, IA-VAE consistently improves held-out ELBO over baseline VAEs, with statistically significant gains across multiple runs. These results suggest that increasing the flexibility of the inference parametrization through instance-adaptive modulation is a key factor in mitigating amortization-induced suboptimality in deep generative models.

Keywords: variational autoencoder, amortized variational inference, instance-adaptive modulation, hypernetwork, posterior approximation, amortization gap, parameter efficiency, deep generative models
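
The idea of a hypernetwork modulating a shared encoder can be sketched with a single linear layer. The abstract does not specify the form of the modulation, so this example assumes a per-unit scale and shift (in the style of feature-wise modulation) produced by a second linear map of the same input. All weights and names are illustrative and do not come from the paper.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def linear(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute weight @ x + bias for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def modulated_encoder(x: Vector, enc_w: Matrix, enc_b: Vector,
                      hyper_w: Matrix, hyper_b: Vector) -> Vector:
    """Shared encoder output, rescaled and shifted by a hypernetwork of x."""
    hidden = linear(enc_w, enc_b, x)   # shared parameters, same for every input
    mod = linear(hyper_w, hyper_b, x)  # input-dependent modulation (assumed form)
    n = len(hidden)
    scale, shift = mod[:n], mod[n:]
    return [(1.0 + s) * h + t for h, s, t in zip(hidden, scale, shift)]


enc_w, enc_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
# Hypernetwork emits 2 scales followed by 2 shifts; only the first scale is active.
hyper_w = [[0.5, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
hyper_b = [0.0, 0.0, 0.0, 0.0]

print(modulated_encoder([2.0, 3.0], enc_w, enc_b, hyper_w, hyper_b))
print(modulated_encoder([0.0, 3.0], enc_w, enc_b, hyper_w, hyper_b))
```

When the hypernetwork outputs zeros, the function reduces to the plain shared encoder, so the modulation acts as a per-input correction computed in the same forward pass.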

98. ❌ FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift

Authors: Huy Q. Le, Loc X. Nguyen, Yu Qiao, Seong Tae Kim, Eui-Nam Huh, Choong Seon Hong Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06795v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

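Scoring rationale: The paper studies domain shift in federated learning and proposes FedDAP, which builds domain-aware prototypes. It is only partially related to the keyword 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points) because it touches on domain adaptation, but its focus is federated learning rather than large-model techniques. The remaining keywords concern large models, deep learning methodology, or scientific applications that the paper does not address, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the drop in global model performance caused by domain shift across client data in federated learning. It proposes FedDAP, which builds domain-aware prototypes and uses a dual alignment mechanism to improve generalization in cross-domain settings.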

Abstract

Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning, which leverages class-wise feature representations, has emerged as a promising solution. Yet, existing methods face two key limitations: (1) Existing prototype-based FL methods typically construct a $\textit{single global prototype}$ per class by aggregating local prototypes from all clients without preserving domain information. (2) Current feature-prototype alignment is $\textit{domain-agnostic}$, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three datasets (DomainNet, Office-10, and PACS) to demonstrate the effectiveness of our proposed framework in addressing domain shift. The code is available at https://github.com/quanghuy6997/FedDAP.

Keywords: Federated Learning, Domain Shift, Prototype Learning, Domain-Aware Prototypes, Feature-Prototype Alignment, Cross-domain Generalization, Privacy-sensitive Applications
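
The abstract above describes building one global prototype per domain and class through a similarity-weighted fusion of client prototypes, but it does not give the formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: the function names, the use of cosine similarity to the group mean, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def fuse_prototypes(local_protos, temperature=1.0):
    """Fuse client prototypes into one prototype per (domain, class) pair.

    local_protos: list of (domain, class_label, vector) tuples uploaded by clients.
    Assumption: each client prototype is weighted by a softmax over its cosine
    similarity to the unweighted mean of its own (domain, class) group.
    """
    groups = defaultdict(list)
    for domain, label, vec in local_protos:
        groups[(domain, label)].append(vec)

    fused = {}
    for key, vecs in groups.items():
        dim = len(vecs[0])
        mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        sims = [cosine(v, mean) / temperature for v in vecs]
        top = max(sims)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        fused[key] = [sum(w * v[i] for w, v in zip(weights, vecs)) for i in range(dim)]
    return fused


if __name__ == "__main__":
    protos = [
        ("sketch", 0, [1.0, 0.0]),
        ("sketch", 0, [0.8, 0.2]),
        ("photo", 0, [0.0, 2.0]),
    ]
    for key, proto in fuse_prototypes(protos).items():
        print(key, proto)
```

Keying the output by `(domain, class_label)` rather than by class alone is what keeps domain information from being averaged away, which is the first limitation the abstract raises.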

99. ❌ Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development

Authors: Xinchen Wang, Ruida Hu, Cuiyun Gao, Pengfei Gao, Chao Peng Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06793v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

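Scoring rationale: The paper's core is evaluating the use of large language models (LLMs) for generating and understanding software documentation, so it directly matches the 'Large Language Models' keyword (10 points). It mentions SWE-Agent, an LLM agent system, so 'LLM Agents' scores 5 points. Other keywords, such as MoE, SLMs, training methods, inference optimization, and AI-for-science applications, are neither mentioned in nor relevant to the abstract, so they score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the shortcomings of existing software documentation evaluation methods in repository-level assessment and reliability. It proposes a new benchmark, SWD-Bench, which uses functionality-driven question answering tasks to evaluate how well LLMs can understand and implement functionality from documentation. Experiments show that high-quality documentation substantially raises the agent's issue-solving rate.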

Abstract

Software documentation is crucial for repository comprehension. While Large Language Models (LLMs) advance documentation generation from code snippets to entire repositories, existing benchmarks have two key limitations: (1) they lack a holistic, repository-level assessment, and (2) they rely on unreliable evaluation strategies, such as LLM-as-a-judge, which suffers from vague criteria and limited repository-level knowledge. To address these issues, we introduce SWD-Bench, a novel benchmark for evaluating repository-level software documentation. Inspired by documentation-driven development, our strategy evaluates documentation quality by assessing an LLM’s ability to understand and implement functionalities using the documentation, rather than by directly scoring it. This is measured through function-driven Question Answering (QA) tasks. SWD-Bench comprises three interconnected QA tasks: (1) Functionality Detection, to determine if a functionality is described; (2) Functionality Localization, to evaluate the accuracy of locating related files; and (3) Functionality Completion, to measure the comprehensiveness of implementation details. We construct the benchmark, containing 4,170 entries, by mining high-quality Pull Requests and enriching them with repository-level context. Experiments reveal limitations in current documentation generation methods and show that source code provides complementary value. Notably, documentation from the best-performing method improves the issue-solving rate of SWE-Agent by 20.00%, which demonstrates the practical value of high-quality documentation in supporting documentation-driven development.

Keywords: Software Documentation, Large Language Models, Repository-level Evaluation, Question Answering, Benchmark, SWD-Bench, Documentation-driven Development, SWE-Agent

100. ❌ FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling

Authors: Shivanshu Shekhar, Sagnik Mukherjee, Jia Yi Zhang, Tong Zhang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06779v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

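Scoring rationale: The paper focuses on inference-time alignment for diffusion models and is unrelated to most keywords. The only relevant keyword is 'Alignment' (weight 1.0): the paper explicitly proposes 'FVD: Inference-Time Alignment of Diffusion Models', which is its core contribution, so it receives 10 points. Other keywords such as LLMs, MoE, and RLHF are not covered. The weighted total is computed from the alignment keyword alone.

!!! tip deepseek-chat TL;DR

The paper proposes FVD (Fleming-Viot Diffusion), an inference-time alignment method that addresses the diversity collapse common in SMC-based diffusion samplers. It substantially improves performance on image generation tasks and speeds up inference.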

Abstract

We introduce Fleming-Viot Diffusion (FVD), an inference-time alignment method that resolves the diversity collapse commonly observed in Sequential Monte Carlo (SMC) based diffusion samplers. Existing SMC-based diffusion samplers often rely on multinomial resampling or closely related resampling schemes, which can still reduce diversity and lead to lineage collapse under strong selection pressure. Inspired by Fleming-Viot population dynamics, FVD replaces multinomial resampling with a specialized birth-death mechanism designed for diffusion alignment. To handle cases where rewards are only approximately available and naive rebirth would collapse deterministic trajectories, FVD integrates independent reward-based survival decisions with stochastic rebirth noise. This yields flexible population dynamics that preserve broader trajectory support while effectively exploring reward-tilted distributions, all without requiring value function approximation or costly rollouts. FVD is fully parallelizable and scales efficiently with inference compute. Empirically, it achieves substantial gains across settings: on DrawBench it outperforms prior methods by 7% in ImageReward, while on class-conditional tasks it improves FID by roughly 14-20% over strong baselines and is up to 66 times faster than value-based approaches.

Keywords: diffusion models, inference-time alignment, Fleming-Viot resampling, diversity collapse, Sequential Monte Carlo, reward-based survival, population dynamics, image generation
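
The abstract above contrasts multinomial resampling with a birth-death mechanism that combines independent reward-based survival decisions with stochastic rebirth noise. The sketch below illustrates that general idea on a toy particle population. It is not the FVD algorithm itself: the survival probability `exp(strength * (reward - best_reward))`, the uniform choice of parent, the Gaussian rebirth noise, and the names `strength` and `noise_std` are all assumptions chosen for illustration.

```python
import math
import random


def birth_death_step(particles, rewards, strength=1.0, noise_std=0.1, rng=None):
    """One illustrative birth-death resampling step over a particle population.

    particles: list of state vectors (lists of floats).
    rewards:   one (possibly approximate) reward per particle.
    Each particle survives independently with probability
    exp(strength * (reward - best_reward)), so the best particle always survives.
    A particle that dies is reborn as a noisy copy of a uniformly chosen survivor.
    """
    rng = rng or random.Random()
    best = max(rewards)
    survives = [rng.random() < math.exp(strength * (r - best)) for r in rewards]
    survivor_ids = [i for i, alive in enumerate(survives) if alive]

    next_population = []
    for i, state in enumerate(particles):
        if survives[i]:
            # Survivors keep their own trajectory, which preserves lineages.
            next_population.append(list(state))
        else:
            # Rebirth noise keeps the copy from being an exact duplicate of its parent.
            parent = particles[rng.choice(survivor_ids)]
            next_population.append([x + rng.gauss(0.0, noise_std) for x in parent])
    return next_population


if __name__ == "__main__":
    population = [[0.0], [1.0], [2.0], [3.0]]
    rewards = [0.1, 0.2, 1.5, 1.4]
    print(birth_death_step(population, rewards, rng=random.Random(0)))
```

Unlike multinomial resampling, which redraws the whole population and can replace many particles with copies of a few high-weight ones, this step only replaces particles that fail their own survival test, so the population size stays fixed and surviving lineages are left untouched.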

101. ❌ Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

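Scoring rationale: The paper proposes MSPA-CQR, whose core is multi-faceted preference alignment via Direct Preference Optimization (DPO), so it is highly relevant to 'Direct Preference Optimization' (10 points). The work falls under 'Instruction Tuning/Alignment' (10 points) because it focuses on preference alignment. It uses 'Post-training/Supervised Fine-tuning' (8 points) for model optimization. The paper builds on large models (8 points) for conversational query rewriting and involves retrieval-augmented generation (8 points) and self-consistency-based improvement (8 points). Other keywords, such as MoE, quantization, and AI for science, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the neglect of multi-faceted feedback in conversational query rewriting. It proposes MSPA-CQR, a method based on multi-faceted self-consistent preference alignment, in which DPO-based optimization substantially improves rewriting quality.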

Abstract

Conversational Query Rewriting (CQR) aims to rewrite ambiguous queries to achieve more efficient conversational search. Early studies have predominantly focused on the rewriting in isolation, ignoring the feedback from query rewrite, passage retrieval and response generation in the rewriting process. To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR). Specifically, we first construct self-consistent preference alignment data from three dimensions (rewriting, retrieval, and response) to generate more diverse rewritten queries. Then we propose prefix guided multi-faceted direct preference optimization to learn preference information from three different dimensions. The experimental results show that our MSPA-CQR is effective in both in- and out-of-distribution scenarios.

Keywords: Conversational Query Rewriting, Preference Alignment, Direct Preference Optimization, Multi-faceted Learning, Self-consistent Data, Retrieval-Augmented Generation, Query Rewriting, Conversational Search

102. ❌ FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts

Authors: Guillermo Gil de Avalle, Laura Maruster, Eric Sloot, Christos Emmanouilidis Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06770v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

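Scoring rationale: The paper studies techniques for extracting procedural knowledge from flowcharts using computer vision methods such as YOLOv8 and EasyOCR. It does not involve large models, deep learning methodology, or scientific applications. It has only a weak link to 'AI for Science' (industrial maintenance can be viewed as an application domain), and all other keywords are irrelevant.

!!! tip deepseek-chat TL;DR

The paper proposes FlowExtract, a system that extracts queryable procedural knowledge from industrial maintenance flowcharts. By separating element detection from connectivity reconstruction, it substantially outperforms vision-language model baselines.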

Abstract

Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.

Keywords: procedural knowledge extraction, maintenance flowcharts, directed graphs, YOLOv8, EasyOCR, edge detection, industrial troubleshooting, vision-language models

103. ❌ TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks

Authors: Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu, Chanjin Zheng Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

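Scoring rationale: The paper's core is a multi-LLM collaboration framework (TeamLLM) for multi-step contextualized tasks. It is highly relevant to 'Large Language Models' (10 points) and involves 'LLM Agents' and 'Multi-agent Systems' (10 points each). The tasks require multi-step reasoning, which is highly relevant to 'Chain of Thought' (10 points), and 'System 2 Thinking' is partially related (5 points). The remaining keywords, such as MoE, SFT, and RAG, are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper observes that existing multi-LLM frameworks lack the role division of human teams, which leads to a single perspective. It proposes TeamLLM, a framework that uses four distinct roles and three-phase collaboration to substantially improve performance on multi-step contextualized tasks.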

Abstract

Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at https://anonymous.4open.science/r/TeamLLM-anonymous-C50E/.

Keywords: TeamLLM, multi-LLM collaboration, human-like team roles, multi-step contextualized tasks, CGPST benchmark, process-oriented evaluation, agent coordination, contextual grounding

104. ❌ Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Authors: Ruida Hu, Xinchen Wang, Chao Peng, Cuiyun Gao, David Lo Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06742v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

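Scoring rationale: The paper's core is the use of LLMs for software generation, specifically evaluating how well LLM agents can generate complete CLI tools from scratch. It is therefore highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). Other keywords, such as MoE, quantization, inference acceleration, and alignment techniques, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how well LLM agents can generate complete command-line tools from scratch and introduces the CLI-Tool-Bench benchmark for evaluation. It finds that top models achieve a success rate below 43% and that higher token consumption does not guarantee better performance.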

Abstract

Large Language Models (LLMs) are driving a shift towards intent-driven development, where agents build complete software from scratch. However, existing benchmarks fail to assess this 0-to-1 generation capability due to two limitations: reliance on predefined scaffolds that ignore repository structure planning, and rigid white-box unit testing that lacks end-to-end behavioral validation. To bridge this gap, we introduce CLI-Tool-Bench, a structure-agnostic benchmark for evaluating the ground-up generation of Command-Line Interface (CLI) tools. It features 100 diverse real-world repositories evaluated via a black-box differential testing framework. Agent-generated software is executed in sandboxes, comparing system side effects and terminal outputs against human-written oracles using multi-tiered equivalence metrics. Evaluating seven state-of-the-art LLMs, we reveal that top models achieve under 43% success, highlighting the ongoing challenge of 0-to-1 generation. Furthermore, higher token consumption does not guarantee better performance, and agents tend to generate monolithic code.

Keywords: Large Language Models, LLM Agents, Software Generation, CLI Tools, Benchmark Evaluation, 0-to-1 Generation, Intent-driven Development, Differential Testing
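
The abstract above describes black-box differential testing: a generated tool and a human-written oracle are run on the same input, and their terminal output and system side effects are compared under multi-tiered equivalence metrics. The sketch below shows the general shape of such a check. It is an assumption-laden illustration, not the CLI-Tool-Bench harness: the function names and the two tiers ("exact" and "whitespace-equivalent") are hypothetical, and a temporary directory stands in for a real sandbox.

```python
import os
import subprocess
import sys
import tempfile


def run_in_sandbox(argv):
    """Run argv in a fresh temporary directory and capture its observable behavior.

    Returns (exit_code, stdout, files), where files maps relative paths to bytes.
    Note: a temporary directory only isolates file output; it is not a security sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(argv, cwd=workdir, capture_output=True, text=True, timeout=30)
        files = {}
        for root, _dirs, names in os.walk(workdir):
            for name in names:
                path = os.path.join(root, name)
                with open(path, "rb") as handle:
                    files[os.path.relpath(path, workdir)] = handle.read()
        return proc.returncode, proc.stdout, files


def equivalence_tier(candidate_argv, oracle_argv):
    """Compare a candidate command against an oracle command without inspecting source.

    Hypothetical tiers, strictest first:
      "exact"                 - same exit code, same files, identical stdout
      "whitespace-equivalent" - same exit code, same files, stdout equal up to whitespace
      "mismatch"              - anything else
    """
    c_code, c_out, c_files = run_in_sandbox(candidate_argv)
    o_code, o_out, o_files = run_in_sandbox(oracle_argv)
    if c_code != o_code or c_files != o_files:
        return "mismatch"
    if c_out == o_out:
        return "exact"
    if c_out.split() == o_out.split():
        return "whitespace-equivalent"
    return "mismatch"


if __name__ == "__main__":
    oracle = [sys.executable, "-c", "print('hello world')"]
    candidate = [sys.executable, "-c", "print('hello   world ')"]
    print(equivalence_tier(candidate, oracle))
```

Because only exit codes, output, and created files are compared, the check makes no assumption about how the candidate organizes its code, which is what lets a benchmark of this kind stay structure-agnostic.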

105. ❌ Luwen Technical Report

Authors: Yiquan Wu, Yuhang Liu, Yifei Liu, Ang Li, Siying Zhou, Kun Kuang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

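Scoring rationale: The paper explicitly concerns the application of large language models (LLMs) to a specific domain and is highly relevant to keyword 1 (10 points). It uses continual pre-training and supervised fine-tuning, which are highly relevant to keywords 5 and 6 (10 points each). It integrates retrieval-augmented generation (RAG), which is highly relevant to keyword 10 (10 points). Other keywords, such as MoE, SLMs, RLHF, and quantization, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how to adapt a general-purpose large language model to the legal domain. Using continual pre-training, supervised fine-tuning, and retrieval-augmented generation, it builds Luwen, a Chinese legal language model that outperforms baseline models on multiple legal tasks.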

Abstract

Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present Luwen, an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate Luwen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that Luwen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.

Keywords: Large language models, Legal domain, Continual pre-training, Supervised fine-tuning, Retrieval-augmented generation, Chinese legal language model, Legal tasks, Domain adaptation

106. ❌ URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Authors: Zhenyu Wang, Weichen Cheng, Weijia Li, Junjie Mou, Zongyou Zhao, Guoying Zhang Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06728v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

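Scoring rationale: The paper focuses on multimodal sarcasm detection (MSD) and proposes an uncertainty-aware fusion framework (URMF). Although it uses multimodal fusion and uncertainty modeling, its content has no direct connection to any of the scoring keywords, which mainly cover the technical principles, training methods, inference optimization, alignment, compression, and agent systems of large models. The paper does not propose innovations in large-model or deep learning principles (MLLMs appear only as comparison baselines), nor does it contribute to applications of large models in other domains, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper addresses uneven modality reliability in multimodal sarcasm detection by proposing Uncertainty-aware Robust Multimodal Fusion (URMF), which models modality uncertainty to dynamically adjust fusion weights and achieves better accuracy and robustness than existing methods on public benchmarks.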

Abstract (Chinese translation)

多模态讽刺检测旨在通过文本与图像间的语义不协调性识别讽刺意图。尽管现有方法通过跨模态交互与不协调推理提升了检测性能，但它们通常默认所有模态具有同等可靠性。然而在实际社交媒体场景中，文本内容可能具有歧义性，视觉内容可能关联微弱甚至完全无关，这种确定性融合机制会引入噪声证据并削弱推理的鲁棒性。为解决该问题，我们提出不确定性感知的鲁棒多模态融合框架（URMF），这是一种在交互与融合过程中显式建模模态可靠性的统一架构。该框架首先采用多头交叉注意力机制将视觉证据注入文本表征，随后在融合语义空间中进行多头自注意力计算以增强不协调感知推理。接着，通过将文本、图像及交互感知隐表征分别参数化为可学习的高斯后验分布，实现对单模态偶然不确定性（aleatoric uncertainty）的统一建模。估计得到的不确定性进一步用于动态调节融合过程中的模态贡献度，抑制不可靠模态并生成更鲁棒的联合表征。此外，我们设计了融合任务监督、模态先验正则化、跨模态分布对齐及不确定性驱动自采样对比学习的联合训练目标。在公开多模态讽刺检测基准上的实验表明，URMF持续优于强单模态、多模态及基于多模态大语言模型（MLLM）的基线方法，验证了不确定性感知融合对提升检测精度与鲁棒性的有效性。

Abstract

Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.

Keywords: Multimodal Sarcasm Detection, Uncertainty-aware Fusion, Cross-modal Interaction, Aleatoric Uncertainty, Robust Reasoning, Multimodal Fusion, Modality Reliability, Self-sampling Contrastive Learning
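
The abstract above says that the estimated uncertainty regulates how much each modality contributes during fusion. One common way to realize that idea is inverse-variance (precision) weighting, sketched below. This is an illustration under that assumption, not the URMF fusion rule, which the abstract does not specify; the function name `fuse_by_precision` and the toy vectors are hypothetical.

```python
from typing import List, Tuple


def fuse_by_precision(
    means: List[List[float]], variances: List[float]
) -> Tuple[List[float], List[float]]:
    """Fuse per-modality mean vectors with weights proportional to 1/variance.

    means:     one mean vector per modality (all of the same length)
    variances: one positive scalar variance per modality (assumed summary
               of that modality's uncertainty)
    """
    if len(means) != len(variances) or not means:
        raise ValueError("need one variance per modality")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Higher variance -> lower precision -> smaller fusion weight.
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    weights = [p / total for p in precisions]

    dim = len(means[0])
    fused = [sum(w * m[i] for w, m in zip(weights, means)) for i in range(dim)]
    return fused, weights


# Toy example: text is confident, image is very uncertain, interaction is in between.
fused, weights = fuse_by_precision(
    means=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    variances=[0.1, 10.0, 0.5],
)
print(weights)  # the image modality receives the smallest weight
print(fused)
```

With these toy inputs the uncertain image modality is almost entirely suppressed, which matches the behavior the abstract describes for unreliable modalities.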

107. ❌ The Traveling Thief Problem with Time Windows: Benchmarks and Heuristics

Authors: Helen Yuliana Angmalisang, Frank Neumann Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06724v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

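Scoring rationale: The paper studies the traveling thief problem (TTP) with time window constraints, a combinatorial optimization problem from classical operations research. It does not involve large models, deep learning, AI technical principles, or AI-for-science applications, and it has no connection to any of the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper studies the traveling thief problem with time window constraints, proposes a new heuristic, and introduces a set of benchmark instances. Experiments show that the new algorithm outperforms the other approaches on a wide range of instances.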

Abstract (Chinese translation)

传统优化问题往往被孤立研究，而当今许多现实问题需要多个优化组件间的相互依赖。旅行窃贼问题（Traveling Thief Problem, TTP）是一个在文献中被广泛研究的多组件问题。本文引入并研究了带时间窗约束的TTP，该变体与现实场景高度相关——物品只能在特定时间区间内被收集。我们考察了现有TTP方法及带时间窗的旅行商问题（Traveling Salesperson Problem, TSP）方法在此新问题上的适应性，并评估了它们的性能。此外，我们提出了一种针对带时间窗TTP的新启发式方法。为评估带时间窗TTP的算法，我们基于文献中已有的TTP实例，引入了一套新的带时间窗TTP基准测试集。实验研究评估了不同方法，结果表明新设计的算法在广泛的基准测试实例上优于其他方法。

Abstract

While traditional optimization problems were often studied in isolation, many real-world problems today require interdependence among multiple optimization components. The traveling thief problem (TTP) is a multi-component problem that has been widely studied in the literature. In this paper, we introduce and investigate the TTP with time window constraints which provides a TTP variant highly relevant to real-world situations where good can only be collected at given time intervals. We examine adaptions of existing approaches for TTP and the Traveling Salesperson Problem (TSP) with time windows to this new problem and evaluate their performance. Furthermore, we provide a new heuristic approach for the TTP with time windows. To evaluate algorithms for TTP with time windows, we introduce new TTP benchmark instances with time windows based on TTP instances existing in the literature. Our experimental investigations evaluate the different approaches and show that the newly designed algorithm outperforms the other approaches on a wide range of benchmark instances.

Keywords: Traveling Thief Problem, time windows, multi-component optimization, heuristic algorithm, benchmark instances, TSP with time windows, interdependent optimization

108. ❌ Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

Authors: Hong Yi Lin, Chunhua Liu, Haoyu Gao, Patanamon Thongtanunam, Christoph Treude Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06723v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

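Scoring rationale: The paper's core topic is confidence calibration of LLMs in code revision tasks. It is highly relevant to "Large Language Models" (10 points) because the whole paper concerns the application of LLMs in software engineering. It is partly relevant to "Hallucination Mitigation" and "Mechanistic Interpretability" (5 points each) because confidence calibration aims to reduce incorrect outputs and improve model interpretability. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses unreliable confidence calibration of LLMs in automated code revision by applying local Platt-scaling to fine-grained confidence scores. Experiments show consistently lower calibration error, which makes model outputs more trustworthy.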

Abstract (Chinese translation)

在当今人工智能辅助软件工程领域，开发者日益依赖能力强大但本质上并不完美的大型语言模型。这些模型产生错误输出的倾向可能降低开发者的工作效率。为此，一种典型的缓解方法是提供经过校准的置信度分数，使其在实例层面真实反映输出正确的可能性。此类信息使用户能够就输出接受度做出即时决策，规避易错输出，并使其预期与模型能力更好对齐。由于经过后训练的大型语言模型本身无法产生校准良好的置信度分数，研究者已开发出多种事后校准方法，其中序列级置信度分数的全局普拉特缩放方法已在许多生成式软件工程任务中被证明有效，但在自动代码修订任务（如程序修复、漏洞修复和代码优化）中仍存在不可靠或未被充分探索的问题。我们假设这种传统方法的粗粒度特性使其不适用于自动代码修订任务，因为此类任务的正确性通常取决于局部编辑决策，且错误校准可能具有样本依赖性，这促使我们探索细粒度置信度校准方法。为此，本研究提出将局部普拉特缩放分别应用于三种不同的细粒度置信度分数。通过对3项独立任务与正确性指标、以及14种不同规模模型的实验，我们发现细粒度置信度分数能在更广泛的概率区间内持续获得更低的校准误差，且当应用全局普拉特缩放时，这种效果会进一步增强。我们提出的方法为获取良好校准的置信度分数提供了实用解决方案，使得在自动代码修订任务中能够更可信、更高效地使用不完美的模型。

Abstract

In today’s AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. To this end, a canonical mitigation method is to provide calibrated confidence scores that faithfully reflect their likelihood of correctness at the instance-level. Such information allows users to make immediate decisions regarding output acceptance, abstain error-prone outputs, and better align their expectations with the model’s capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods, with global Platt-scaling of sequence-level confidence scores proving effective in many generative software engineering tasks but remaining unreliable or unexplored for automated code revision (ACR) tasks such as program repair, vulnerability repair, and code refinement. We hypothesise that the coarse-grained nature of this conventional method makes it ill-suited for ACR tasks, where correctness is often determined by local edit decisions and miscalibration can be sample-dependent, thereby motivating fine-grained confidence calibration. To address this, our study proposes local Platt-scaling applied separately to three different fine-grained confidence scores. Through experiments across 3 separate tasks and correctness metrics, as well as 14 different models of various sizes, we find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals, and this effect is further amplified when global Platt-scaling is applied. Our proposed approaches offer a practical solution to eliciting well-calibrated confidence scores, enabling more trustworthy and streamlined usage of imperfect models in ACR tasks.

Keywords: LLM confidence calibration, automated code revision, fine-grained calibration, Platt-scaling, calibration error, program repair, software engineering, post-hoc calibration
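
For readers unfamiliar with the technique named in the abstract above, Platt scaling fits a two-parameter sigmoid, p = sigmoid(a * s + b), that maps a raw confidence score s to a calibrated probability. The sketch below shows plain (global) Platt scaling only, fitted by gradient descent on log-loss. It does not reproduce the paper's local or fine-grained variants, and the helper names (`fit_platt`, `log_loss`) and the toy data are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(
    scores: List[float], labels: List[int], lr: float = 0.1, steps: int = 5000
) -> Tuple[float, float]:
    """Fit p = sigmoid(a * s + b) by gradient descent on the mean log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def log_loss(probs: List[float], labels: List[int]) -> float:
    """Mean negative log-likelihood of binary labels under the given probabilities."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
    return total / len(labels)


# Toy data: an overconfident model (high raw scores even on wrong outputs).
raw_scores = [0.95, 0.90, 0.92, 0.85, 0.97, 0.60, 0.70, 0.88]
correct = [1, 0, 1, 0, 1, 0, 0, 1]

a, b = fit_platt(raw_scores, correct)
calibrated = [sigmoid(a * s + b) for s in raw_scores]
print("log-loss before:", log_loss(raw_scores, correct))
print("log-loss after: ", log_loss(calibrated, correct))
```

In practice the parameters would be fitted on a held-out calibration set rather than on the data being evaluated; the toy example fits and evaluates on the same eight points only to keep the sketch short.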

109. ❌ HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation

Authors: Md Aminur Hossain, Ayush V. Patel, Siddhant Gole, Sanjay K. Singh, Biplab Banerjee Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

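Scoring rationale: The paper proposes HQF-Net, a hybrid quantum-classical multi-scale fusion network for remote sensing image segmentation. It includes a Quantum bottleneck with Mixture-of-Experts (QMoE) module that combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism, which is highly relevant to the keyword "Mixture of Experts OR MoE OR Sparse Models" (8 points). The paper also belongs to remote sensing image analysis, which can be viewed as an application of "AI for Science" to remote sensing science (8 points). The remaining keywords mainly concern large language models (LLMs) and related techniques (training, alignment, inference optimization, agents, and so on), whereas this paper focuses on image segmentation in computer vision and involves no language model or related technique, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes HQF-Net, a hybrid quantum-classical multi-scale fusion network that integrates quantum-enhanced skip connections and a Quantum bottleneck with Mixture-of-Experts module to improve remote sensing semantic segmentation, and reports consistent improvements on three benchmark datasets.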

Abstract (Chinese translation)

遥感语义分割需要能够同时捕捉复杂场景中精细空间细节与高层语义上下文的模型。尽管经典的编码器-解码器架构（如U-Net）仍是强基准模型，但它们往往难以充分利用全局语义和结构化特征交互。本文提出HQF-Net，一种用于遥感图像分割的混合量子-经典多尺度融合网络。该网络通过可变形多尺度交叉注意力融合模块，将冻结的DINOv3 ViT-L/16骨干网络提取的多尺度语义引导与定制化U-Net架构相集成。为增强特征优化，该框架进一步引入量子增强跳跃连接与混合专家量子瓶颈模块，后者通过自适应路由机制融合了互补的局部、全局和定向量子电路。在三个遥感基准数据集上的实验表明，所提设计带来了持续的性能提升：HQF-Net在LandCover.ai数据集上达到0.8568平均交并比和96.87%总体精度，在OpenEarthMap数据集上达到71.82%平均交并比，在SeasoNet数据集上达到55.28%平均交并比及99.37%总体精度。架构消融实验进一步验证了各主要组件的贡献。这些结果表明，在近期量子计算资源受限条件下，结构化的混合量子-经典特征处理是提升遥感语义分割性能的可行方向。

Abstract

Remote sensing semantic segmentation requires models that can jointly capture fine spatial details and high-level semantic context across complex scenes. While classical encoder-decoder architectures such as U-Net remain strong baselines, they often struggle to fully exploit global semantics and structured feature interactions. In this work, we propose HQF-Net, a hybrid quantum-classical multi-scale fusion network for remote sensing image segmentation. HQF-Net integrates multi-scale semantic guidance from a frozen DINOv3 ViT-L/16 backbone with a customized U-Net architecture through a Deformable Multiscale Cross-Attention Fusion (DMCAF) module. To enhance feature refinement, the framework further introduces quantum-enhanced skip connections (QSkip) and a Quantum bottleneck with Mixture-of-Experts (QMoE), which combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism. Experiments on three remote sensing benchmarks show consistent improvements with the proposed design. HQF-Net achieves 0.8568 mIoU and 96.87% overall accuracy on LandCover.ai, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. An architectural ablation study further confirms the contribution of each major component. These results show that structured hybrid quantum-classical feature processing is a promising direction for improving remote sensing semantic segmentation under near-term quantum constraints.

Keywords: Hybrid Quantum-Classical, Multi-scale Fusion, Remote Sensing Image Segmentation, Quantum-enhanced Skip Connections, Quantum bottleneck with Mixture-of-Experts, Semantic Segmentation, Deformable Multiscale Cross-Attention Fusion, DINOv3 ViT-L/16

110. ❌ ATANT: An Evaluation Framework for AI Continuity

Authors: Samuel Sameer Tanguturi Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06710v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

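Scoring rationale: The paper proposes the ATANT framework for evaluating continuity in AI systems (memory and context management). It is highly relevant to RAG (Retrieval-Augmented Generation) (8 points) because the paper explicitly names RAG pipelines as an existing memory component. It is partly relevant to LLMs/Foundation Models (5 points) and Long Context LLMs (5 points) because the framework can be applied to systems that include these technologies, although the paper itself does not focus on model techniques. Other keywords (such as MoE, SFT, and RLHF) have no direct relation to the evaluation framework and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes ATANT, an evaluation framework for measuring continuity in AI systems, that is, the ability to persist, update, and reconstruct meaningful context across time. On a test corpus of 250 stories, the reference implementation improves from 58% to 96-100% accuracy.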

Abstract (Chinese translation)

我们提出ATANT（Automated Test for Acceptance of Narrative Truth，叙事真实性验收自动化测试），这是一个用于衡量人工智能系统连续性（即跨时间维持、更新、消歧和重构有意义语境的能力）的开放式评估框架。尽管AI行业已开发出多种记忆组件（如RAG流水线、向量数据库、长上下文窗口、用户画像层），但尚未有公开框架正式定义或评估这些组件是否产生真正的连续性。我们将连续性定义为包含7项必要特性的系统属性，引入无需在评估回路中使用大语言模型的10检查点评估方法，并构建了一个包含250个故事的叙事测试语料库，涵盖6个生活领域共计1,835个验证问题。通过对参考实现进行5轮测试套件迭代评估，其表现从初始架构的58%逐步提升至独立模式（250个故事）的100%，50个故事累积模式的100%，以及在250个故事累积规模下达到96%。累积测试结果是核心衡量标准：当250个独立的人生叙事共存于同一数据库时，系统必须为正确语境检索准确事实且避免交叉污染。ATANT具有系统无关性和模型独立性，被设计为一套用于构建与验证连续性系统的分步骤方法论。框架规范、示例故事及评估协议详见https://github.com/Kenotic-Labs/ATANT。完整的250个故事语料库将逐步发布。

Abstract

We present ATANT (Automated Test for Acceptance of Narrative Truth), an open evaluation framework for measuring continuity in AI systems: the ability to persist, update, disambiguate, and reconstruct meaningful context across time. While the AI industry has produced memory components (RAG pipelines, vector databases, long context windows, profile layers), no published framework formally defines or measures whether these components produce genuine continuity. We define continuity as a system property with 7 required properties, introduce a 10-checkpoint evaluation methodology that operates without an LLM in the evaluation loop, and present a narrative test corpus of 250 stories comprising 1,835 verification questions across 6 life domains. We evaluate a reference implementation across 5 test suite iterations, progressing from 58% (legacy architecture) to 100% in isolated mode (250 stories) and 100% in 50-story cumulative mode, with 96% at 250-story cumulative scale. The cumulative result is the primary measure: when 250 distinct life narratives coexist in the same database, the system must retrieve the correct fact for the correct context without cross-contamination. ATANT is system-agnostic, model-independent, and designed as a sequenced methodology for building and validating continuity systems. The framework specification, example stories, and evaluation protocol are available at https://github.com/Kenotic-Labs/ATANT. The full 250-story corpus will be released incrementally.

Keywords: continuity evaluation, AI systems, memory components, RAG pipelines, long context windows, narrative test corpus, context retrieval, system-agnostic framework

111. ❌ Reasoning Fails Where Step Flow Breaks

Authors: Xiaoyu Xu, Yulan Pan, Xiaosong Yuan, Zhihong Shen, Minghao Su, Yuanhao Su, Xiaofeng Zhang Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06695v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

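Scoring rationale: The paper studies the multi-step reasoning behavior of large reasoning models (LRMs) on math, science, and coding tasks, focusing on information-flow failure modes during reasoning (Shallow Lock-in and Deep Decay) and how to repair them. Highly relevant keywords: "Large Language Models" (the paper explicitly studies LRMs), "Chain of Thought" (it analyzes multi-step reasoning traces), "System 2 Thinking" (it involves analysis of in-depth reasoning), and "Mechanistic Interpretability" (it proposes the Step-Saliency tool for interpretability analysis). "Self-Correction" and "AI for Science" are partly relevant because the paper repairs reasoning performance and evaluates on science tasks. The other keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper identifies two information-flow failure modes in large reasoning models on multi-step reasoning tasks (Shallow Lock-in and Deep Decay) and proposes StepFlow, a test-time intervention that requires no retraining, to repair them, which improves accuracy on math, science, and coding tasks.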

Abstract (Chinese translation)

生成长链思维的大型推理模型（LRMs）目前在多步骤数学、科学和编程任务中表现良好。然而，其行为仍不稳定且难以解释，现有分析工具难以处理此类长而结构化的推理轨迹。我们提出步骤显著性（Step-Saliency）方法，该方法将注意力-梯度分数沿问题-思考-总结轨迹汇聚为步骤间映射图。在多个模型中，步骤显著性揭示了两种反复出现的信息流故障：浅层锁定（Shallow Lock-in），即浅层网络过度聚焦当前步骤而几乎忽略早期上下文；以及深层衰减（Deep Decay），即深层网络对思考片段的显著性逐渐丧失，且总结部分越来越多地关注自身及最后几个步骤。受这些模式启发，我们提出StepFlow——一种基于显著性启发的测试时干预方法，它通过等概率桥接（Odds-Equal Bridge）调整步骤显著性测量的浅层模式，并通过步骤动量注入（Step Momentum Injection）在深层网络添加微小的步骤级残差。StepFlow在无需重新训练的情况下，提升了多种大型推理模型在数学、科学和编程任务上的准确率，这表明修复信息流能够恢复其缺失的部分推理性能。

Abstract

Large reasoning models (LRMs) that generate long chains of thought now perform well on multi-step math, science, and coding tasks. However, their behavior is still unstable and hard to interpret, and existing analysis tools struggle with such long, structured reasoning traces. We introduce Step-Saliency, which pools attention–gradient scores into step-to-step maps along the question–thinking–summary trajectory. Across several models, Step-Saliency reveals two recurring information-flow failures: Shallow Lock-in, where shallow layers over-focus on the current step and barely use earlier context, and Deep Decay, where deep layers gradually lose saliency on the thinking segment and the summary increasingly attends to itself and the last few steps. Motivated by these patterns, we propose StepFlow, a saliency-inspired test-time intervention that adjusts shallow saliency patterns measured by Step-Saliency via Odds-Equal Bridge and adds a small step-level residual in deep layers via Step Momentum Injection. StepFlow improves accuracy on math, science, and coding tasks across multiple LRMs without retraining, indicating that repairing information flow can recover part of their missing reasoning performance.

Keywords: Large reasoning models, Step-Saliency, Information-flow failures, Shallow Lock-in, Deep Decay, StepFlow, Reasoning performance, Multi-step tasks

112. ❌ AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents

Authors: Yujun Cheng, Enfang Cui, Hao Qin, Zhiyuan Liang, Qi Xu Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06696v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

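Scoring rationale: The paper's core contribution is AgentGate, a routing engine for request dispatch in Internet of Agents systems. Highly relevant keywords: 1) "Small Language Models OR SLMs OR On-device AI" (10 points): the paper explicitly uses compact 3B-7B models and emphasizes resource-constrained deployment; 2) "Post-training OR Supervised Fine-tuning OR SFT" (10 points): it develops a routing-oriented fine-tuning scheme; 3) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points): it studies agent systems; 4) "Multi-agent Systems OR Agent Coordination" (10 points): it involves multi-agent planning and coordination; 5) "Large Language Models OR LLMs OR Foundation Models" (8 points): it builds on open-weight models. Other keywords such as MoE, Scaling Laws, and RLHF are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes AgentGate, a lightweight structured routing engine for efficient request dispatch in resource-constrained Internet of Agents deployments, and uses a routing-oriented fine-tuning scheme so that compact models (3B-7B) reach competitive performance on routing tasks.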

Abstract (Chinese translation)

人工智能代理系统的快速发展正在催生一个新兴的“代理互联网”，其中专业化的代理在本地设备、边缘节点、私有服务和云平台之间运行。尽管近期的研究已改进了代理的命名、发现与交互机制，但在延迟、隐私和成本约束下，高效请求调度仍是一个悬而未决的系统性问题。本文提出AgentGate，一个面向候选感知代理调度的轻量级结构化路由引擎。与将路由视为无限制文本生成不同，AgentGate将其建模为一个约束决策问题，并将其分解为两个阶段：行动决策与结构化落地。第一阶段决定查询应触发单代理调用、多代理规划、直接响应还是安全升级；第二阶段则将选定行动实例化为可执行输出，例如目标代理、结构化参数或多步骤计划。为使紧凑模型适应此场景，我们进一步开发了一种面向路由的微调方案，该方案结合了候选感知监督与困难负例。在基于多个3B至7B开放权重模型构建的路由基准测试中，实验表明紧凑模型在受限环境下能提供有竞争力的路由性能，且模型差异主要体现在行动预测、候选选择与结构化落地质量上。这些结果表明，结构化路由是实现高效且注重隐私的代理系统的可行设计点，尤其适用于必须在资源受限部署条件下做出路由决策的场景。

Abstract

The rapid development of AI agent systems is leading to an emerging Internet of Agents, where specialized agents operate across local devices, edge nodes, private services, and cloud platforms. Although recent efforts have improved agent naming, discovery, and interaction, efficient request dispatch remains an open systems problem under latency, privacy, and cost constraints. In this paper, we present AgentGate, a lightweight structured routing engine for candidate-aware agent dispatch. Instead of treating routing as unrestricted text generation, AgentGate formulates it as a constrained decision problem and decomposes it into two stages: action decision and structural grounding. The first stage determines whether a query should trigger single-agent invocation, multi-agent planning, direct response, or safe escalation, while the second stage instantiates the selected action into executable outputs such as target agents, structured arguments, or multi-step plans. To adapt compact models to this setting, we further develop a routing-oriented fine-tuning scheme with candidate-aware supervision and hard negative examples. Experiments on a curated routing benchmark with several 3B–7B open-weight models show that compact models can provide competitive routing performance in constrained settings, and that model differences are mainly reflected in action prediction, candidate selection, and structured grounding quality. These results indicate that structured routing is a feasible design point for efficient and privacy-aware agent systems, especially when routing decisions must be made under resource-constrained deployment conditions.

Keywords: Agent routing, Internet of Agents, structured routing engine, compact models, multi-agent planning, fine-tuning, resource-constrained deployment, agent dispatch
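
The two-stage decomposition described in the abstract (action decision, then structural grounding) can be sketched as below. This is a minimal illustration and not AgentGate's implementation: the `Action` enum, `Candidate`, `decide_action`, `ground`, and the keyword-overlap heuristic are all hypothetical stand-ins for the fine-tuned compact model.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SINGLE_AGENT = "single_agent"
    MULTI_AGENT_PLAN = "multi_agent_plan"
    DIRECT_RESPONSE = "direct_response"
    SAFE_ESCALATION = "safe_escalation"


@dataclass
class Candidate:
    name: str
    skills: frozenset  # hypothetical skill keywords advertised by the agent


def matching(query: str, candidates: list) -> list:
    """Candidates whose skills overlap the query words (toy heuristic)."""
    words = set(query.lower().split())
    return [c for c in candidates if c.skills & words]


def decide_action(query: str, candidates: list) -> Action:
    """Stage 1: pick one of the four actions (a hand-written rule, not a model)."""
    if "password" in query.lower():
        return Action.SAFE_ESCALATION
    hits = matching(query, candidates)
    if not hits:
        return Action.DIRECT_RESPONSE
    return Action.SINGLE_AGENT if len(hits) == 1 else Action.MULTI_AGENT_PLAN


def ground(action: Action, query: str, candidates: list) -> dict:
    """Stage 2: instantiate the chosen action as a structured output."""
    hits = matching(query, candidates)
    if action is Action.SINGLE_AGENT:
        return {"action": action.value, "agent": hits[0].name, "args": {"query": query}}
    if action is Action.MULTI_AGENT_PLAN:
        plan = [{"step": i + 1, "agent": c.name} for i, c in enumerate(hits)]
        return {"action": action.value, "plan": plan}
    return {"action": action.value}


def route(query: str, candidates: list) -> dict:
    return ground(decide_action(query, candidates), query, candidates)
```

Because the output is constrained to a small action set and a fixed schema, every routing decision can be validated before execution, which is the practical difference from routing by unrestricted text generation.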

113. ❌ KD-MARL: Resource-Aware Knowledge Distillation in Multi-Agent Reinforcement Learning

Authors: Monirul Islam Pavel, Siyi Hu, Muhammad Anwar Masum, Mahardhika Pratama, Ryszard Kowalczyk, Zehong Jimmy Cao Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06691v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

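Scoring rationale: The paper focuses on knowledge distillation in multi-agent reinforcement learning (MARL) and is unrelated to most keywords. It is highly relevant only to 'Multi-agent Systems OR Agent Coordination' (10 points), since its core topic is multi-agent coordination and resource-aware deployment. The remaining keywords cover large models, training techniques, and inference optimization, none of which the paper addresses.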
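
The relationship between the table columns and the pass threshold can be sketched as follows. The threshold formula (5.0 + 0.8 × total weight) comes from the report's configuration section; the rule that each row's score equals weight × relevance is an assumption inferred from the table rows, and both function names are hypothetical.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold as stated in the report configuration."""
    return base + factor * total_weight


def paper_score(rows) -> float:
    """Sum of per-keyword scores; assumes score = weight * relevance per row."""
    return sum(weight * relevance for weight, relevance in rows)


# With 27 keywords of weight 1.0 each, the threshold is about 26.6.
threshold = passing_score(27.0)

# A row with weight 0.0 contributes nothing, whatever its relevance.
score = paper_score([(0.0, 10.0), (0.0, 0.0)])
```

Under this assumed rule, a weight of 0.0 on every row yields a total of 0.0 even when individual relevance values are high.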

!!! tip deepseek-chat TL;DR

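The paper proposes KD-MARL, a resource-aware knowledge distillation framework that transfers coordinated behavior from a centralized expert policy to lightweight decentralized student agents. On the SMAC and MPE benchmarks it retains over 90% of expert performance while cutting computational cost by up to 28.6×.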

摘要翻译

多智能体强化学习（MARL）系统的实际部署从根本上受到有限的计算内存和推理时间的制约。尽管专家策略能够实现高性能，但它们依赖于昂贵的决策周期和大规模模型，这对于边缘设备或嵌入式平台而言并不实用。知识蒸馏（KD）为资源感知执行提供了一条有前景的路径，但现有的MARL知识蒸馏方法主要局限于动作模仿，往往忽视了协调结构，并假设智能体能力同质。我们提出了面向多智能体强化学习的资源感知知识蒸馏（KD-MARL），这是一个两阶段框架，能够将协调行为从集中式专家策略迁移到轻量化的去中心化学生智能体。学生策略的训练无需评论家网络，而是依赖于蒸馏的优势信号和结构化的策略监督，以在异构且受限的观测条件下保持协调。我们的方法不仅迁移了专家策略的动作层面行为，还迁移了其结构化的协调模式，同时支持异构的学生智能体架构，允许每个智能体的模型容量与其观测复杂度相匹配，这对于在部分可观测、有限可观测以及有限机载资源条件下的高效执行至关重要。在SMAC和MPE基准测试上进行的大量实验表明，KD-MARL在显著降低计算成本的同时，实现了高性能保持。在标准的多智能体基准测试中，KD-MARL保持了超过90%的专家性能，同时将计算成本（以浮点运算次数计）降低了高达28.6倍。所提出的方法实现了专家级的协调能力，并通过结构化蒸馏将其保持下来，从而使得多智能体强化学习能够在资源受限的机载平台上实现实际部署。

摘要 (Abstract)

Real-world deployment of multi-agent reinforcement learning (MARL) systems is fundamentally constrained by limited compute, memory, and inference time. While expert policies achieve high performance, they rely on costly decision cycles and large-scale models that are impractical for edge devices or embedded platforms. Knowledge distillation (KD) offers a promising path toward resource-aware execution, but existing KD methods in MARL focus narrowly on action imitation, often neglecting coordination structure and assuming uniform agent capabilities. We propose resource-aware Knowledge Distillation for Multi-Agent Reinforcement Learning (KD-MARL), a two-stage framework that transfers coordinated behavior from a centralized expert to lightweight decentralized student agents. The student policies are trained without a critic, relying instead on distilled advantage signals and structured policy supervision to preserve coordination under heterogeneous and limited observations. Our approach transfers both action-level behavior and structural coordination patterns from expert policies while supporting heterogeneous student architectures, allowing each agent's model capacity to match its observation complexity, which is crucial for efficient execution under partial or limited observability and limited onboard resources. Extensive experiments on SMAC and MPE benchmarks demonstrate that KD-MARL achieves high performance retention while substantially reducing computational cost. Across standard multi-agent benchmarks, KD-MARL retains over 90% of expert performance while reducing computational cost by up to 28.6× in FLOPs. The proposed approach achieves expert-level coordination and preserves it through structured distillation, enabling practical MARL deployment across resource-constrained onboard platforms.

Keywords: Knowledge Distillation, Multi-Agent Reinforcement Learning, Resource-Aware, Coordination, Edge Devices, Computational Efficiency, Heterogeneous Agents, Policy Transfer
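
A critic-free student objective of the kind the abstract describes might look like the sketch below. The abstract does not give the loss, so the form here is an assumption: a KL imitation term toward the expert's action distribution plus a policy term weighted by a distilled advantage. The names `student_loss` and `beta` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def student_loss(teacher_logits, student_logits, action, advantage, beta=1.0):
    """Imitation term plus an advantage-weighted log-likelihood term.

    The advantage is supplied by the expert side (distilled), so the
    student needs no critic of its own. This form is an assumption.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    imitation = kl_divergence(p_teacher, p_student)
    policy_term = -advantage * math.log(p_student[action] + 1e-12)
    return imitation + beta * policy_term
```

Because each student only needs its own logits and the expert's signals, students with different architectures and capacities can share the same objective.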

114. ❌ ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

Authors: Xuanle Zhao, Xinyuan Cai, Xiang Cheng, Xiuyi Chen, Bo Xu Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06685v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

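Scoring rationale: ChemVLR targets vision-language understanding in chemistry. Its core contribution is to bring the reasoning ability of large language models (LLMs) into the visual perception process, with an emphasis on interpretable reasoning paths. Highly relevant keywords: LLMs (core component), Chain of Thought / System 2 Thinking (explicit reasoning paths), Explainable AI (interpretability by design), and AI for Science (application to chemistry). Moderately relevant keywords: Pre-training / Domain Adaptation and SFT (touched by the three-stage training framework) and Scaling Laws / Data Quality (construction of a high-quality dataset). Other keywords, such as MoE, SLMs, Alignment, and RAG, are not addressed in the paper and score 0.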

!!! tip deepseek-chat TL;DR

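To address the lack of interpretable reasoning in existing models for chemical visual understanding, the paper proposes ChemVLR, which identifies fine-grained chemical descriptors and generates explicit reasoning paths, achieving state-of-the-art performance on molecular and reaction tasks.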

摘要翻译

尽管视觉语言模型（VLMs）在化学视觉理解领域展现出巨大潜力，但当前模型主要针对直接视觉问答任务进行优化。这种范式往往导致系统成为“黑箱”，未能充分利用大型语言模型（LLMs）推断潜在反应机制的内在能力。本研究提出ChemVLR，一种旨在感知过程中优先进行推理的化学VLM。与传统化学VLMs不同，ChemVLR通过先显式识别细粒度化学描述符（如官能团），再生成答案的方式对视觉输入进行精细化分析。该方法确保了对复杂视觉化学问题产生明确且可解释的推理路径。为实现这一方法，我们采用跨模态逆向工程策略，结合严格的过滤流程，构建了一个包含分子与反应任务共76万高质量样本的大规模推理-描述数据集。此外，我们采用三阶段训练框架，系统性地构建模型的感知与推理能力。实验表明，ChemVLR实现了最先进的性能，超越了领先的专有模型和特定领域的开源基线模型。我们还提供了全面的消融研究以验证训练策略与数据生成设计。代码与模型权重将在https://github.com/xxlllz/ChemVLR发布。

摘要 (Abstract)

While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in “black-box” systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systemically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.

Keywords: Vision-Language Models, Chemical Visual Understanding, Reasoning in Perception, Large Language Models, Interpretable Reasoning, Cross-modality Reverse-engineering, Three-stage Training, State-of-the-art Performance

115. ❌ Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry

Authors: Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06674v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

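Scoring rationale: The paper studies lexical semantic change in Persian poetry using Word2Vec and graph-based neighborhood analysis, and belongs to the digital humanities. All scoring keywords concern large models, the technical principles of deep learning, or AI applied to science. The paper uses no large models or deep learning techniques and does not involve AI applications in scientific fields such as bioinformatics, so every keyword receives a relevance of 0.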

!!! tip deepseek-chat TL;DR

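The paper studies how word meanings in Persian poetry change over time. Combining Word2Vec with graph analysis, it finds that semantic change mainly appears as neighborhood rewiring rather than abstract drift, and offers digital humanities a computational framework closer to literary practice.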

摘要翻译

波斯诗歌中的意义兼具历史性与关联性。词语在文学传统中持续存在，同时通过不断变化的邻近词群、修辞框架和诗歌声音改变其影响力。本研究运用对齐的Word2Vec向量空间结合基于图谱的邻近性分析，跨越数个世纪及主要诗人群体来考察这一过程。它并非仅将语义演变建模为向量位移，而是将词汇历史视为局部语义图谱的重构过程：邻近词的获得与丧失、桥梁角色的转移以及跨语义社区的流动。分析聚焦于二十个目标词，并以五个反复出现的参照词为锚点：大地、夜晚、两个酒相关术语以及心灵。围绕它们的是情感、宫廷、自然元素与苏菲概念，如爱情、忧伤、苦修僧、君王、寂灭与真理。这些词汇呈现出不同的演变模式：夜晚更具时间敏感性，大地更依赖诗人个体特质，而心灵虽在图谱角色中流动却表现出连续性。两个酒术语凸显了探测敏感性差异：一个词义宽泛且语义分散，另一个则更为狭窄稳定。词汇审计证实，语料库包含历史驱动型术语、诗人特异性用法以及需谨慎对待的稀疏神秘主义词汇。总体而言，将波斯诗歌的语义演变理解为邻近网络重构，比抽象漂移模型更能有效捕捉其本质。对于数字人文领域，该方法为计算分析恢复了局部结构，并支持更贴近文学实践的解释路径：延续性、迁移性、中介性以及选择性转化。

摘要 (Abstract)

Meaning in Persian poetry is both historical and relational. Words persist through literary tradition while shifting their force through changing constellations of neighbors, rhetorical frames, and poetic voices. This study examines that process using aligned Word2Vec spaces combined with graph-based neighborhood analysis across centuries and major poets. Rather than modeling semantic change as vector displacement alone, it treats lexical history as the rewiring of local semantic graphs: the gain and loss of neighbors, shifts in bridge roles, and movement across communities. The analysis centers on twenty target words, anchored by five recurrent reference terms: Earth, Night, two wine terms, and Heart. Surrounding them are affective, courtly, elemental, and Sufi concepts such as Love, Sorrow, Dervish, King, Annihilation, and Truth. These words exhibit distinct patterns of change. Night is more time-sensitive, Earth more poet-sensitive, and Heart shows continuity despite graph-role mobility. The two wine terms highlight probe sensitivity: one is broad and semantically diffuse, while the other is narrower and more stable. A lexical audit confirms that the corpus contains historically driven terms, poet-specific usages, and sparsely attested mystical vocabulary requiring caution. Overall, semantic change in Persian poetry is better captured as neighborhood rewiring than as abstract drift. For Digital Humanities, this approach restores local structure to computational analysis and supports interpretations closer to literary practice: persistence, migration, mediation, and selective transformation.

Keywords: Persian poetry, lexical semantic change, Word2Vec, graph-based analysis, neighborhood rewiring, Digital Humanities, semantic graphs, historical linguistics
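
The idea of treating semantic change as neighborhood rewiring, rather than vector displacement, can be illustrated with a toy comparison of a word's nearest neighbors in two periods. This is only a sketch: the two-dimensional vectors are invented, `rewiring` is a hypothetical helper, and the paper's actual graph measures (bridge roles, community movement) are not reproduced here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def neighbors(space, target, k):
    """The k words most similar to target within one embedding space."""
    scored = [(cosine(space[target], vec), w) for w, vec in space.items() if w != target]
    scored.sort(reverse=True)
    return {w for _, w in scored[:k]}


def rewiring(space_a, space_b, target, k=2):
    """Compare a word's neighbor sets across two periods."""
    a, b = neighbors(space_a, target, k), neighbors(space_b, target, k)
    return {
        "kept": a & b,
        "lost": a - b,
        "gained": b - a,
        "jaccard": len(a & b) / len(a | b),
    }


# Invented toy vectors for two periods.
early = {"night": (1.0, 0.0), "moon": (0.9, 0.1), "sorrow": (0.8, 0.3), "wine": (0.0, 1.0)}
late = {"night": (1.0, 0.0), "moon": (0.9, 0.1), "sorrow": (0.0, 1.0), "wine": (0.8, 0.3)}
change = rewiring(early, late, "night")
```

Here the target vector itself does not move between periods, yet one neighbor is lost and another gained, which is exactly the kind of change a displacement-only measure would miss.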

116. ❌ A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

Authors: Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, Yi Chang Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06666v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

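Scoring rationale: The paper's core topic is applying LLMs to fake news detection. It explicitly uses LLMs and RAG, which places it among applications of large models to specific domains. Its focus is Explainable AI, providing fine-grained explanations through a graph structure. It is partly related to Factuality, because fake news detection involves verifying veracity. Other keywords, such as MoE, SLMs, training methods, inference optimization, and agent systems, are not addressed.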

!!! tip deepseek-chat TL;DR

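The paper proposes a graph-enhanced defense framework (G-Defense) that uses retrieval-augmented generation (RAG) and large language models (LLMs) for explainable fake news detection, achieving state-of-the-art performance in both veracity detection and explanation quality.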

摘要翻译

可解释虚假新闻检测旨在评估新闻声明的真实性，同时提供易于理解的解释。现有结合调查性新闻的方法通常效率低下，且难以应对突发新闻。大语言模型（LLM）的最新进展使得利用外部检索的报告作为检测和解释生成的证据成为可能，但未经核实的报告可能引入不准确信息。此外，有效的可解释虚假新闻检测应为声明的所有方面提供易于理解的解释，以协助公众验证其准确性。为应对这些挑战，我们提出一种图增强防御框架（G-Defense），该框架仅基于未经核实的报告提供细粒度解释。具体而言，我们通过将新闻声明分解为若干子声明并建模其依赖关系，构建一个以声明为中心的图。针对每个子声明，我们采用检索增强生成（Retrieval-Augmented Generation, RAG）技术检索关键证据并生成竞争性解释。随后，我们引入一个基于图的类辩护推理模块来评估整体真实性。最后，我们提示大语言模型生成直观的解释图。实验结果表明，G-Defense在真实性检测和解释质量方面均达到了最先进的性能。

摘要 (Abstract)

Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations. Existing methods incorporating investigative journalism are often inefficient and struggle with breaking news. Recent advances in large language models (LLMs) enable leveraging externally retrieved reports as evidence for detection and explanation generation, but unverified reports may introduce inaccuracies. Moreover, effective explainable fake news detection should provide a comprehensible explanation for all aspects of a claim to assist the public in verifying its accuracy. To address these challenges, we propose a graph-enhanced defense framework (G-Defense) that provides fine-grained explanations based solely on unverified reports. Specifically, we construct a claim-centered graph by decomposing the news claim into several sub-claims and modeling their dependency relationships. For each sub-claim, we use the retrieval-augmented generation (RAG) technique to retrieve salient evidence and generate competing explanations. We then introduce a defense-like inference module based on the graph to assess the overall veracity. Finally, we prompt an LLM to generate an intuitive explanation graph. Experimental results demonstrate that G-Defense achieves state-of-the-art performance in both veracity detection and the quality of its explanations.

Keywords: fake news detection, large language models, retrieval-augmented generation, explainable AI, graph-based framework, veracity assessment, evidence retrieval, explanation generation
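
The claim-centered graph can be pictured with the sketch below, which decomposes a claim into sub-claims with dependencies and aggregates per-sub-claim evidence in dependency order. The aggregation rule is an assumption made for illustration (a sub-claim holds only if its prerequisites hold and its supporting explanation outweighs the refuting one); it is not the paper's inference module, and the scores are invented. It relies on `graphlib`, available in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter


def assess(sub_claims: dict, depends_on: dict):
    """Aggregate sub-claim verdicts over a dependency graph.

    sub_claims: {id: (support_score, refute_score)} from competing explanations.
    depends_on: {id: set of prerequisite ids}; every id must appear in sub_claims.
    """
    graph = {c: depends_on.get(c, set()) for c in sub_claims}
    verdict = {}
    # static_order() yields prerequisites before the claims that depend on them.
    for c in TopologicalSorter(graph).static_order():
        support, refute = sub_claims[c]
        prerequisites_hold = all(verdict[d] for d in depends_on.get(c, set()))
        verdict[c] = prerequisites_hold and support > refute
    return all(verdict.values()), verdict


# Invented scores: sub-claim "c" is refuted, so the overall claim fails.
overall, per_claim = assess(
    {"a": (0.9, 0.1), "b": (0.6, 0.4), "c": (0.2, 0.7)},
    {"b": {"a"}, "c": {"b"}},
)
```

Keeping a verdict per sub-claim is what makes the final explanation fine-grained: the output shows which part of the claim failed, not only that it failed.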

117. ❌ A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP

Authors: Cheng Peng, Mengxian Lyu, Ziyi Chen, Yonghui Wu Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06650v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

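Scoring rationale: The paper's core topic is parameter-efficient fine-tuning (PEFT), in particular a comparison with LoRA, so 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' receives 15 points (core content). It uses large models such as LLaMA and Meditron as backbones, so 'Large Language Models OR LLMs OR Foundation Models' receives 10 points (highly relevant). It involves supervised fine-tuning (SFT), so 'Post-training OR Supervised Fine-tuning OR SFT' receives 10 points (highly relevant). It is applied to clinical NLP, which falls under AI for Science, so 'AI for Science OR Bioinformatics OR Cheminformatics' receives 10 points (highly relevant). Other keywords, such as MoE, SLMs, RAG, and inference acceleration, are not mentioned in the abstract and receive 0 points.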

!!! tip deepseek-chat TL;DR

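The paper proposes a multitask prompt distillation and decomposition framework for clinical NLP. With fewer than 0.05% trainable parameters, it outperforms LoRA and single-task prompt tuning across multiple clinical NLP tasks.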

摘要翻译

现有的基于提示的微调方法通常独立学习任务特定的提示，在部署多个临床自然语言处理（NLP）系统时会产生巨大的计算和存储开销。我们提出了一种多任务提示蒸馏与分解框架，该框架从21个不同的临床源任务中学习一个共享的元提示，并以少于0.05%的可训练参数将其适配到未见过的目标任务上。该框架在三种骨干模型（LLaMA 3.1 8B、Meditron3 8B、gpt-oss 20B）上，针对五种临床NLP任务类型（命名实体识别、关系抽取、问答、自然语言推理和摘要生成），在10个预留的目标数据集上进行了评估。尽管使用的参数数量少几个数量级，我们的框架始终比LoRA方法性能高出1.5%~1.7%，并超过单任务提示微调6.1%~6.6%。其中，gpt-oss 20B模型取得了最高的整体性能，尤其在临床推理任务上表现突出。强大的零样本和少样本性能证明了共享提示表征具有更好的可迁移性。

摘要 (Abstract)

Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5–1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.

Keywords: Parameter-efficient fine-tuning, Multitask prompt distillation, Clinical NLP, Transfer learning, LoRA, Large language models, Few-shot learning, Prompt tuning
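
To give the "fewer than 0.05% trainable parameters" figure a sense of scale, the short calculation below converts it into an upper bound on parameter counts. The backbone sizes are the nominal values implied by the model names (8B and 20B), which is an assumption; exact parameter counts are not given in the abstract.

```python
def trainable_upper_bound(total_params: int) -> int:
    """Parameter count equal to 0.05% of total_params (integer arithmetic)."""
    return total_params * 5 // 10_000


# Nominal backbone sizes taken from the model names (assumed, not exact).
bound_8b = trainable_upper_bound(8_000_000_000)
bound_20b = trainable_upper_bound(20_000_000_000)
```

On these nominal sizes the bound works out to 4 million trainable parameters for an 8B backbone and 10 million for a 20B backbone, and the abstract states that the framework stays below that fraction.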

118. ❌ RPM-Net: Reciprocal Point MLP Network for Unknown Network Security Threat Detection

Authors: Jiachen Zhang, Yueming Lu, Fan Feng, Zhanfeng Wang, Shengli Pan, Daoqi Han Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06638v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

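Scoring rationale: The paper focuses on detecting unknown threats in network security and proposes RPM-Net, an MLP-based framework that uses a reciprocal point mechanism and adversarial margin constraints. Its content is unrelated to most keywords (LLM, MoE, SFT, RLHF, RAG, quantization, and so on), which concern large-model techniques, training methods, and inference optimization; the paper contributes nothing new on large models or the technical principles of deep learning. It is partly related only to 'Mechanistic Interpretability OR Explainable AI' (5 points), because it mentions geometric interpretability, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because network security can be viewed as an application of AI in science and engineering, though not core bioinformatics or cheminformatics.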

!!! tip deepseek-chat TL;DR

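The paper proposes the RPM-Net framework, which uses a reciprocal point mechanism and adversarial margin constraints to detect unknown network security threats in multi-class imbalanced settings. Experiments show that it outperforms existing methods on metrics such as F1-score and AUROC.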

摘要翻译

在多类别不平衡环境中有效检测未知网络安全威胁对于维护网络空间安全至关重要。现有方法侧重于学习类别表征，但在未知威胁检测、类别不平衡和可解释性缺乏方面面临挑战，限制了其实际应用。为此，我们提出RPM-Net这一新颖框架，该框架引入互逆点机制学习每个已知攻击类别的“非类”表征，并结合对抗性边界约束为未知威胁检测提供几何可解释性。RPM-Net++通过费舍尔判别正则化进一步提升了性能。实验结果表明，RPM-Net在F1分数、AUROC和AUPR-OUT（异常类平均精度）等多个指标上均取得优越性能，显著超越现有方法，并为实际网络安全应用提供实用价值。我们的代码公开于：https://github.com/chiachen-chang/RPM-Net

摘要 (Abstract)

Effective detection of unknown network security threats in multi-class imbalanced environments is critical for maintaining cyberspace security. Current methods focus on learning class representations but face challenges with unknown threat detection, class imbalance, and lack of interpretability, limiting their practical use. To address this, we propose RPM-Net, a novel framework that introduces a reciprocal point mechanism to learn “non-class” representations for each known attack category, coupled with adversarial margin constraints that provide geometric interpretability for unknown threat detection. RPM-Net++ further enhances performance through Fisher discriminant regularization. Experimental results show that RPM-Net achieves superior performance across multiple metrics including F1-score, AUROC, and AUPR-OUT, significantly outperforming existing methods and offering practical value for real-world network security applications. Our code is available at: https://github.com/chiachen-chang/RPM-Net

Keywords: network security, threat detection, unknown threats, class imbalance, reciprocal point mechanism, adversarial margin constraints, interpretability, RPM-Net
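
One way to read the "non-class" representation is geometric: each known class has a reciprocal point standing for everything that is not that class, so a sample of the class lies far from its own reciprocal point, while an unknown sample stays within a margin of all of them. The sketch below follows that general reading of reciprocal-point methods and is an assumption about the mechanism, not RPM-Net's code; the points, labels, and margin are invented, and in practice they would be learned in an embedding space.

```python
import math


def classify(x, reciprocal_points: dict, margin: float) -> str:
    """Assign x to the class whose reciprocal point is farthest away.

    If even the farthest reciprocal point is within the margin, x sits in
    the bounded region around all reciprocal points and is flagged unknown.
    """
    distances = {label: math.dist(x, p) for label, p in reciprocal_points.items()}
    label, farthest = max(distances.items(), key=lambda kv: kv[1])
    return label if farthest > margin else "unknown"


# Invented 2-D reciprocal points for two known attack classes.
points = {"dos": (0.0, 0.0), "scan": (4.0, 0.0)}
```

The margin gives the decision a direct geometric meaning, since a sample is rejected as unknown exactly when it falls inside a ball of that radius around every reciprocal point.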

119. ❌ SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning

Authors: Zhengyang Ai, Zikang Shan, Xiaodong Ai, Jingxian Tang, Hangkai Hu, Pinyan Lu Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06636v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

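Scoring rationale: SHAPE focuses on improving LLM reasoning and is an innovation in the technical principles of large models. Core relevant keywords: 1) 'Large Language Models' (10 points): the paper explicitly targets improved LLM reasoning; 2) 'Chain of Thought' (10 points): it addresses reasoning efficiency under process supervision, which is closely tied to CoT; 3) 'System 2 Thinking' (8 points): it involves optimizing in-depth reasoning but does not use the term explicitly. Other keywords, such as MoE, SFT, and RAG, are unrelated to the paper and receive 0 points.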

!!! tip deepseek-chat TL;DR

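The paper proposes the SHAPE framework, which uses hierarchical advantage estimation to address progress assessment and token efficiency in process supervision for LLM reasoning. On math reasoning tasks it improves accuracy by 3% and reduces token consumption by 30%.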

摘要翻译

过程监督已成为提升大语言模型推理能力的一种有效方法，但现有方法无法区分实质性进展与冗余表达，导致推理能力受限且计算效率低下。为此，我们提出基于潜力估计的阶段感知分层优势框架（Stage-aware Hierarchical Advantage via Potential Estimation，SHAPE），该框架将推理形式化为在经验可解性状态空间中的轨迹演进。SHAPE引入分层信用分配机制：在段落层面，采用阶段感知优势函数优先实现低潜力状态的高效突破；在词元层面，通过熵驱动重分布机制锐化执行信号。在三个基础模型和五个数学推理基准上的大量实验表明，SHAPE在平均准确率提升3%的同时减少了30%的词元消耗。

摘要 (Abstract)

Process supervision has emerged as a promising approach for enhancing LLM reasoning, yet existing methods fail to distinguish meaningful progress from mere verbosity, leading to limited reasoning capabilities and unresolved token inefficiency. To address this, we propose Stage-aware Hierarchical Advantage via Potential Estimation (SHAPE), a framework that formalizes reasoning as a trajectory through a state space of empirical solvability. SHAPE introduces a hierarchical credit assignment mechanism: at the segment level, it employs a stage-aware advantage function to prioritize efficient breakthroughs in low-potential states; at the token level, it utilizes entropy-driven redistribution to sharpen execution signals. Extensive experiments in math reasoning across three base models and five benchmarks demonstrate that SHAPE achieves an average accuracy gain of 3% with 30% reduced token consumption.

Keywords: LLM reasoning, process supervision, hierarchical advantage, potential estimation, token efficiency, math reasoning, stage-aware, credit assignment
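
The token-level step, entropy-driven redistribution, can be sketched as spreading a segment's advantage over its tokens in proportion to each token's predictive entropy. The abstract does not specify the rule, so both the proportional form and the choice to give high-entropy tokens more credit are assumptions; `redistribute` is a hypothetical name.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def redistribute(segment_advantage: float, token_distributions: list) -> list:
    """Split a segment-level advantage across tokens by relative entropy.

    The per-token values average to segment_advantage, so the segment-level
    signal is preserved while uncertain tokens receive a larger share.
    """
    entropies = [entropy(d) for d in token_distributions]
    total = sum(entropies)
    n = len(entropies)
    if total == 0:
        # All tokens were deterministic: fall back to a uniform split.
        return [segment_advantage] * n
    return [segment_advantage * n * e / total for e in entropies]


# A two-token segment: the first token was a coin flip, the second was certain.
token_advantages = redistribute(1.0, [[0.5, 0.5], [1.0, 0.0]])
```

In this toy case all of the credit moves to the uncertain first token, which illustrates how such a rule could sharpen the signal at decision points instead of spreading it evenly over routine tokens.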

120. ❌ Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Authors: Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, Konstantinos Arvanitis, Sophia Ananiadou Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07338v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

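Scoring rationale: The paper studies vision-language models applied to cultural heritage, specifically inferring structured cultural metadata (e.g., creator, origin, period) from images. It evaluates models with an LLM-as-Judge framework, so it has some connection to 'Large Language Models' (5 points). The work applies AI to a scientific/cultural domain, so it is related to 'AI for Science' (5 points). The remaining keywords concern the technical principles, training methods, and inference optimization of large models, none of which the paper addresses, so they score 0.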
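
The arithmetic behind the "0.0 / 26.6" line can be sketched as follows. The passing threshold uses the formula from the report header (5.0 + 0.8 × total weight, with 27 keywords of total weight 27.0). The per-keyword rule, weight × relevance capped at the header's maximum of 15, is an assumption: the report does not state how the Score column is derived. The function and constant names are illustrative only.

```python
from typing import Iterable, Tuple

MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report header
BASE_SCORE = 5.0        # constant term of the passing-score formula
WEIGHT_FACTOR = 0.8     # multiplier on the total keyword weight


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the header: 5.0 + 0.8 * total weight."""
    return BASE_SCORE + WEIGHT_FACTOR * total_weight


def paper_score(rows: Iterable[Tuple[float, float]]) -> float:
    """Sum per-keyword scores for (weight, relevance) rows.

    ASSUMPTION: each row contributes weight * relevance, capped at
    MAX_PER_KEYWORD. The report itself does not state this formula.
    """
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Rows as listed in the table above: every weight is 0.0,
# and two keywords have relevance 5.0.
rows_120 = [(0.0, 5.0), (0.0, 5.0)] + [(0.0, 0.0)] * 25

threshold = passing_score(27.0)  # 27 keywords, total weight 27.0
total = paper_score(rows_120)
print(round(threshold, 1), total, total >= threshold)
```

Under this assumed rule, a listed weight of 0.0 makes every product 0.0 regardless of relevance, which matches the Score column shown in the table.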

!!! tip deepseek-chat TL;DR

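The paper introduces a cross-cultural benchmark for evaluating how well vision-language models infer structured cultural metadata from images, and finds that current models perform inconsistently across cultures and metadata types and that their predictions are weakly grounded.

Translated abstract

Recent advances in vision-language models for describing cultural heritage images have improved performance on related tasks. However, inferring structured cultural metadata (e.g., creator, place of origin, period) from visual input remains underexplored. To this end, we introduce a multi-category, cross-cultural benchmark dataset for this task and evaluate vision-language models with an LLM-as-Judge framework, which measures the semantic alignment between model outputs and reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. The results show that existing models capture only fragmented signals and exhibit substantial performance differences across cultural backgrounds and metadata types, leading to inconsistent and insufficiently grounded predictions. These findings highlight the limitations of current vision-language models in structured cultural metadata inference beyond visual perception.

Abstract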

Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.

Keywords: vision-language models, cultural heritage, structured metadata inference, cross-cultural benchmark, LLM-as-Judge, cultural reasoning, image captioning, semantic alignment
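
The three metrics named in the abstract (exact-match, partial-match, attribute-level accuracy) can be illustrated with a minimal sketch. The definitions here are assumptions, not the paper's: a record counts as an exact match when every attribute agrees and as a partial match when at least one does. Normalized string equality stands in for the paper's LLM-as-Judge semantic comparison, and the sample records are invented.

```python
from typing import Dict, List

Record = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (a stand-in for semantic matching)."""
    return " ".join(value.lower().split())


def score_predictions(preds: List[Record], refs: List[Record],
                      attributes: List[str]) -> Dict[str, float]:
    """Return exact-match, partial-match, and per-attribute accuracy."""
    exact = partial = 0
    per_attr_hits = {a: 0 for a in attributes}
    for pred, ref in zip(preds, refs):
        hits = [normalize(pred.get(a, "")) == normalize(ref[a]) for a in attributes]
        for a, hit in zip(attributes, hits):
            per_attr_hits[a] += hit
        exact += all(hits)    # every attribute agrees
        partial += any(hits)  # at least one attribute agrees (assumed definition)
    n = len(refs)
    result = {"exact_match": exact / n, "partial_match": partial / n}
    for a in attributes:
        result[f"acc_{a}"] = per_attr_hits[a] / n
    return result


# Invented example records, for illustration only.
ATTRS = ["creator", "origin", "period"]
refs = [
    {"creator": "Unknown", "origin": "Japan", "period": "Edo"},
    {"creator": "Rembrandt", "origin": "Netherlands", "period": "17th century"},
]
preds = [
    {"creator": "unknown", "origin": "Japan", "period": "Meiji"},
    {"creator": "Rembrandt", "origin": "netherlands", "period": "17th  century"},
]
metrics = score_predictions(preds, refs, ATTRS)
print(metrics)
```

Reporting the per-attribute accuracies alongside the record-level scores shows which metadata type (here, `period`) accounts for the failures.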

121. ❌ OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Authors: Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang, Zhiliang Zhu, Yijun Yang, Shenghe Zheng, Nan Jiang, Jiaxiu Jiang, Haoyang Huang, Tien-Tsin Wong, Nan Duan, Xiaojuan Qi Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07296v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

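Scoring rationale: The paper "OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence" focuses on a data engine and dataset construction for spatial intelligence. Its core contributions are the open-source data generation system OpenSpatial and the 3M-sample dataset OpenSpatial-3M, which improve performance on spatial perception and reasoning tasks. It covers specific tasks such as spatial measurement, spatial relationships, camera perception, multi-view consistency, and scene-aware reasoning, but does not touch on large models, the principles of deep learning techniques, AI applications in science, or any specific technique among the scoring keywords (e.g., LLMs, MoE, training methods, inference optimization, AI agents). All keywords are unrelated to the paper's topic, so every relevance score is 0.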

!!! tip deepseek-chat TL;DR

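To address the lack of high-quality, scalable data for spatial intelligence research, the paper proposes the open-source data engine OpenSpatial and builds OpenSpatial-3M, a large-scale dataset of 3 million samples, which significantly improves performance on a range of spatial reasoning tasks.

Translated abstract

Spatial understanding is a cornerstone of human-level intelligence. However, current research focuses mainly on domain-specific data production, leaving a critical gap: the lack of a principled, open-source engine that can fully unlock the potential of high-quality spatial data. To fill this gap, we set out the design principles of a robust data generation system and present OpenSpatial, an open-source data engine designed for high quality, broad scalability, wide task diversity, and optimized efficiency. OpenSpatial uses 3D bounding boxes as its basic primitive and builds a comprehensive data hierarchy covering five foundational tasks: spatial measurement, spatial relationships, camera perception, multi-view consistency, and scene-aware reasoning. Using this scalable infrastructure, we build OpenSpatial-3M, a large-scale dataset of 3 million high-fidelity samples. Extensive evaluation shows that versatile models trained on our dataset achieve state-of-the-art performance across a wide range of spatial reasoning benchmarks. Notably, the best-performing model achieves a substantial average relative improvement of 19%. In addition, we systematically analyze how data attributes affect spatial perception. By open-sourcing the engine and the 3M-scale dataset, we provide a solid foundation for accelerating future research in spatial intelligence.

Abstract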

Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial – an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average improvement of 19 percent, relatively. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.

Keywords: spatial intelligence, data engine, OpenSpatial, large-scale dataset, spatial reasoning, 3D bounding boxes, scene-aware reasoning, data generation system

122. ❌ Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation

Authors: Songhee Han Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07285v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

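Scoring rationale: The paper mainly discusses the limits of applying AI in education, in particular why teaching work cannot be automated. The abstract explicitly mentions "large language models and retrieval-augmented generation systems", so it has some connection to the 'Large Language Models' and 'Retrieval-Augmented Generation' keywords (5 points each). However, the paper is not a technical study but a critical analysis from the perspective of the sociology of education. It does not address specific large-model principles, training methods, optimization techniques, or scientific applications, so all other keywords score 0.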

!!! tip deepseek-chat TL;DR

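The paper examines why teaching is hard to automate in the AI era, arguing that the interpretive, relational, and judgment-based nature of teaching means technology cannot fully replace it, even though AI can support some bounded functions.

Translated abstract

Discussions of artificial intelligence (AI) in education often portray teaching as modular, procedural work that can increasingly be automated or delegated to technology. This brief communication argues that such claims rest on treating teaching as more separable than it is in practice. Drawing on recent literature and empirical studies of large language models and retrieval-augmented generation systems, the paper argues that although AI can support certain bounded functions, instructional work remains difficult to automate in any meaningful way because it is inherently interpretive, relational, and rooted in professional judgment. More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled. Tasks that appear separable in principle in fact derive their instructional value in practice from continuous contextual interpretation of learners, situations, and relationships. As long as educational practice depends on an emergent understanding of human cognition and learning, teaching remains professional work that resists automation. AI may improve access to information and support specific instructional activities, but it cannot replace the human judgment and relational accountability that effective teaching requires.

Abstract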

Debates about artificial intelligence (AI) in education often portray teaching as a modular and procedural job that can increasingly be automated or delegated to technology. This brief communication paper argues that such claims depend on treating teaching as more separable than it is in practice. Drawing on recent literature and empirical studies of large language models and retrieval-augmented generation systems, I argue that although AI can support some bounded functions, instructional work remains difficult to automate in meaningful ways because it is inherently interpretive, relational, and grounded in professional judgment. More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled. Tasks that may appear separable in principle derive their instructional value in practice from ongoing contextual interpretation across learners, situations, and relationships. As long as educational practice relies on emergent understanding of human cognition and learning, teaching remains a form of professional work that resists automation. AI may improve access to information and support selected instructional activities, but it does not remove the need for human judgment and relational accountability that effective teaching requires.

Keywords: teaching automation, large language models, retrieval-augmented generation, human judgment, professional work, educational practice, AI in education, instructional work

123. ❌ ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection

Authors: Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07272v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

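Scoring rationale: The paper studies clickbait detection using conventional deep learning techniques such as BERT embeddings and a CNN-BiLSTM. It does not address the core content of keywords such as large models (LLMs), MoE, SLMs, scaling laws, pre-training/post-training, alignment, RLHF, PEFT, RAG, long context, inference acceleration, chain of thought, agents, quantization, hallucination mitigation, world models, model merging, or in-context learning. It has some connection only to 'Mechanistic Interpretability OR Explainable AI' (5 points), because it uses LIME and PFI for interpretability analysis, although this is not the paper's core contribution.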

!!! tip deepseek-chat TL;DR

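The paper proposes the ClickGuard framework, which combines BERT embeddings with structural features through an adaptive fusion module to detect clickbait headlines, reaching 96.93% test accuracy and outperforming existing methods.

Translated abstract

The widespread use of clickbait headlines, designed to mislead users and maximize engagement, poses a major challenge to the credibility of online information. Such headlines often rely on sensationalism, misleading claims, and vague language, which underscores the urgent need for effective detection to keep digital content trustworthy. This paper proposes ClickGuard, a trustworthy adaptive fusion framework for clickbait detection. The framework dynamically fuses BERT embeddings and structural features through a Syntactic-Semantic Adaptive Fusion Block (SSAFB), and uses a hybrid CNN-BiLSTM architecture to capture textual patterns and dependencies. The model achieves 96.93% test accuracy, outperforming existing state-of-the-art methods. To assess the model's trustworthiness, the study uses LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods evaluate the model's robustness and its sensitivity to feature changes by measuring the average change in predictions. Ablation studies confirm the effectiveness of the SSAFB in optimizing feature fusion. The model performs robustly across several different datasets and, by addressing the challenges of syntactic-semantic modelling, offers a scalable and reliable solution for improving the credibility of online content. The code is available at: https://github.com/palindromeRice/ClickBait_Detection_Architecture

Abstract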

The widespread use of clickbait headlines, crafted to mislead and maximize engagement, poses a significant challenge to online credibility. These headlines employ sensationalism, misleading claims, and vague language, underscoring the need for effective detection to ensure trustworthy digital content. The paper introduces, ClickGuard: a trustworthy adaptive fusion framework for clickbait detection. It combines BERT embeddings and structural features using a Syntactic-Semantic Adaptive Fusion Block (SSAFB) for dynamic integration. The framework incorporates a hybrid CNN-BiLSTM to capture patterns and dependencies. The model achieved 96.93% testing accuracy, outperforming state-of-the-art approaches. The model’s trustworthiness is evaluated using LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods assess the model’s robustness and sensitivity to feature changes by measuring the average prediction variation. Ablation studies validated the SSAFB’s effectiveness in optimizing feature fusion. The model demonstrated robust performance across diverse datasets, providing a scalable, reliable solution for enhancing online content credibility by addressing syntactic-semantic modelling challenges. Code of the work is available at: https://github.com/palindromeRice/ClickBait_Detection_Architecture

Keywords: clickbait detection, BERT embeddings, adaptive fusion, CNN-BiLSTM, interpretability, trustworthy framework, syntactic-semantic modeling, feature importance
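
The perturbation analysis described in the abstract (shuffling a feature and measuring the average prediction variation) can be sketched in a few lines. This is a generic illustration of the permutation idea, not the ClickGuard code. It measures the mean absolute change in predictions, whereas the more common formulation of Permutation Feature Importance measures the drop in a performance metric. The toy model and data are invented.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(predict: Callable[[Sequence[float]], float],
                           rows: List[List[float]],
                           n_features: int,
                           n_repeats: int = 20,
                           seed: int = 0) -> List[float]:
    """Mean absolute prediction change when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = [predict(r) for r in rows]
    importances = []
    for j in range(n_features):
        total = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and each row
            for row, value, base in zip(rows, column, baseline):
                perturbed = list(row)
                perturbed[j] = value
                total += abs(predict(perturbed) - base)
        importances.append(total / (n_repeats * len(rows)))
    return importances


def toy_model(row: Sequence[float]) -> float:
    """Toy stand-in for a trained classifier: depends on feature 0 only."""
    return 2.0 * row[0]


data = [[0.0, 5.0], [1.0, 3.0], [2.0, 1.0], [3.0, 4.0]]
importances = permutation_importance(toy_model, data, n_features=2)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column never changes a prediction and its importance is exactly zero, while feature 0 receives a positive value.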

124. ❌ Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent

Authors: Bingxuan Li, Simo Du, Yue Guo Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

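Scoring rationale: The paper proposes SEA, a self-learning diagnostic agent that performs clinical reasoning with LLMs, an application of large models to the biomedical domain. Core related keywords: 1) 'Large Language Models' (the paper explicitly builds its diagnostic agent on LLMs); 2) 'Chain of Thought' and 'System 2 Thinking' (it involves clinical and multi-step reasoning); 3) 'Self-Correction' (the self-learning mechanism includes self-improvement); 4) 'LLM Agents' (it builds a diagnostic agent); 5) 'AI for Science' (medical diagnosis application). Other keywords such as MoE, quantization, and RAG are not covered.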

!!! tip deepseek-chat TL;DR

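To address the lack of experience reuse and continual adaptation in existing LLM diagnostic agents, the study proposes SEA, a self-learning diagnostic agent with a dual-memory module. By jointly optimizing reasoning and memory management, it significantly improves accuracy and continual learning on clinical diagnosis tasks.

Translated abstract

Clinical expertise grows not only through acquiring medical knowledge but also through accumulating experience that yields reusable diagnostic patterns. Recent LLM-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most existing methods handle cases independently, which limits experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. We design a reinforcement training framework that jointly optimizes the agent's reasoning and memory management. We evaluate SEA in two complementary settings. In the standard evaluation on the MedCaseReasoning dataset, SEA reaches 92.46% accuracy, surpassing the strongest baseline by 19.6% and confirming the benefit of jointly optimizing reasoning and memory. In the long-horizon evaluation on the ER-Reason dataset, SEA achieves the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), whereas baseline methods show only limited or unstable gains. Expert evaluation further shows that the rules consolidated by SEA have strong clinical correctness, usefulness, and trustworthiness, indicating that the rules induced in the dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning and continual learning by effectively turning experience into reusable knowledge.

Abstract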

Clinical expertise improves not only by acquiring medical knowledge, but by accumulating experience that yields reusable diagnostic patterns. Recent LLMs-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with cognitively inspired dual-memory module. We design a reinforcement training framework tailored to our designed agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. On standard evaluation with MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. On the long-horizon with ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated from SEA show strong clinical correctness, usefulness and trust, suggesting that the induced rules in dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.

Keywords: Self-learning diagnostic agent, Dual-memory module, Clinical reasoning, LLM-based agents, Reinforcement training, Experience reuse, Continual adaptation, Medical diagnosis

125. ❌ Efficient Learned Data Compression via Dual-Stream Feature Decoupling

Authors: Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu, Gang Wang, Wentong Cai Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07239v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

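Scoring rationale: The paper focuses on architectural optimization for Learned Data Compression (LDC), proposing a Dual-Stream Multi-Scale Decoupler and a Concurrent Stream-Parallel Pipeline to improve compression ratio and throughput. Although it is a deep learning application, its content is entirely unrelated to the scoring keywords (all of which center on specific techniques such as large language models, alignment, reasoning, agents, and AI for science), and it involves none of the techniques or application areas those keywords name.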

!!! tip deepseek-chat TL;DR

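The paper tackles the challenge of balancing probability-modeling precision with system efficiency in learned data compression. By proposing a Dual-Stream Multi-Scale Decoupler and a Concurrent Stream-Parallel Pipeline, it achieves state-of-the-art results in compression ratio, throughput, latency, and memory usage.

Translated abstract

Although Learned Data Compression (LDC) has achieved excellent compression ratios, balancing precise probability modeling with system efficiency remains challenging. The key problem is that uniform single-stream architectures struggle to capture micro-syntactic and macro-semantic features at the same time, which forces models into deep serial stacking and thereby increases latency. In addition, heterogeneous systems are limited by device speed mismatches: because of serial processing, their throughput is bounded by Amdahl's Law. To address this, we propose a Dual-Stream Multi-Scale Decoupler that separates local and global context and replaces deep serial processing with shallow parallel streams, and we introduce a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. We further design a Concurrent Stream-Parallel Pipeline that overcomes system bottlenecks and achieves full-pipeline parallelism. Extensive experiments show that our method achieves state-of-the-art compression ratio and throughput while maintaining the lowest latency and memory footprint. The code is available at https://github.com/huidong-ma/FADE.

Abstract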

While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single-stream architectures struggle to simultaneously capture micro-syntactic and macro-semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl’s Law due to serial processing. To this end, we propose a Dual-Stream Multi-Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream-Parallel Pipeline, which overcomes systemic bottlenecks to achieve full-pipeline parallelism. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at https://github.com/huidong-ma/FADE.

Keywords: Learned Data Compression, Dual-Stream Architecture, Multi-Scale Decoupler, Parallel Processing, Probability Modeling, Throughput Optimization, Latency Reduction, Memory Efficiency
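
The abstract's point that serial processing caps throughput under Amdahl's Law can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n the number of parallel workers. The sketch below is a generic worked example; the fractions used are illustrative and are not taken from the paper.

```python
def amdahl_speedup(parallel_fraction: float, n_workers: float) -> float:
    """Overall speedup when `parallel_fraction` of the work runs on `n_workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Illustrative values: with 10% serial work, the speedup approaches
# but never reaches 1 / 0.1 = 10, however many workers are added.
for workers in (2, 8, 64, 1_000_000):
    print(workers, round(amdahl_speedup(0.9, workers), 3))
```

The upper bound 1 / (1 - p) is why shrinking the serial portion of a pipeline, rather than only adding devices, is what lifts throughput.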

126. ❌ LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

Authors: Kosmas Pinitas, Ilias Maglogiannis Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

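Scoring rationale: The paper proposes a new framework that uses Language Models (LMs) as semantic context conditioners to model affective dynamics, an application of large models to affective computing. Relevance to the keywords: 1) the paper explicitly uses "Language Models (LMs)", which fall within the scope of large models, but it does not specifically refer to LLMs or foundation models, so 8 points; 2) the paper emphasizes the framework's "interpretability" and "transparency", which is highly relevant to "Explainable AI", so 8 points; 3) the paper uses a "pretrained LM", an application of pre-trained models with some connection to "Pre-training", so 5 points; 4) the paper targets affect prediction, an application of AI to science/psychology with some connection to "AI for Science", so 5 points. Other keywords such as MoE, SFT, RAG, and quantization are not covered in the paper and receive 0 points.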

!!! tip deepseek-chat TL;DR

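The study proposes a new framework that uses language models as semantic context conditioners to model affective dynamics. Experiments on the Aff-Wild2 and SEWA datasets show that, while keeping features interpretable, it predicts valence and arousal more accurately than handcrafted-feature and deep-embedding baselines.

Translated abstract

Predicting affective states in unconstrained environments remains a fundamental challenge in human-centered AI. Although deep neural embeddings dominate current mainstream approaches, they often lack interpretability and limit expert-driven model refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our method starts from interpretable facial geometry and acoustic features extracted from structured domain knowledge. These features are converted into symbolic natural-language descriptions that encode their affective implications. A pretrained language model processes these descriptions and produces semantic context embeddings that serve as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while exploiting the contextual abstraction capabilities of language models. We evaluate the proposed method on affect change prediction using the Aff-Wild2 and SEWA datasets. The experimental results show consistent improvements in prediction accuracy for both Valence and Arousal over baselines that use only handcrafted features or deep embeddings. Our study shows that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures.

Abstract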

Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures

Keywords: Language Models, affect prediction, interpretable modelling, semantic conditioning, Valence and Arousal, transparent framework, human-centered AI, affective dynamics
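
The step in which handcrafted features are "transformed into symbolic natural-language descriptions" can be pictured with a small sketch. Everything in it is hypothetical: the feature names, the tolerance threshold, and the phrasing are invented for illustration and are not the descriptors used by LaScA. The later step, in which a pretrained LM embeds the resulting text, is not shown.

```python
from typing import Dict


def describe_feature(name: str, delta: float, tolerance: float = 0.05) -> str:
    """Map a numeric change in one feature to a short symbolic phrase.

    ASSUMPTION: a fixed tolerance separates "changed" from "unchanged".
    """
    if delta > tolerance:
        trend = "increased"
    elif delta < -tolerance:
        trend = "decreased"
    else:
        trend = "stayed roughly constant"
    return f"{name} {trend}"


def describe_frame(deltas: Dict[str, float]) -> str:
    """Join per-feature phrases (in sorted name order) into one sentence."""
    phrases = [describe_feature(name, delta) for name, delta in sorted(deltas.items())]
    return "; ".join(phrases) + "."


# Hypothetical feature changes between two time steps.
sample = {"mouth corner height": 0.2, "voice pitch": -0.1, "brow distance": 0.0}
description = describe_frame(sample)
print(description)
```

Keeping the mapping rule-based means each phrase can be traced back to a named feature, which is the transparency property the abstract emphasizes.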

127. ❌ Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Authors: Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07189v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

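Scoring rationale: The paper proposes an autonomous corpus linguistics framework built on a large language model (LLM). At its core, an LLM agent connects to a corpus query engine through a tool-use interface and carries out hypothesis generation, querying, result interpretation, and multi-round analysis. It is therefore highly relevant to LLMs, LLM agents, and tool use (10 points each). It has some connection to retrieval-augmented generation, chain of thought, System 2 thinking, self-correction, and hallucination mitigation (5 points each), because these concepts appear in the agent's multi-round reasoning, verification, and evidence-based analysis. It is related to AI for Science (5 points) because it applies AI to linguistics, a branch of science. Other keywords such as MoE, quantization, and training methods are not covered and score 0.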

!!! tip deepseek-chat TL;DR

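The paper proposes an autonomous corpus linguistics framework based on an LLM agent that connects to a corpus through a tool-use interface, automating the research workflow from hypothesis generation to result interpretation. A study of English intensifiers shows that it can quickly produce empirically grounded findings.

Translated abstract

Corpus linguistics has traditionally relied on researchers to formulate hypotheses, construct queries, and interpret results by hand, a process that requires specialized technical skills and a great deal of time. We propose Agent-Driven Corpus Linguistics, in which a large language model (LLM) is connected to a corpus query engine through a structured tool-use interface and takes over the research cycle: generating hypotheses, querying the corpus, interpreting results, and iterating the analysis over multiple rounds. The researcher sets the direction and evaluates the final output. Unlike unconstrained LLM generation, every finding in this approach is anchored in verifiable corpus evidence. We do not seek to replace the classic corpus-based/corpus-driven distinction but treat the approach as a complementary dimension: it concerns who carries out the research, not the epistemological relationship between theory and data. We demonstrate the framework by connecting an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only the instruction "investigate English intensifiers", the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled experiment shows that corpus grounding provides quantification and falsifiability that the model cannot achieve from its training data alone. To test external validity, the agent replicated two published studies, Claridge (2025) and De Smet (2013), on the CLMET corpus (40 million tokens) with closely matching quantitative results. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

Abstract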

Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results - a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only “investigate English intensifiers,” the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone. To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) - Claridge (2025) and De Smet (2013) - with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

Keywords: Agent-Driven Corpus Linguistics, LLM agent, autonomous linguistic discovery, tool-use interface, corpus query engine, hypothesis generation, empirically grounded findings, Model Context Protocol
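The investigative cycle described in the abstract (query the corpus, read off the counts, interpret) can be illustrated with a toy stand-in. This is a minimal sketch under loud assumptions: the six-sentence corpus is invented, `query_frequency` is a hypothetical replacement for a real CQP query issued through MCP, and the LLM's hypothesis step is replaced by a fixed candidate list. It shows only how a normalized frequency per period could ground a claim about a relay chain.

```python
from collections import Counter

# Invented toy corpus: (period, tokenized sentence). Not data from the paper.
TOY_CORPUS = [
    ("1800s", "she was so glad and so kind".split()),
    ("1800s", "it is very cold".split()),
    ("1900s", "it is very late and very dark".split()),
    ("1900s", "he was really tired".split()),
    ("2000s", "that is really good and really cheap".split()),
    ("2000s", "it was very odd".split()),
]

def query_frequency(corpus, word):
    """Hypothetical 'tool': hits per 1,000 tokens for `word`, grouped by period."""
    hits, totals = Counter(), Counter()
    for period, tokens in corpus:
        totals[period] += len(tokens)
        hits[period] += tokens.count(word)
    return {period: 1000 * hits[period] / totals[period] for period in totals}

def investigate(corpus, candidates):
    """One round: query each candidate and record the period where it peaks."""
    findings = {}
    for word in candidates:
        freq = query_frequency(corpus, word)
        findings[word] = max(freq, key=freq.get)
    return findings

findings = investigate(TOY_CORPUS, ["so", "very", "really"])
print(findings)
```

In a real setup the candidate list would come from the model and each count would be traceable to a corpus query, which is what makes the resulting claim checkable.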

128. ❌ Language Bias under Conflicting Information in Multilingual LLMs

Authors: Robert Östling, Murathan Kurfalı Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07123v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

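Scoring rationale: The paper studies language bias in multilingual large language models when they integrate conflicting information. Its core is LLM evaluation and bias analysis, so it is highly relevant to ‘Large Language Models’ (10). The study includes long-context evaluation, which gives it some connection to ‘Context Window Extension’ (5). Bias and factual conflict relate to ‘Hallucination Mitigation’ (8). The analysis of model behavior touches on ‘Mechanistic Interpretability’ (5). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and receive 0.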
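The pass/fail mark can be reproduced from the report's own settings. The sketch below assumes that a paper's total is the sum of the last column of its score table, and it uses the passing-score formula from the configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). The function names and the two sample rows are illustrative and are not part of the reporting tool.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as stated in the report configuration."""
    return round(base + factor * total_weight, 1)

def total_score(table_text):
    """Sum the last column of a tab-separated score table (assumed aggregation)."""
    total = 0.0
    for line in table_text.strip().splitlines():
        keyword, weight, relevance, score = line.split("\t")
        total += float(score)
    return total

# Two rows copied from the table above; the remaining rows also score 0.0.
rows = (
    "Large Language Models OR LLMs OR Foundation Models\t0.0\t10.0/10\t0.0\n"
    "Hallucination Mitigation OR Factuality OR Truthfulness\t0.0\t8.0/10\t0.0"
)

threshold = passing_score(27.0)
passed = total_score(rows) >= threshold
print(threshold, passed)
```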

!!! tip deepseek-chat TL;DR

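This paper studies language bias in multilingual large language models when they integrate conflicting information. It finds that all tested models tend to ignore the conflict and confidently assert a single answer, and that they share a consistent pattern of language preference: a general bias against Russian and, at the longest context lengths, a preference for Chinese.

Abstract (translated)

Large language models (LLMs) have been shown to exhibit biases when integrating conflicting information while answering questions. This paper asks whether such biases also exist with respect to the language in which each conflicting piece of information is written. To answer this question, we extend the “conflicting needles in a haystack” paradigm to a multilingual setting and comprehensively evaluate a range of multilingual LLMs of different sizes on naturalistic news-domain data in five languages. We find that all tested models, including GPT-5.2, ignore the conflict in the large majority of cases and confidently assert only one of the possible answers. In addition, all models show a systematic preference for particular languages: a general bias against Russian content and, in the longest-context settings, a preference for Chinese information. Both tendencies appear in models trained inside and outside mainland China, although they are somewhat stronger in the former.

Abstract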

Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.

Keywords: Large Language Models, multilingual LLMs, language bias, conflicting information, needles in a haystack, model evaluation, GPT-5.2, context length
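To make the “conflicting needles in a haystack” setup concrete, here is a minimal sketch of how such a test item could be assembled and how an answer could be labeled. Everything in it is an assumption for illustration: the filler texts, the two needle sentences, the substring-based answer check, and the function names are invented, and the paper's actual construction and scoring may differ.

```python
import random

def build_haystack(filler, needle_a, needle_b, seed=0):
    """Insert two conflicting needle sentences at distinct random positions."""
    rng = random.Random(seed)
    docs = list(filler)
    pos_a, pos_b = sorted(rng.sample(range(len(docs) + 1), 2))
    docs.insert(pos_a, needle_a)
    docs.insert(pos_b + 1, needle_b)  # +1 because needle_a shifted the list
    return docs

def classify_answer(answer, forms_a, forms_b):
    """Label an answer by which of the conflicting values it mentions."""
    has_a = any(form in answer for form in forms_a)
    has_b = any(form in answer for form in forms_b)
    if has_a and has_b:
        return "acknowledges_conflict"
    if has_a:
        return "asserts_a"
    if has_b:
        return "asserts_b"
    return "neither"

filler = ["Filler article 1.", "Filler article 2.", "Filler article 3."]
needle_en = "The summit will be held in Oslo."
needle_ru = "Саммит пройдёт в Бергене."  # conflicting claim in another language
haystack = build_haystack(filler, needle_en, needle_ru)
label = classify_answer("The summit will be held in Oslo.",
                        ["Oslo", "Осло"], ["Bergen", "Берген"])
print(label)
```

Counting how often `asserts_a` and `asserts_b` occur per needle language, over many items, would give the kind of language-preference statistic the abstract reports. This sketch always places the first needle before the second, so a real experiment would also need to balance needle order.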

129. ❌ Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Authors: Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, Erika Lombart, Marie-Catherine de Marneffe, Gözde Gül Şahin Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07119v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

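Scoring rationale: This paper studies language bias in NLP peer review. It is research on fairness in academic publishing rather than an innovation in the technical principles of large models or deep learning, or an application of them in a scientific field. The paper covers bias detection, dataset construction, and empirical analysis, and is unrelated to all scoring keywords, which focus on large-model technology, training methods, inference optimization, and applications.

!!! tip deepseek-chat TL;DR

This paper presents the first systematic study of language-of-study bias in NLP peer review. It finds that papers on non-English languages face higher bias rates, and that negative bias most often takes the form of demands for unjustified cross-lingual generalization.

Abstract (translated)

Peer review plays a central role in the publication process of natural language processing (NLP), but it is susceptible to various biases. This paper studies language-of-study (LoS) bias: the tendency of reviewers to evaluate a paper differently based on the language(s) it studies rather than its scientific merit. Although reviewing guidelines explicitly warn against such bias, it remains poorly understood. Prior work places such comments in broad categories of “weak” or “unconstructive” reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing its negative and positive forms, and release the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) together with a detection method that achieves 87.37 macro F1. By analyzing 15,645 reviews, we estimate how negative and positive bias vary with the LoS and find that papers studying non-English languages face substantially higher bias rates than English-only papers, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias and find that demanding unjustified cross-lingual generalization is the dominant form. We publicly release all resources to support fairer reviewing practices in NLP and beyond.

Abstract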

Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.

Keywords: peer review bias, language-of-study bias, NLP, fairness, bias detection, scientific publishing, cross-lingual generalization, review analysis
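The detection result above is reported as macro F1, which averages per-class F1 so that rare classes count as much as frequent ones. The following is a minimal pure-Python sketch of that metric. The three labels (`neg`, `pos`, `none`) and the toy predictions are invented for illustration and are not the LOBSTER label set or data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Invented toy labels: negative bias, positive bias, no bias.
y_true = ["neg", "neg", "pos", "none", "none", "none"]
y_pred = ["neg", "none", "pos", "none", "none", "pos"]
score = macro_f1(y_true, y_pred)
print(round(100 * score, 2))
```

Scores such as 87.37 correspond to this value multiplied by 100.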

130. ❌ Selective Neuron Amplification for Training-Free Task Enhancement

Authors: Ryyan Akhtar Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07098v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

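Scoring rationale: The paper explicitly studies why large language models (LLMs) fail on tasks at inference time and proposes Selective Neuron Amplification, which strengthens the influence of task-relevant neurons without changing model parameters. This is highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (10), because the core of the paper is the study of LLM inference mechanisms. Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in nor relevant to the abstract and receive 0. The paper is innovative research on the technical principles of large models and fits the research background.

!!! tip deepseek-chat TL;DR

The paper studies why large language models fail on tasks, finds that the cause is under-activation of certain internal circuits rather than missing knowledge, and proposes Selective Neuron Amplification, which strengthens the influence of relevant neurons at inference time to improve performance.

Abstract (translated)

Large language models often perform poorly on tasks they seem to already understand. Our experiments suggest that this stems less from missing knowledge than from certain internal circuits not being sufficiently activated during inference. This paper explores Selective Neuron Amplification (SNA), which increases the influence of task-relevant neurons without changing the model's parameters. The technique operates at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain and has little effect when the model is already confident. This suggests that some model failures are due to weak activation rather than a lack of capability.

Abstract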

Large language models often fail on tasks they seem to already understand. In our experiments, this appears to be less about missing knowledge and more about certain internal circuits not being strongly activated during inference. We explore Selective Neuron Amplification, which increases the influence of task relevant neurons without changing the model’s parameters. The method works at inference time and does not permanently alter the model. SNA helps mainly when the model is uncertain, while having low effect when the model is already confident. This suggests that some model failures are due to weak activation rather than lack of capability.

Keywords: Large language models, Selective Neuron Amplification, inference, task enhancement, internal circuits, neuron activation, training-free, model uncertainty
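The abstract does not say how neurons are selected or by how much they are scaled, so the following is only a sketch of the general idea: multiply chosen hidden activations by a factor during the forward pass while leaving the weights untouched. The two-layer toy network, its weights, the neuron index set, and the factor `alpha` are all hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, v) for v in vector]

def forward(w1, w2, x, amplify=(), alpha=1.0):
    """Forward pass; hidden units listed in `amplify` are scaled by `alpha`."""
    hidden = relu(matvec(w1, x))
    hidden = [h * alpha if i in amplify else h for i, h in enumerate(hidden)]
    return matvec(w2, hidden)

# Hypothetical toy weights: hidden unit 2 carries weak evidence for class 1.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]]
x = [1.0, 0.9]

base = forward(W1, W2, x)                              # near-tie, class 0 wins
boosted = forward(W1, W2, x, amplify={2}, alpha=3.0)   # class 1 wins
print(base, boosted)
```

In this toy case the two logits are close, so scaling one weakly active unit flips the prediction. When the margin is already large the same scaling changes little, which is consistent with the abstract's observation that the method helps mainly when the model is uncertain.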

131. ❌ Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Authors: Elyas Irankhah, Samah Fodeh Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07116v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

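Scoring rationale: The paper mainly studies electronic health record question answering with large language models (Claude Sonnet 4, GPT-4o, GPT-5.2, GPT-5.1, DeepSeek-R1). It is an application of AI in the biomedical domain, so it is highly relevant to ‘Large Language Models’ and ‘AI for Science’ (10). It involves evidence alignment, reasoning limitations, and few-shot prompting, which gives it some connection to ‘Instruction Tuning/Alignment’, ‘Retrieval-Augmented Generation’, ‘Chain of Thought’, ‘System 2 Thinking’, ‘Hallucination Mitigation’, and ‘In-context Learning’ (5). Other keywords such as MoE, quantization, and model compression are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

The paper addresses electronic health record question answering with multiple large language models and ensemble methods. Model diversity, ensemble voting, and additional prompt context improve evidence alignment and answer generation, and the paper reports the system's best subtask scores on the development set.

Abstract (translated)

This paper describes the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and comprises four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline that combines Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2 through ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. The experiments yield three main findings. First, model diversity and ensemble voting consistently improve performance over single-model baselines. Second, the full clinician answer paragraph is provided as additional prompt context for evidence alignment. Third, development-set results indicate that alignment accuracy is limited mainly by reasoning. The best development-set scores are 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.

Abstract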

We describe the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and contains four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2-ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. Our experiments show three main findings. First, model diversity and ensemble voting consistently improve performance compared to single-model baselines. Second, the full clinician answer paragraph is provided as additional prompt context for evidence alignment. Third, results on the development set show that alignment accuracy is mainly limited by reasoning. The best scores on the development set reach 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.

Keywords: EHR Question Answering, Large Language Models, Model Ensembles, Evidence Alignment, Few-shot Prompting, Clinical NLP, ArchEHR-QA, Reasoning Limitation
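The abstract mentions voting strategies across model ensembles without giving details. A minimal sketch of one common choice, per-item majority voting, is shown below. The label names (`rel`, `irr`), the three prediction lists, and the tie-breaking rule (first model's label among the tied ones) are assumptions made for illustration, not the team's published procedure.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per item.

    `predictions` holds one list of labels per model, all of equal length.
    Ties go to the label that appears first in model order.
    """
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted

# Hypothetical evidence-sentence labels from three models for four sentences.
model_a = ["rel", "irr", "rel", "irr"]
model_b = ["rel", "rel", "irr", "irr"]
model_c = ["irr", "rel", "rel", "irr"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

No single model labels all four sentences the way the ensemble does, which illustrates why diverse models can help when their errors do not coincide.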

132. ❌ SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)

Authors: Liang-Chih Yu, Jonas Becker, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Lung-Hao Lee, Ying-Lung Lin, Jin Wang, Jan Philip Wahle, Terry Ruas, Natalia Loukachevitch, Alexander Panchenko, Ilseyar Alimova, Lilian Wanzare, Nelson Odhiambo, Bela Gipp, Kai-Wei Chang, Saif M. Mohammad Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07066v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

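Scoring rationale: This paper covers the organization and evaluation of a shared task on aspect-based sentiment analysis (ABSA) and stance detection, which belongs to affective computing within natural language processing (NLP). It focuses on task design, dataset construction, evaluation metrics, and baseline systems, and does not involve large models, innovations in the technical principles of deep learning, or applications in scientific fields. All keywords concern topics such as large-model technology, training methods, inference optimization, agent systems, and AI for science, none of which the paper mentions, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves on traditional ABSA by modeling sentiment along valence-arousal dimensions rather than categorical polarity labels and extends the approach to stance analysis of public issues. The task attracted more than 400 participants, and the paper reports baseline results and analyzes key design choices.

Abstract (translated)

We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves on traditional aspect-based sentiment analysis (ABSA) by modeling sentiment along valence-arousal (VA) dimensions rather than with categorical polarity labels. To extend ABSA from consumer reviews to public-issue discourse (for example political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as a regression problem in the VA space. The task has two tracks: Track A (DimABSA) and Track B (DimStance). Track A contains three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction. Track B contains only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants and resulted in 112 final submissions and 42 system description papers. We report baseline results, discuss the top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available in our GitHub repository.

Abstract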

We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence-arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.

Keywords: Dimensional Aspect-Based Sentiment Analysis, Valence-Arousal dimensions, Stance detection, Continuous F1 metric, Shared task, SemEval-2026, Aspect sentiment regression, Public-issue discourse

133. ❌ Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English

Authors: Iza Škrjanec, Irene Elisabeth Winther, Vera Demberg, Stefan L. Frank Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07067v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

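Scoring rationale: The paper studies cross-lingual activation patterns in bilingual (Dutch-English) language models and compares them with human bilingual reading. Its core involves training and evaluating Transformer models, so it is related to ‘Large Language Models’ (weight 1.0), but the models are small and focused on a bilingual task, hence 5. The models are trained under different vocabulary-sharing conditions, which involves the basic training process covered by ‘Pre-training’ (weight 1.0), hence 5. The study analyzes model behavior (such as surprisal and embedding similarity) to understand the underlying mechanisms, which relates to ‘Mechanistic Interpretability’ (weight 1.0), hence 5. Other keywords such as MoE, SLMs, Scaling Laws, SFT, and Alignment are not addressed in the paper and receive 0. Application-area keywords such as AI for Science are not relevant and receive 0.

!!! tip deepseek-chat TL;DR

This study examines whether cross-lingual activation in bilingual language models resembles that of human bilinguals. By training Dutch-English Transformer models and analyzing their behavior, it finds that the models partially reproduce human patterns only under a specific vocabulary-sharing condition, and that their match with human processing depends on how lexical overlap is encoded.

Abstract (translated)

Bilingual speakers show cross-lingual activation during reading, especially when processing words that share a surface form. Cognates (friends) typically produce facilitation, whereas interlingual homographs (false friends) cause interference or no significant effect. This study examines whether cross-lingual activation in bilingual language models mirrors these patterns. We train Dutch-English causal Transformer models under four vocabulary-sharing conditions, manipulating representation by controlling whether (false) friends share embeddings or receive language-specific embeddings. Using psycholinguistic stimuli from bilingual reading studies, we evaluate the models through surprisal and embedding similarity analyses. The models largely maintain language separation, and cross-lingual effects arise mainly when embeddings are shared. Under those conditions, both friends and false friends show facilitation relative to controls. Regression analyses indicate that these effects are driven mainly by word frequency rather than by consistency in form-meaning mapping. Only when friends alone share embeddings do the models reproduce the qualitative processing patterns of bilinguals. Overall, bilingual language models capture some cross-lingual activation effects, but their alignment with human processing depends critically on how lexical overlap is encoded, which may limit their explanatory adequacy as models of bilingual reading.

Abstract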

Bilingual speakers show cross-lingual activation during reading, especially for words with shared surface form. Cognates (friends) typically lead to facilitation, whereas interlingual homographs (false friends) cause interference or no effect. We examine whether cross-lingual activation in bilingual language models mirrors these patterns. We train Dutch-English causal Transformers under four vocabulary-sharing conditions that manipulate whether (false) friends receive shared or language-specific embeddings. Using psycholinguistic stimuli from bilingual reading studies, we evaluate the models through surprisal and embedding similarity analyses. The models largely maintain language separation, and cross-lingual effects arise primarily when embeddings are shared. In these cases, both friends and false friends show facilitation relative to controls. Regression analyses reveal that these effects are mainly driven by frequency rather than consistency in form-meaning mapping. Only when just friends share embeddings are the qualitative patterns of bilinguals reproduced. Overall, bilingual language models capture some cross-linguistic activation effects. However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.

Keywords: bilingual language models, cross-lingual activation, Transformer models, vocabulary-sharing, psycholinguistic stimuli, embedding similarity, human processing comparison, Dutch-English
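The surprisal analysis mentioned in the abstract rests on a standard quantity: the surprisal of a word is the negative log probability a language model assigns to it in context. The sketch below shows how a facilitation effect could be read off as a surprisal difference between a control word and a target word. The probabilities are invented, and the helper name `facilitation` is an illustrative assumption rather than the paper's analysis code.

```python
import math

def surprisal_bits(prob):
    """Surprisal in bits of a word with model probability `prob`."""
    return -math.log2(prob)

def facilitation(control_prob, target_prob):
    """Positive value: the target is less surprising than its control."""
    return surprisal_bits(control_prob) - surprisal_bits(target_prob)

# Hypothetical model probabilities for a cognate and a matched control word.
cognate_prob = 0.25    # 2 bits of surprisal
control_prob = 0.125   # 3 bits of surprisal

effect = facilitation(control_prob, cognate_prob)
print(effect)
```

A negative value from the same comparison would indicate interference, the pattern human readers sometimes show for false friends.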

134. ❌ Multilingual Embedding Probes Fail to Generalize Across Learner Corpora

Authors: Laurits Lyngbaek, Ross Deans Kristensen-McLachlan Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07095v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

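Scoring rationale: The paper studies whether multilingual embedding models (Qwen3-Embedding) encode a language-general representation of proficiency, using probes on hidden-state activations to predict CEFR proficiency levels. This relates mainly to ‘Large Language Models OR LLMs OR Foundation Models’ (5), because Qwen3-Embedding belongs to a foundation-model family, although the paper focuses on embedding models rather than generative LLMs. It also relates to ‘Mechanistic Interpretability OR Explainable AI’ (5), because probing is a method for interpreting a model's internal representations. Other keywords such as MoE, SFT, RAG, and quantization are not addressed and receive 0.

!!! tip deepseek-chat TL;DR

Through probing, the paper finds that when multilingual embedding models (such as Qwen3-Embedding) are used to predict language proficiency, their hidden-state activations mainly capture corpus-specific distributional properties rather than a general, transferable representation of proficiency, which suggests that current multilingual embedding models do not straightforwardly encode language-general proficiency.

Abstract (translated)

Do multilingual embedding models encode a language-general representation of proficiency? To investigate this, we train linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR (Common European Framework of Reference for Languages) proficiency levels from learner texts across nine corpora and seven languages. We compare five probe architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, the probes perform strongly ($QWK\approx0.7$), substantially outperforming the surface baseline, with middle layers consistently giving the best predictions. In cross-corpus evaluation, however, performance collapses across all probe types and model sizes. Residual analysis shows that out-of-distribution probes converge towards predicting uniformly distributed labels, which indicates that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.

Abstract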

Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance ($QWK\approx0.7$), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.

Keywords: multilingual embedding models, proficiency prediction, CEFR levels, probe analysis, cross-corpus evaluation, Qwen3-Embedding, hidden-state activations, learner corpora
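The probe results are reported as QWK, which we read as quadratic weighted kappa, a standard agreement measure for ordinal labels that penalizes a prediction more the further it lies from the true level. The following is a minimal pure-Python sketch of that metric. The integer coding of CEFR levels (A1 = 0 through C2 = 5) and the toy predictions are assumptions for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer labels in range(n_classes).

    Assumes the labels are not all identical, otherwise the expected
    disagreement in the denominator is zero.
    """
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den

# Assumed integer coding of CEFR levels for illustration.
CEFR = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}
gold = [CEFR[level] for level in ["A1", "A2", "B1", "B2", "C1", "C2"]]
pred = [CEFR[level] for level in ["A1", "A2", "B2", "B2", "C1", "C1"]]

qwk = quadratic_weighted_kappa(gold, pred, n_classes=6)
print(round(qwk, 3))
```

Values near 1 indicate close ordinal agreement and values near 0 indicate agreement no better than chance, which is how the reported collapse in cross-corpus evaluation would show up.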

135. ❌ IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text

Authors: Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07057v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 该论文专注于印尼语情感分类任务，提出了一种基于IndoBERT Large（335M参数）的上下文条件化情感分类模型。论文的核心是特定领域（印尼语NLP）的应用研究，而非大模型技术原理的创新。与评分关键词的相关性分析如下：1）仅与"Post-training OR Supervised Fine-tuning OR SFT"有一定关联（5分），因为论文提到在IndoBERT Large上进行训练，这属于监督微调范畴，但并非论文的创新重点；2）其他关键词均与大模型技术原理、创新方法或跨领域应用无关，因此评分为0分。论文未涉及大模型在科学领域的创新应用，也未展示任何评分关键词中的技术创新。

!!! tip deepseek-chat TL;DR

该论文针对印尼语情感分析中忽略上下文导致误判的问题，提出了IndoBERT-Sentiment模型，通过引入话题上下文作为输入，在188个主题的31,360个样本上训练，取得了0.856的F1宏平均和88.1%的准确率，比最佳基线模型提升了35.6个F1点。

摘要翻译

现有印度尼西亚语情感分析模型孤立地对文本进行分类，忽略了通常决定语句情感倾向（积极、消极或中性）的主题语境。我们提出IndoBERT-Sentiment——一种语境条件情感分类器，该模型同时接收主题语境和文本作为输入，基于所讨论的主题生成情感预测。该模型基于IndoBERT Large（3.35亿参数）架构构建，在涵盖188个主题的31,360个语境-文本标注对上进行训练，最终实现宏观F1值0.856和准确率88.1%。在与三种广泛使用的通用印尼语情感模型在同一测试集上的直接对比评估中，IndoBERT-Sentiment以35.6个F1值的优势超越最佳基线模型。研究表明，先前在相关性分类任务中得到验证的语境条件机制，能有效迁移至情感分析领域，使模型能够正确分类那些被无视语境的方法系统性误判的文本。

Abstract

Existing Indonesian sentiment analysis models classify text in isolation, ignoring the topical context that often determines whether a statement is positive, negative, or neutral. We introduce IndoBERT-Sentiment, a context-conditioned sentiment classifier that takes both a topical context and a text as input, producing sentiment predictions grounded in the topic being discussed. Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics, the model achieves an F1 macro of 0.856 and accuracy of 88.1%. In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points. We show that context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables the model to correctly classify texts that are systematically misclassified by context-free approaches.

Keywords: Indonesian sentiment analysis, context-conditioned classification, IndoBERT Large, topic-aware sentiment, supervised fine-tuning, F1 macro score, baseline comparison, text classification
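
The abstract above reports a macro F1 and describes a classifier that receives a topical context together with the text. The following is a minimal sketch of both ideas in plain Python. The `build_input` helper and its `[SEP]` separator are hypothetical, since the abstract does not state how the pair is encoded, and `macro_f1` is the standard unweighted mean of per-class F1, not the authors' evaluation script.

```python
LABELS = ("positive", "negative", "neutral")


def build_input(context, text, sep=" [SEP] "):
    """Join a topical context and a text into one string.

    Hypothetical format: the abstract does not specify the real encoding.
    """
    return f"{context}{sep}{text}"


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    gold = ["positive", "negative", "neutral", "positive"]
    pred = ["positive", "negative", "positive", "positive"]
    print(round(macro_f1(gold, pred), 3))
```

Because macro averaging weights every class equally, a class the model never predicts correctly pulls the score down regardless of how rare it is, which is why macro F1 is often reported alongside accuracy.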

136. ❌ Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Authors: Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	8.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

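Scoring rationale: The paper studies LLMs in sales-dialogue settings and falls squarely within the LLM Agents area (highly relevant, 10 points). CustomerLM is trained with SFT and DPO (relevance 8 points). The paper mainly evaluates mainstream LLMs on sales dialogues, which makes it LLM application research (highly relevant, 10 points). Other keywords such as MoE, Scaling Laws, RAG, and Quantization have no direct connection to the paper (0 points).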

!!! tip deepseek-chat TL;DR

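This paper targets sales dialogues, proposing the SalesLLM benchmark and an automatic evaluation framework. It finds that LLMs differ substantially in selling skill, with the best models competitive with human-level performance.

Abstract (Chinese translation)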

销售对话需要在非对称激励下进行多轮次、目标导向的说服，这对大语言模型构成了独特挑战。然而现有对话基准很少衡量交易进展与最终结果。我们推出SalesLLM——一个源于金融服务和消费品实际应用场景的双语（中/英）基准，其构建基于30,074个脚本化配置和1,805个精心设计的可调控难度与人设的多轮对话场景。我们提出全自动评估流程，整合了（i）基于LLM的销售流程进展评估器，以及（ii）用于对话终端购买意图预测的微调BERT分类器。为提升模拟真实度，我们通过监督微调（SFT）和直接偏好优化（DPO）在8,000场众包人员参与的销售对话上训练用户模型CustomerLM，将角色反转率从GPT-4o的17.44%降至8.8%。SalesLLM评分与专家人工评估呈现强相关性（皮尔逊系数r=0.98）。对15个主流大语言模型的实验显示显著性能差异：顶尖模型达到人类水平，而较弱模型表现逊于人类。SalesLLM可作为开发与评估结果导向型销售智能体的可扩展基准。

Abstract

Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performance LLMs are competitive with human-level performance while the less capable ones are worse than human. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.

Keywords: Sales dialogues, LLM benchmark, Multi-turn scenarios, Automatic evaluation, CustomerLM, SFT, DPO, Sales agents
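
The abstract above validates the benchmark by reporting a Pearson correlation (r=0.98) between SalesLLM scores and expert ratings. As a hedged illustration of what that statistic measures, here is a small self-contained implementation. The score lists in the usage example are made-up placeholders, not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder values for illustration only (not from the paper).
    automatic_scores = [0.2, 0.4, 0.5, 0.9]
    human_ratings = [1.0, 2.0, 2.5, 4.5]
    print(pearson_r(automatic_scores, human_ratings))
```

A value near 1 indicates that the automatic score ranks and spaces systems almost linearly with the human ratings. It does not by itself show that the two scales agree in absolute terms.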

137. ❌ Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

Authors: Md Motaleb Hossen Manik, Ge Wang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07035v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

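Scoring rationale: The paper studies accuracy-efficiency tradeoffs between MoE and dense language models on reasoning tasks, directly covering the LLMs, MoE, instruction-tuning, and chain-of-thought keywords (10 points each). It evaluates reasoning performance, which relates to System 2 thinking (5 points); includes few-shot learning (5 points); evaluates TruthfulQA, which touches on factuality (5 points); measures latency and FLOPs, which relates to inference acceleration (5 points); and includes smaller models among those evaluated (5 points). Other keywords, such as data quality, pre-training, alignment methods, RAG, and compression techniques, are not covered by the study (0 points).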

!!! tip deepseek-chat TL;DR

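Through systematic benchmarking, this study compares the accuracy-efficiency tradeoffs of seven dense and MoE reasoning language models on four benchmarks. It finds that sparse activation alone does not guarantee the best practical operating point: the actual tradeoff depends jointly on architecture, prompting strategy, and task composition.

Abstract (Chinese translation)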

专家混合（Mixture-of-Experts, MoE）语言模型通常被认为能提供比稠密模型更优的质量-效率权衡，因为每个词元仅激活参数的一个子集，但该优势的实际价值取决于现实推理约束下的端到端行为。我们提出了一个受控的实证基准测试，涵盖七种近期面向推理的指令微调模型，包括稠密与MoE设计，即Gemma-4-E2B、Gemma-4-E4B、Gemma-4-26B-A4B、Phi-4-mini-reasoning、Phi-4-reasoning、Qwen3-8B和Qwen3-30B-A3B。这些模型在四个基准测试——ARC-Challenge、GSM8K、Math Level 1-3和TruthfulQA MC1——上通过三种提示策略进行评估：零样本、思维链和少样本思维链。本研究共涵盖8,400次模型-数据集-提示组合评估，并记录了准确率、延迟、峰值GPU内存使用量（VRAM）以及近似的每词元浮点运算次数（FLOPs）代理指标。在加权多任务综合评估中，采用少样本思维链的Gemma-4-E4B取得了最佳整体结果，加权准确率达到0.675，平均VRAM为14.9 GB；而Gemma-4-26B-A4B准确率接近（0.663），但内存占用显著更高，达48.1 GB。在任务层面，Gemma系列模型在ARC和数学基准上表现领先，Phi系列模型在TruthfulQA上最强，GSM8K则显示出最大的提示敏感性——例如Phi-4-reasoning的准确率从思维链下的0.67骤降至少样本思维链下的0.11。这些结果表明，仅靠稀疏激活并不能保证最佳的实际操作点：观察到的准确率-效率权衡共同取决于架构、提示策略和任务构成。我们发布了可复现的基准测试流程、汇总结果及配套统计分析，以支持在真实资源约束下对推理大语言模型进行面向部署的评估。

Abstract

Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks – ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 – under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB. At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.

Keywords: Mixture-of-experts, dense models, reasoning language models, accuracy-efficiency tradeoffs, instruction-tuned models, chain-of-thought, benchmark evaluation, inference constraints
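
The "best practical operating point" argument in the abstract above can be read as a Pareto comparison on accuracy and memory. The sketch below is one way to express that, not the authors' analysis code. It uses only the two operating points quoted in the abstract. The prompting setup behind the Gemma-4-26B-A4B figure is not stated there, so it is left unlabeled.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    accuracy: float  # weighted multi-task accuracy (higher is better)
    vram_gb: float   # mean VRAM in GB (lower is better)


def dominates(a, b):
    """True if `a` is no worse than `b` on both axes and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.vram_gb <= b.vram_gb
    strictly_better = a.accuracy > b.accuracy or a.vram_gb < b.vram_gb
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# The two operating points quoted in the abstract.
POINTS = [
    OperatingPoint("Gemma-4-E4B (few-shot CoT)", 0.675, 14.9),
    OperatingPoint("Gemma-4-26B-A4B", 0.663, 48.1),
]

if __name__ == "__main__":
    for point in pareto_front(POINTS):
        print(point.name)
```

On these two numbers the smaller model is both more accurate and lighter on memory, which is the concrete case behind the abstract's claim that sparse activation alone does not settle the tradeoff.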

138. ❌ MARS: Enabling Autoregressive Models Multi-Token Generation

Authors: Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07023v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

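Scoring rationale: The paper proposes MARS, a lightweight fine-tuning method that lets instruction-tuned autoregressive models predict multiple tokens per forward pass, which places it among inference-acceleration techniques for large models. Core relevant keywords: 1) it builds on instruction-tuned models (Instruction Tuning); 2) it uses fine-tuning (Post-training/SFT); 3) parameter-efficient fine-tuning (PEFT, on the grounds that it adds no architectural changes or extra parameters); 4) inference acceleration (Speculative Decoding); 5) KV cache optimization (KV Cache Compression); 6) large-model application (LLMs). Other keywords such as MoE, quantization, and RAG are not covered.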

!!! tip deepseek-chat TL;DR

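The paper proposes MARS, a lightweight fine-tuning method that lets instruction-tuned autoregressive language models predict multiple tokens per forward pass, achieving 1.5-1.7x throughput while maintaining accuracy. It also develops a block-level KV caching strategy to further speed up inference.

Abstract (Chinese translation)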

自回归（AR）语言模型每次前向传播仅生成一个词元，即使在给定上文语境后连续词元具有高度可预测性时亦是如此。本文提出MARS（掩码自回归）方法，这是一种轻量级微调技术，能够指导经过指令调优的AR模型在单次前向传播中预测多个词元。MARS无需修改模型架构，不增加额外参数，且生成的单一模型仍可完全按照原始AR模型方式调用而不会造成性能损失。与需要额外维护草稿模型的推测解码技术不同，亦不同于Medusa等多头预测方案需要附加预测头，MARS仅需在现有指令数据上进行持续训练。当每次前向传播生成一个词元时，MARS在六项标准基准测试中达到或超越了AR基线性能。当允许单步接受多个词元时，该方法在保持基线级别准确性的同时实现了1.5-1.7倍的吞吐量提升。我们进一步开发了面向批量推理的块级KV缓存策略，在Qwen2.5-7B模型上结合KV缓存实现了较AR基线最高1.71倍的实时加速。最后，MARS支持通过置信度阈值进行实时速度调节：在高请求负载场景下，服务系统无需切换模型或重启即可动态提升吞吐量，为实际部署提供了可调节延迟与质量的实用控制机制。

Abstract

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.

Keywords: Autoregressive Models, Multi-Token Generation, Instruction Tuning, Fine-tuning, Inference Acceleration, KV Cache, Throughput Optimization, Lightweight Method
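
The abstract above mentions speed adjustment via confidence thresholding but does not give the acceptance rule. The sketch below shows one plausible rule, stated as an assumption: always keep the first predicted token (so the procedure falls back to ordinary one-token decoding), then keep accepting later tokens until one falls below the threshold. The function name and the `(token, confidence)` input format are hypothetical.

```python
def accept_tokens(candidates, threshold):
    """Accept a prefix of the tokens proposed in one forward pass.

    candidates: list of (token, confidence) pairs in generation order.
    Assumed rule (not taken from the paper): the first token is always
    accepted; each later token is accepted only while its confidence
    stays at or above `threshold`.
    """
    if not candidates:
        return []
    accepted = [candidates[0][0]]
    for token, confidence in candidates[1:]:
        if confidence < threshold:
            break
        accepted.append(token)
    return accepted


if __name__ == "__main__":
    proposal = [("The", 0.99), ("cat", 0.95), ("sat", 0.60), ("on", 0.97)]
    # A lower threshold accepts more tokens per step (higher throughput);
    # a higher threshold is more conservative.
    for threshold in (0.5, 0.9):
        print(threshold, accept_tokens(proposal, threshold))
```

Under this assumed rule the threshold is a single serving-time parameter, which matches the abstract's description of a latency-quality knob that needs no model swap or restart.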

139. ❌ Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico’s Nahuatl

Authors: Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Graham Ranger, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido, Elvys Linhares-Pontes, Luis-Gil Moreno-Jiménez Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07015v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

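Scoring rationale: The paper studies corpus expansion for a low-resource language (Nawatl, also spelled Nahuatl), using controlled duplication to enlarge the training data and improve embedding learning for NLP tasks such as semantic similarity. It is related to "Large Language Models" (8 points), because the paper notes that corpora for training large language models are virtually non-existent for such languages; to "Scaling Laws AND Data Quality" (5 points), because it deals with data quality and expansion methods; and to "Pre-training" (5 points), because embedding learning is part of pre-training. Other keywords (such as MoE, SFT, and RAG) are not covered by the paper and score 0.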

!!! tip deepseek-chat TL;DR

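This paper studies whether expanding a corpus through controlled duplication improves embedding learning for NLP tasks in a low-resource language (Nawatl). Experiments show that incremental duplication yields a moderate improvement on a sentence-level semantic similarity task.

Abstract (Chinese translation)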

本文旨在探讨以下问题：对于计算资源有限的语言，数据复制是否有助于自然语言处理（NLP）？在这类语言（或称$π$-语言）中，可用于训练大语言模型（Large Language Models）的语料库几乎不存在。我们将以纳瓦特尔语（Nawatl）为例，具体研究语料库扩展的影响。纳瓦特尔语是一种黏着式多式综合语（agglutinative and polysynthetic $π$-language），使用人口超过200万，且拥有大量方言变体。本研究旨在通过受控复制的方式，扩展新的$π$-yalli语料库——该库目前仅包含有限数量的纳瓦特尔语文本。实验中，我们将采用增量复制技术（incremental duplication），目标是学习适用于NLP任务的词嵌入（embeddings）。为此，我们训练了静态嵌入（static embeddings），并在句子级语义相似度任务中进行了评估。结果显示，与仅使用未扩展语料库相比，采用增量复制技术后性能得到了适度提升。此外，据我们所知，该技术尚未在现有文献中被应用。

Abstract

In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or $π$-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic $π$-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new $π$-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.

Keywords: corpora duplication, low-resource languages, Nahuatl, embedding learning, semantic similarity, incremental duplication, NLP, π-languages

140. ❌ DTCRS: Dynamic Tree Construction for Recursive Summarization

Authors: Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07012v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

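Scoring rationale: The paper studies RAG methods applied to LLMs, directly covering the "Retrieval-Augmented Generation" and "Large Language Models" keywords (10 points each); it aims to mitigate LLM hallucination and is highly relevant to "Hallucination Mitigation" (8 points); it handles abstractive questions involving multi-step reasoning, giving some relevance to "Chain of Thought" (5 points); the remaining keywords are either not mentioned or unrelated (0 points).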

!!! tip deepseek-chat TL;DR

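The paper proposes DTCRS, which optimizes recursive summarization in RAG by dynamically generating summary trees based on document structure and query semantics. It significantly reduces tree construction time and improves question-answering performance, and it analyzes how well recursive summarization suits different question types.

Abstract (Chinese translation)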

检索增强生成（Retrieval-Augmented Generation，RAG）通过引入外部知识缓解了大语言模型（Large Language Models，LLMs）的幻觉问题。递归摘要通过聚类文本块构建层次化摘要树，整合文档多个部分的信息，从而为涉及多步推理的抽象性问题提供证据支持。然而，摘要树通常包含大量冗余的摘要节点，这不仅增加了构建时间，还可能对问答效果产生负面影响。此外，递归摘要并非适用于所有类型的问题。本文提出DTCRS方法，该方法基于文档结构和查询语义动态生成摘要树。DTCRS通过分析问题类型来判断是否需要构建摘要树，随后对问题进行分解，并利用子问题的嵌入向量作为初始聚类中心，从而减少冗余摘要，同时提升摘要与问题之间的相关性。我们的方法显著降低了摘要树构建时间，并在三项问答任务中取得了实质性提升。此外，我们还探究了递归摘要对不同问题类型的适用性，为未来研究提供了有价值的见解。

Abstract

Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.

Keywords: Retrieval-Augmented Generation, Large Language Models, recursive summarization, hallucination mitigation, dynamic tree construction, question answering, multi-step reasoning, summary tree optimization
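
The abstract above says DTCRS uses sub-question embeddings as initial cluster centers. As a rough illustration of that seeding idea, the sketch below performs a single assignment step: each text-chunk embedding goes to the sub-question seed with the highest cosine similarity. It is not the authors' algorithm. The function names are hypothetical, and the toy 2-D vectors stand in for real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_to_seeds(chunk_vecs, seed_vecs):
    """Group chunk indices under the most similar seed (sub-question)."""
    clusters = {i: [] for i in range(len(seed_vecs))}
    for idx, vec in enumerate(chunk_vecs):
        best = max(range(len(seed_vecs)), key=lambda i: cosine(vec, seed_vecs[i]))
        clusters[best].append(idx)
    return clusters


if __name__ == "__main__":
    sub_question_seeds = [[1.0, 0.0], [0.0, 1.0]]   # toy embeddings
    chunks = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]   # toy embeddings
    print(assign_to_seeds(chunks, sub_question_seeds))
```

Seeding with sub-questions fixes the number of clusters to the number of sub-questions and ties every cluster to part of the query, which is one way to read the abstract's claim of fewer redundant summary nodes.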

141. ❌ ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Authors: Yihao Wang, Zijian He, Jie Ren, Keze Wang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

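Scoring rationale: This paper focuses on retrieval-augmented generation (RAG) for historical research, specifically time-keyed retrieval in Classical Chinese annals. Its core contributions are the ChunQiuTR benchmark and the CTD model, which address temporal consistency in RAG. It is therefore highly relevant to "Retrieval-Augmented Generation OR RAG OR Retrieval-Generation" (10 points), the paper's core technical framework. It is somewhat relevant to "Large Language Models OR LLMs OR Foundation Models" (5 points), because RAG is usually paired with LLMs, though the paper does not examine LLMs themselves in depth. It is somewhat relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points), because it applies AI to historical research (which can be regarded as a branch of science). Other keywords (such as MoE, SFT, and RLHF) are unrelated to the paper and score 0.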

!!! tip deepseek-chat TL;DR

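This paper tackles time-keyed retrieval in Classical Chinese annals, proposing the ChunQiuTR benchmark and the CTD model to improve the temporal consistency of retrieval-augmented generation in historical research.

Abstract (Chinese translation)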

检索决定了语言模型在检索增强生成（RAG）中如何获取和基于知识。在历史研究中，目标往往并非任意相关的段落，而是特定统治月份的确切记录，其中时间一致性与主题相关性同等重要。这对于中国古典编年史尤其具有挑战性，因为时间是通过简洁、隐含、非公历的纪年短语表达的，必须从上下文语境中解读，因此语义上看似合理的证据在时间上仍可能无效。我们提出了 ChunQiuTR，一个基于《春秋》及其注释传统构建的、以时间为键的检索基准。ChunQiuTR 以月份级别的纪年键组织记录，并包含了反映现实检索失败的时间邻近混淆项。我们进一步提出了 CTD（历法时间双编码器），一种时间感知的双编码器，它结合了基于傅里叶变换的绝对历法上下文与相对偏移偏置。实验表明，在时间键控评估下，该方法相较于强大的语义双编码器基线模型取得了持续的性能提升，这支持了检索时的时间一致性是下游可靠历史RAG的关键前提。我们的代码和数据集可在 https://github.com/xbdxwyh/ChunQiuTR 获取。

Abstract

Retrieval shapes how language models access and ground knowledge in retrieval-augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non-Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition. ChunQiuTR organizes records by month-level reign keys and includes chrono-near confounders that mirror realistic retrieval failures. We further propose CTD (Calendrical Temporal Dual-encoder), a time-aware dual-encoder that combines Fourier-based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at https://github.com/xbdxwyh/ChunQiuTR.

Keywords: Retrieval-Augmented Generation, Temporal Retrieval, Classical Chinese Annals, Time-keyed Retrieval, Dual-encoder, Historical Research, Temporal Consistency, Benchmark
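
The abstract above describes CTD as combining Fourier-based absolute calendrical context with relative offset biasing, without giving formulas. The sketch below shows a generic version of both ingredients so the terms are concrete. Everything in it is an assumption: the month-index representation, the chosen periods, the linear penalty, and the `scale` value are illustrative and are not taken from the paper or its repository.

```python
import math


def fourier_time_features(month_index, periods=(12, 120, 1200)):
    """Encode an absolute month index as sin/cos pairs at several periods.

    `periods` (in months) is a hypothetical choice for illustration.
    """
    features = []
    for period in periods:
        angle = 2 * math.pi * month_index / period
        features.extend((math.sin(angle), math.cos(angle)))
    return features


def relative_offset_bias(query_month, record_month, scale=0.1):
    """Score penalty that grows with the month gap (assumed linear form)."""
    return -scale * abs(query_month - record_month)


if __name__ == "__main__":
    print(len(fourier_time_features(37)))
    # A record two months away is penalized relative to an exact match.
    print(relative_offset_bias(37, 37), relative_offset_bias(37, 39))
```

In a dual-encoder, features like these would typically be combined with the text embeddings, and a bias like this would be added to the query-record similarity, so that a semantically similar record from the wrong month ranks lower.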

142. ❌ iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

Authors: Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06902v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

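Scoring rationale: The paper uses LLMs to generate natural text with causal graph annotations, directly covering "Large Language Models" and "Chain of Thought" (CoT reasoning is used to iteratively refine concept selection). "System 2 Thinking" and "Self-Correction" are relevant because iTAG involves iterative examination and refinement. "Hallucination Mitigation" is relevant because the method aims to improve annotation accuracy. "AI for Science" is relevant because the research supports causal discovery in scientific fields. Other keywords such as MoE, SFT, and RAG are not mentioned or applied in the paper.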

!!! tip deepseek-chat TL;DR

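The paper proposes iTAG, which combines real-world concept assignment with Chain-of-Thought reasoning and uses large language models to generate natural text with highly accurate causal graph annotations, resolving the tradeoff between text naturalness and annotation accuracy in existing methods.

Abstract (Chinese translation)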

从文本中进行因果发现的一个根本障碍在于，由于高昂的标注成本，缺乏可作为基准事实的因果标注文本数据。这催生了一项重要任务：生成带有因果图标注的文本。早期的基于模板的生成方法以牺牲文本自然度为代价，换取较高的因果图标注准确性。近期依赖大语言模型（LLM）的方法通过LLM直接从目标图生成自然文本，但无法保证因果图标注的准确性。因此，我们提出了iTAG方法，它在现有依赖LLM的方法将因果图转换为文本之前，先为节点分配现实世界中的概念。iTAG将这一过程构建为一个以因果图为目标的逆问题，通过思维链（Chain-of-Thought, CoT）推理迭代检查和优化概念选择，使得概念之间推导出的关系尽可能与因果图所描述的目标因果关系保持一致。在大量测试中，iTAG同时展现出极高的标注准确性和文本自然度，并且使用生成数据测试基于文本的因果发现算法的结果显示，其与真实世界数据具有高度的统计相关性。这表明，iTAG生成的数据可以作为基于文本的因果发现算法进行可扩展基准测试的有效替代品。

Abstract

A fundamental obstacle to causal discovery from text is the lack of causally annotated text data for use as ground truth, due to high annotation costs. This motivates an important task of generating text with causal graph annotations. Early template-based generation methods sacrifice text naturalness in exchange for high causal graph annotation accuracy. Recent Large Language Model (LLM)-dependent methods directly generate natural text from target graphs through LLMs, but do not guarantee causal graph annotation accuracy. Therefore, we propose iTAG, which performs real-world concept assignment to nodes before converting causal graphs into text in existing LLM-dependent methods. iTAG frames this process as an inverse problem with the causal graph as the target, iteratively examining and refining concept selection through Chain-of-Thought (CoT) reasoning so that the induced relations between concepts are as consistent as possible with the target causal relationships described by the causal graph. iTAG demonstrates both extremely high annotation accuracy and naturalness across extensive tests, and the results of testing text-based causal discovery algorithms with the generated data show high statistical correlation with real-world data. This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.

Keywords: causal discovery, text generation, Large Language Models, Chain-of-Thought reasoning, causal graph annotations, inverse design, natural text, benchmarking

143. ❌ To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models

Authors: Ane G. Domingo-Aldama, Iker De La Iglesia, Maitane Urruela, Aitziber Atutxa, Ander Barrena Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06854v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

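Scoring rationale: The paper studies large language models in the medical domain, specifically developing clinical LLMs through continual domain-adaptive pretraining and comparing them against general-purpose models. It is therefore highly relevant to ‘Large Language Models’, ‘Pre-training/Continual Pre-training/Domain Adaptation’, and ‘AI for Science/Bioinformatics/Cheminformatics’ (10 points each). Other keywords, such as MoE, SFT, RAG, reasoning methods, agents, and compression, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The study evaluates clinical domain-adapted LLMs on medical question answering and finds that they offer only marginal and unstable gains over general-purpose models on English tasks, while the proposed Marmoka models outperform Llama on the Spanish subsets. The English results suggest that existing evaluation frameworks may be insufficient to capture genuine medical expertise.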

摘要翻译

背景：近期研究表明，在标准医学基准测试中，经过领域适配的大型语言模型（LLM）的表现并未持续优于通用模型，这引发了对专业化临床适配必要性的质疑。
方法：我们系统比较了通用与临床LLM在英语和西班牙语多种临床选择题任务上的表现。引入了一种基于扰动的评估基准，用于探究模型的鲁棒性、指令遵循能力以及对对抗性变体的敏感性。评估框架包括单步与双步问题转换、多提示测试及指令引导评估。我们分析了一系列前沿临床模型及其通用对应版本，重点关注基于Llama 3.1的模型。此外，我们提出了Marmoka系列——一组面向英语和西班牙语的轻量级80亿参数临床LLM，该模型通过对医学语料和指令进行持续领域自适应预训练而开发。
结果：实验表明，即使在提出的基于扰动的基准测试下，临床LLM在英语临床任务中也未能持续超越通用模型。然而，在西班牙语子集中，所提出的Marmoka模型相比Llama取得了更优结果。
结论：研究结果表明，在当前短形式多项选择题问答（MCQA）基准下，临床LLM相较于英语通用模型仅能提供有限且不稳定的性能提升，这暗示现有评估框架可能不足以捕捉真正的医学专业能力。我们进一步发现，通用模型与临床模型在指令遵循和严格输出格式方面均存在显著局限。最后，我们通过Marmoka模型证明，能够成功为西班牙语等低资源语言开发出鲁棒的医学LLM。

摘要 (Abstract)

BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical adaptation. METHODS: We systematically compare general and clinical LLMs on a diverse set of multiple choice clinical question answering tasks in English and Spanish. We introduce a perturbation based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations. Our evaluation includes, one-step and two-step question transformations, multi prompt testing and instruction guided assessment. We analyze a range of state-of-the-art clinical models and their general-purpose counterparts, focusing on Llama 3.1-based models. Additionally, we introduce Marmoka, a family of lightweight 8B-parameter clinical LLMs for English and Spanish, developed via continual domain-adaptive pretraining on medical corpora and instructions. RESULTS: The experiments show that clinical LLMs do not consistently outperform their general purpose counterparts on English clinical tasks, even under the proposed perturbation based benchmark. However, for the Spanish subsets the proposed Marmoka models obtain better results compared to Llama. CONCLUSIONS: Our results show that, under current short-form MCQA benchmarks, clinical LLMs offer only marginal and unstable improvements over general-purpose models in English, suggesting that existing evaluation frameworks may be insufficient to capture genuine medical expertise. We further find that both general and clinical models exhibit substantial limitations in instruction following and strict output formatting. Finally, we demonstrate that robust medical LLMs can be successfully developed for low-resource languages such as Spanish, as evidenced by the Marmoka models.

Keywords: large language models, clinical adaptation, domain-adaptive pretraining, medical question answering, evaluation benchmark, robustness, instruction following, low-resource languages

144. ❌ Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Authors: Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang, Jincheng Yu, Jose M. Alvarez, Pavlo Molchanov, Ping Luo, Song Han, Ligeng Zhu, Enze Xie Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06832v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

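Scoring rationale: The paper studies inference acceleration for vision-language models (VLMs), which is closely tied to LLM technology (10 points). Its main contributions are: 1) KV-cache-compatible parallel decoding and speculative block decoding (highly relevant to ‘KV Cache Compression’ and ‘Speculative Decoding’, 10 points each); 2) FP8 quantization (highly relevant to ‘Quantization’, 10 points); 3) a focus on edge-device deployment (relevant to ‘Small Language Models’, 8 points); 4) model adaptation via direct conversion (relevant to ‘Post-training’, 8 points); 5) adaptation of a pretrained model (relevant to ‘Pre-training’, 5 points). Other keywords, such as MoE, RAG, and Alignment, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the inefficiency of autoregressive decoding for vision-language models on edge devices, the paper proposes Fast-dVLM, which uses block diffusion to enable parallel and speculative decoding. With SGLang integration and FP8 quantization, it achieves over 6x end-to-end inference speedup while matching the generation quality of its autoregressive counterpart.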

摘要翻译

视觉语言模型（VLMs）主要依赖自回归解码方式，即逐词元生成文本，这从根本上限制了推理吞吐量。该限制在机器人学和自动驾驶等物理人工智能场景中尤为突出，因为此类场景中的VLMs通常以批大小为1的形式部署在边缘设备上，导致自回归解码受限于内存带宽，硬件并行能力未能充分利用。尽管基于分块的离散扩散方法已在并行文本生成中展现出潜力，但将其扩展至VLMs仍面临挑战，因为需要同时处理连续的视觉表征和离散的文本词元，并保持预训练的多模态能力。本文提出Fast-dVLM，一种基于分块扩散的视觉语言模型，它支持兼容KV缓存的并行解码与推测性分块解码，以实现推理加速。我们系统比较了两种自回归到扩散的转换策略：一种是两阶段方法，先在纯文本扩散微调中调整大语言模型骨干网络，再进行多模态训练；另一种是直接方法，单阶段完成整个自回归视觉语言模型的转换。在可比的训练成本下，直接转换因能利用已实现多模态对齐的视觉语言模型而显著更高效，因此我们将其作为推荐方案。我们引入了一系列多模态扩散适配技术，包括分块大小退火、因果上下文注意力、自动截断掩码和视觉高效拼接，这些技术共同实现了视觉语言模型环境下有效的分块扩散。在11个多模态基准上的大量实验表明，Fast-dVLM在生成质量上与其自回归版本相当。通过集成SGLang并采用FP8量化，Fast-dVLM实现了超过自回归基线6倍的端到端推理加速。

摘要 (Abstract)

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations, block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation, that collectively enable effective block diffusion in the VLM setting. Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.

Keywords: Vision-language models, Autoregressive decoding, Block diffusion, KV-cache, Parallel decoding, Speculative decoding, Inference acceleration, Edge devices
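
The abstract above notes that at batch size one, autoregressive decoding is memory-bandwidth-bound. A rough way to see why producing several tokens per forward pass and using FP8 weights can help is a back-of-the-envelope bound: if every forward pass must stream all weights from memory once, throughput is at most bandwidth divided by weight bytes, multiplied by the tokens produced per pass. The sketch below computes only that bound. The function name, model size, bandwidth, and tokens-per-pass values are hypothetical and do not come from the paper, and KV-cache and activation traffic are ignored.

```python
def decode_tokens_per_second(params_billion: float,
                             bytes_per_param: float,
                             bandwidth_gb_s: float,
                             tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode throughput when weight reads dominate.

    Assumes each forward pass streams every weight once (batch size one)
    and ignores KV-cache and activation traffic.
    """
    weight_gb = params_billion * bytes_per_param   # GB read per forward pass
    passes_per_second = bandwidth_gb_s / weight_gb
    return passes_per_second * tokens_per_pass


# Hypothetical 8B-parameter model on a hypothetical 200 GB/s edge device.
ar_fp16 = decode_tokens_per_second(8, 2, 200)                       # one token per pass
block_fp8 = decode_tokens_per_second(8, 1, 200, tokens_per_pass=2)  # two tokens per pass on average
print(ar_fp16, block_fp8, block_fp8 / ar_fp16)
```

Under these assumptions the two levers multiply: halving the bytes per parameter and doubling the average tokens per pass each double the bound. Real speedups depend on how many denoising passes a block needs and on compute limits that this estimate leaves out.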

145. ❌ SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization

Authors: Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz, P Sam Sahil, Yiran Zhang, Marco Antonio Stranisci, Idris Abdulmumin, Özge Alaçam, Cengiz Acartürk, Aisha Jabr, Saba Anwar, Abinew Ali Ayele, Elena Tutubalina, Aung Kyaw Htet, Xintong Wang, Surendrabikram Thapa, Tanmoy Chakraborty, Dheeraj Kodati, Sahar Moradizeyveh, Firoj Alam, Ye Kyaw Thu, Shantipriya Parida, Ihsan Ayyub Qazi, Lilian Wanzare, Nelson Odhiambo Onyango, Clemencia Siro, Ibrahim Said Ahmad, Adem Chanie Ali, Martin Semmann, Chris Biemann, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

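Scoring rationale: The paper describes a SemEval shared task on online polarization detection, covering the multilingual dataset, task design, and analysis of results. Its content is unrelated to any of the scoring keywords, which mainly concern the principles, training methods, inference optimization, and applications of large models; it does not involve specific research on large models, deep learning techniques, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper presents SemEval-2026 Task 9, a shared task on detecting polarized online content across multiple languages, and reports baseline results together with a performance analysis of the best-performing systems.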

摘要翻译

本文介绍SemEval-2026 Task 9——一项涵盖22种语言、包含超过11万条标注实例的在线极化检测共享任务。每个数据实例均采用多标签标注，包含极化存在性、极化类型及极化表现形式。参与者需在三个子任务中进行标签预测：(1) 检测极化是否存在，(2) 识别极化类型，(3) 辨识极化表现形式。该任务吸引了全球超过1000名参与者，在Codabench平台上收到逾1万次提交。我们最终收到来自67个团队的提交结果及73篇系统描述论文。本文报告了基线结果，分析了各子任务和语言中最佳系统的性能表现，并重点阐述了最常用的方法及最有效的技术路径。本任务数据集已公开提供。

摘要 (Abstract)

We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three sub-tasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submission on Codabench. We received final submissions from 67 teams and 73 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset of this task is publicly available.

Keywords: online polarization detection, multilingual, shared task, SemEval, annotated dataset, polarization type, polarization manifestation, baseline results

146. ❌ AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation

Authors: Guanran Luo, Wentao Qiu, Wanru Zhao, Wenhan Lv, Zhongquan Jian, Meihong Wang, Qingqiang Wu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

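Scoring rationale: The paper studies uncertainty quantification (UQ) and hallucination mitigation for LLMs in long-form generation, so it is highly relevant to ‘Large Language Models’ and ‘Hallucination Mitigation’ (10 points each). It addresses long-text generation, which is somewhat related to ‘Long Context LLMs’ (5 points). The AGSC framework reduces inference time by about 60%, which relates to ‘Inference Acceleration’ (5 points). The experiments use the BIO dataset, which is somewhat related to ‘AI for Science’ (5 points). Other keywords, such as MoE, SFT, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

To address hallucination in long-form generation by LLMs, the paper proposes AGSC, an uncertainty quantification framework that uses adaptive granularity and semantic clustering to reduce inference time by about 60% relative to full atomic decomposition while preserving strong correlation with factuality.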

摘要翻译

大语言模型（LLMs）在长文本生成中展现出卓越能力，但其应用受到幻觉问题的制约。不确定性量化（Uncertainty Quantification, UQ）虽对评估可靠性至关重要，但复杂文本结构使得跨异质主题的可靠聚合变得困难；此外，现有方法常忽略中性信息的细微差异，且面临细粒度分解带来的高昂计算成本。为应对这些挑战，我们提出AGSC（自适应粒度与基于高斯混合模型的语义聚类），这是一个专为长文本生成定制的不确定性量化框架。AGSC首先利用自然语言推理（NLI）中性概率作为触发器，将无关信息与不确定性区分开来，从而减少不必要的计算。随后采用高斯混合模型（Gaussian Mixture Model, GMM）软聚类对潜在语义主题进行建模，并为下游聚合分配主题感知权重。在BIO和LongFact数据集上的实验表明，AGSC在实现与事实性最佳相关性的同时，相比完整原子分解方法减少了约60%的推理时间。

摘要 (Abstract)

Large Language Models (LLMs) have demonstrated impressive capabilities in long-form generation, yet their application is hindered by the hallucination problem. While Uncertainty Quantification (UQ) is essential for assessing reliability, the complex structure makes reliable aggregation across heterogeneous themes difficult, in addition, existing methods often overlook the nuance of neutral information and suffer from the high computational cost of fine-grained decomposition. To address these challenges, we propose AGSC (Adaptive Granularity and GMM-based Semantic Clustering), a UQ framework tailored for long-form generation. AGSC first uses NLI neutral probabilities as triggers to distinguish irrelevance from uncertainty, reducing unnecessary computation. It then applies Gaussian Mixture Model (GMM) soft clustering to model latent semantic themes and assign topic-aware weights for downstream aggregation. Experiments on BIO and LongFact show that AGSC achieves state-of-the-art correlation with factuality while reducing inference time by about 60% compared to full atomic decomposition.

Keywords: Large Language Models, Uncertainty Quantification, Hallucination Mitigation, Long-text Generation, Adaptive Granularity, Semantic Clustering, Inference Acceleration, Factuality

147. ❌ GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering

Authors: Guanran Luo, Wentao Qiu, Zhongquan Jian, Meihong Wang, Qingqiang Wu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06794v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

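Scoring rationale: The paper studies Chain-of-Thought reasoning and proposes the GCoT-decoding strategy, so it is highly relevant to ‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’ (15 points). It generates deep reasoning paths, which relates to ‘System 2 Thinking OR Slow Thinking OR In-depth Reasoning’ (10 points), and it builds on large language models, which relates to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). Other keywords, such as MoE, quantization, and RAG, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Because existing CoT-decoding applies only to problems with fixed answer sets, the paper proposes GCoT-decoding, a general decoding strategy built on a two-stage branching method and a path aggregation mechanism. It maintains strong performance on fixed QA tasks and achieves significant improvements on free QA tasks.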

摘要翻译

思维链推理能够增强大语言模型的能力，但通常需要人工设计提示来引导模型。近期提出的CoT-decoding方法使模型能够在无需提示的情况下生成思维链式推理路径，但该方法仅适用于答案集合固定的问题。为突破这一局限，我们提出一种通用解码策略GCoT-decoding，将其适用范围扩展至更广泛的问答任务。GCoT-decoding采用结合斐波那契采样与启发式错误回溯的两阶段分支方法生成候选解码路径，随后将每条路径拆分为推理段与答案段以精确计算路径置信度，最后通过聚合语义相似的路径来识别共识答案，替代传统的多数投票机制。我们在涵盖固定答案与开放答案问答任务的六个数据集上进行了大量实验。本方法不仅在固定答案问答任务上保持强劲性能，还在开放答案问答任务上取得显著提升，充分证明了其通用性。

摘要 (Abstract)

Chain-of-Thought reasoning can enhance large language models, but it requires manually designed prompts to guide the model. Recently proposed CoT-decoding enables the model to generate CoT-style reasoning paths without prompts, but it is only applicable to problems with fixed answer sets. To address this limitation, we propose a general decoding strategy GCoT-decoding that extends applicability to a broader range of question-answering tasks. GCoT-decoding employs a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It then splits each path into a reasoning span and an answer span to accurately compute path confidence, and finally aggregates semantically similar paths to identify a consensus answer, replacing traditional majority voting. We conduct extensive experiments on six datasets covering both fixed and free QA tasks. Our method not only maintains strong performance on fixed QA but also achieves significant improvements on free QA, demonstrating its generality.

Keywords: Chain-of-Thought, reasoning paths, decoding strategy, question answering, Fibonacci sampling, heuristic backtracking, path confidence, consensus answer
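
The final step described in the abstract above, aggregating semantically similar paths instead of taking an exact-match majority vote, can be illustrated with a small sketch. This is not the paper's implementation: token-level Jaccard overlap stands in for a real semantic similarity measure, the greedy clustering and the `threshold` value are assumptions, and the names `jaccard` and `consensus_answer` are hypothetical. The branching stage (Fibonacci sampling and error backtracking) is not covered.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consensus_answer(paths, threshold=0.5):
    """Pick a consensus answer from (answer_text, confidence) pairs.

    Answers are greedily grouped with the first cluster whose representative
    is similar enough; each cluster accumulates the confidence of its members,
    and the representative of the heaviest cluster is returned.
    """
    clusters = []  # each cluster: {"rep": str, "weight": float}
    for answer, confidence in paths:
        for cluster in clusters:
            if jaccard(answer, cluster["rep"]) >= threshold:
                cluster["weight"] += confidence
                break
        else:
            clusters.append({"rep": answer, "weight": confidence})
    best = max(clusters, key=lambda c: c["weight"])
    return best["rep"], best["weight"]


# Exact-match voting would treat these three strings as three different answers.
paths = [("the capital is Paris", 0.6), ("Paris is the capital", 0.5), ("Lyon", 0.9)]
print(consensus_answer(paths))
```

In this toy input the two paraphrased answers pool their confidence and outweigh the single higher-confidence outlier, which is the behavior that exact string matching cannot provide for free-form answers.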

148. ❌ Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

Authors: Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06805v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	15.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

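Scoring rationale: The paper improves the Chain-of-Thought (CoT) reasoning framework by proposing Cognitive Loop of Thought (CLoT), so it is highly relevant to ‘Chain of Thought OR CoT Reasoning OR Multi-step Reasoning’ (15 points). It uses LLMs for mathematical reasoning, which relates to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). The method includes a backward verification mechanism, which relates to ‘System 2 Thinking OR Slow Thinking OR In-depth Reasoning’ (10 points). The verification and pruning mechanisms involve self-correction, which is somewhat related to ‘Self-Correction OR Self-Improvement OR Self-Reflection’ (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, and Pre-training, are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

To address long Chain-of-Thought reasoning sequences that exceed computational limits, the paper proposes Cognitive Loop of Thought, a framework based on a Reversible Hierarchical Markov Chain that combines hierarchical problem decomposition, backward verification, and pruning. It improves reasoning accuracy and efficiency on four mathematical benchmarks.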

摘要翻译

多步思维链通过利用显式推理步骤显著提升了大型语言模型的数学推理能力。然而，长思维链的广泛使用常导致序列长度超出可管理的计算限制。现有方法试图通过类似马尔可夫链的结构减少KV缓存冗余来缓解此问题，但引入了两个关键局限：固有的无记忆性（上下文丢失）和有限的后向推理能力。为应对这些局限，我们提出了一种基于可逆分层马尔可夫链的新型思维链框架，称为认知循环思维，并构建了后向推理数据集CLoT-Instruct。在CLoT中，问题被分解为具有层次依赖关系的子问题。受人类认知过程启发，我们在每个层级引入了后向验证机制。此外，我们实施了剪枝策略：一旦高层子问题通过验证，冗余的低层子问题即被剪除以最大化效率。该方法有效缓解了错误传播并增强了推理鲁棒性。在四个数学基准测试上的实验证明了我们方法的有效性。值得注意的是，在使用GPT-4o-mini的AddSub数据集上，CLoT达到了99.0%的准确率，分别比传统思维链和思维链自洽性优化方法高出4.1%和2.9%。

摘要 (Abstract)

Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

Keywords: Chain-of-Thought, Mathematical Reasoning, Reversible Hierarchical Markov Chain, Cognitive Loop of Thought, Backward Verification, Error Propagation Mitigation, LLM Reasoning, Hierarchical Dependencies

149. ❌ Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

Authors: Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06799v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

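Scoring rationale: The paper diagnoses failure modes of LLMs in algebraic reasoning, so it is highly relevant to ‘Large Language Models’ (10 points). It evaluates instruction-tuned models, which is highly relevant to ‘Instruction Tuning’ (10 points). The study of algebraic reasoning involves ‘Chain of Thought’ and ‘System 2 Thinking’ (10 points each). The paper analyzes the causes of failure through a diagnostic framework, which relates to ‘Mechanistic Interpretability’ (10 points). Other keywords, such as MoE, SLMs, Scaling Laws, RLHF, RAG, and Quantization, are not addressed in the paper (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a nine-dimension algebraic complexity framework for diagnosing LLM failure modes in algebraic reasoning. It finds that working memory is the dominant scale-invariant bottleneck, with every model collapsing between 20 and 30 parallel branches, and it identifies a minimal yet diagnostically sufficient subset of five dimensions that characterizes a model's algebraic reasoning capacity.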

摘要翻译

代数推理始终是对大型语言模型最具信息量的压力测试之一，然而现有基准测试缺乏将失败归因于特定原因的机制。当模型未能解决代数问题时，单一的准确率分数无法揭示失败是由于表达式嵌套过深、运算符过于罕见、中间状态数量过多，还是依赖链过长所致。先前研究虽已独立探讨过个别失败模式，但尚未有框架能在严格实验控制下独立调节各复杂度因子。现有系统亦无法实现复杂度递增问题的自动生成与验证，以追踪模型随时间的进展。我们提出了一个九维代数复杂度框架，其中每个因子在保持其他因子恒定的条件下独立变化，问题生成与验证均由参数化流程自动处理，无需人工标注。每个维度均基于已记录的LLM失败模式，捕捉代数难度的不同结构层面，包括表达式嵌套深度、并行中间结果数量、子表达式复杂度、运算符难度以及依赖推理链长度。我们在全部九个维度上评估了七个指令微调模型（参数量覆盖8B至235B），发现工作记忆是主导的尺度不变瓶颈。所有模型在并行分支数达到20至30时均出现崩溃，这与参数量无关，表明这是硬性的架构约束而非可解决的容量限制。我们的分析进一步识别出一个最小但诊断充分的五维度子集，该子集共同覆盖了所有已记录的代数失败模式，能够完整描绘模型代数推理能力的复杂度特征谱。

摘要 (Abstract)

Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluated seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model’s algebraic reasoning capacity.

Keywords: Algebraic Reasoning, Large Language Models, Failure Diagnosis, Complexity Framework, Working Memory Bottleneck, Instruction-tuned Models, Nine-dimension Analysis, Model Evaluation
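
The idea of varying one complexity factor while holding the others fixed, with automatic verification, can be illustrated with a minimal sketch. It is not the paper's pipeline: the generator `make_expr`, the checker `nesting_depth`, and the choice of integer arithmetic are all assumptions made for illustration. Only nesting depth is swept; the operand count per node stays at two.

```python
import random
from fractions import Fraction


def make_expr(depth, rng, n_operands=2):
    """Return (text, value) for an expression nested `depth` levels deep."""
    if depth == 0:
        leaf = rng.randint(1, 9)
        return str(leaf), Fraction(leaf)
    parts = [make_expr(depth - 1, rng, n_operands) for _ in range(n_operands)]
    op = rng.choice("+-*")
    text = "(" + f" {op} ".join(t for t, _ in parts) + ")"
    # Fold left to right, matching Python's left-associative evaluation.
    value = parts[0][1]
    for _, operand in parts[1:]:
        if op == "+":
            value += operand
        elif op == "-":
            value -= operand
        else:
            value *= operand
    return text, value


def nesting_depth(text):
    """Maximum parenthesis depth, used to confirm the controlled factor."""
    depth = best = 0
    for ch in text:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth -= 1
    return best


rng = random.Random(0)
for d in range(1, 6):  # sweep depth only; operand count stays fixed at 2
    text, value = make_expr(d, rng)
    # The ground truth is produced during generation, so no annotation is needed.
    print(d, nesting_depth(text), value)
```

Because the answer is computed while the problem is built, each generated item carries its own verification key, which is what makes a no-annotation pipeline possible in principle.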

150. ❌ Video-guided Machine Translation with Global Video Context

Authors: Jian Chen, JinZe Lv, Zi Long, XiangHua Fu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

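Scoring rationale: The paper studies video-guided multimodal translation. It builds a context set using a pretrained semantic encoder and vector-database subtitle retrieval, and designs a region-aware cross-modal attention mechanism. It is unrelated to most of the large-model keywords, but has some connection to 'Retrieval-Augmented Generation' (a retrieval mechanism builds the context) and a weak connection to 'Pre-training' (a pretrained encoder is used). No other keyword is relevant.

!!! tip deepseek-chat TL;DR

Existing video-guided multimodal translation methods are limited to locally aligned segments, which hurts them on long videos. This paper proposes a globally video-guided framework that builds a context set of relevant video segments via a pretrained semantic encoder and vector-database retrieval, adds a region-aware cross-modal attention mechanism, and significantly outperforms baseline models on a large-scale documentary translation dataset.

Abstract translation (Chinese)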

近年来，视频引导的多模态翻译（Video-guided Multimodal Translation, VMT）取得了显著进展。然而，现有方法大多依赖于与字幕一对一局部对齐的视频片段，限制了其在长视频中捕捉跨多个片段的全局叙事语境的能力。为克服这一局限，我们提出了一种全局视频引导的多模态翻译框架，该框架利用预训练的语义编码器和基于向量数据库的字幕检索，构建与目标字幕语义紧密相关的视频片段语境集合。我们采用注意力机制聚焦于高度相关的视觉内容，同时保留其余视频特征以维持更广泛的上下文信息。此外，我们设计了一种区域感知的跨模态注意力机制，以增强翻译过程中的语义对齐。在大规模纪录片翻译数据集上的实验表明，我们的方法显著优于基线模型，凸显了其在长视频场景中的有效性。

Abstract

Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.

Keywords: Video-guided Multimodal Translation, global video context, pretrained semantic encoder, vector database retrieval, cross-modal attention, long-video translation, documentary translation, semantic alignment
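
The retrieval step described in the abstract (finding video segments whose subtitles are semantically close to the target subtitle) can be sketched as a top-k cosine search. This is only an illustration: the toy vectors stand in for the output of a pretrained semantic encoder, an in-memory dict stands in for the vector database, and the names `retrieve_context` and `seg_00x` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_context(query_vec, index, k=2, exclude=None):
    """Return the ids of the k segments most similar to the query subtitle."""
    scored = [
        (cosine(query_vec, vec), seg_id)
        for seg_id, vec in index.items()
        if seg_id != exclude  # skip the segment already aligned to the query
    ]
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored[:k]]


# Toy subtitle embeddings (assumed values, not real encoder output).
index = {
    "seg_001": [1.0, 0.0, 0.0],
    "seg_002": [0.9, 0.1, 0.0],
    "seg_003": [0.0, 1.0, 0.0],
    "seg_004": [0.8, 0.0, 0.2],
}
context_set = retrieve_context(index["seg_001"], index, k=2, exclude="seg_001")
print(context_set)
```

The video features of the returned segments would then form the global context set that the attention mechanism reads from; how the paper weights them is not specified in the abstract.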

151. ❌ From Perception to Autonomous Computational Modeling: A Multi-Agent Approach

Authors: Daniel N. Wilke Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

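Scoring rationale: The core of the paper is using coordinated LLM agents to autonomously execute the complete computational mechanics workflow, from perceptual data to an engineering report. Highly relevant keywords: LLMs (core component), LLM Agents/Autonomous Agents (core of the framework), Multi-agent Systems (coordinated agents), and AI for Science (engineering application). Tool Use is partly relevant (the agents carry out various engineering tasks). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's technical focus.

!!! tip deepseek-chat TL;DR

The paper proposes a framework of coordinated large language model agents that autonomously execute the complete computational mechanics workflow, from perceptual data to an engineering report, and demonstrates its feasibility with a finite element analysis of a steel L-bracket.

Abstract translation (Chinese)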

我们提出一个与求解器无关的框架，其中协调运作的大型语言模型（LLM）智能体能够自主执行完整的计算力学工作流程：从工程构件的感知数据出发，通过几何提取、材料推断、离散化、求解器执行、不确定性量化以及规范符合性评估，最终生成包含可执行建议的工程报告。智能体被形式化为共享上下文空间上的条件化算子，其质量门控机制在流程各层之间引入了条件迭代。我们提出一个数学框架，用于在不确定性下从感知数据中提取工程信息，该框架采用区间界限、概率密度函数和模糊隶属函数，并引入任务相关的保守性策略，以解决当不同极限状态受相反参数趋势控制时“保守”含义的模糊性问题。该框架通过应用于钢制L型支架照片的有限元分析流程进行演示，生成了包含171,504个节点的四面体网格，在三种边界条件假设下完成了七项分析，并通过规范符合性评估揭示了结构失效问题及量化后的重新设计方案。所有结果均为首次自主迭代生成，未经人工修正，这进一步强调任何此类分析都必须由专业工程师进行审查和签字确认。

Abstract

We present a solver-agnostic framework in which coordinated large language model (LLM) agents autonomously execute the complete computational mechanics workflow, from perceptual data of an engineering component through geometry extraction, material inference, discretisation, solver execution, uncertainty quantification, and code-compliant assessment, to an engineering report with actionable recommendations. Agents are formalised as conditioned operators on a shared context space with quality gates that introduce conditional iteration between pipeline layers. We introduce a mathematical framework for extracting engineering information from perceptual data under uncertainty using interval bounds, probability densities, and fuzzy membership functions, and introduce task-dependent conservatism to resolve the ambiguity of what ‘conservative’ means when different limit states are governed by opposing parameter trends. The framework is demonstrated through a finite element analysis pipeline applied to a photograph of a steel L-bracket, producing a 171,504-node tetrahedral mesh, seven analyses across three boundary condition hypotheses, and a code-compliant assessment revealing structural failure with a quantified redesign. All results are presented as generated in the first autonomous iteration without manual correction, reinforcing that a professional engineer must review and sign off on any such analysis.

Keywords: large language model agents, multi-agent systems, autonomous computational modeling, computational mechanics, finite element analysis, engineering workflow, perceptual data, uncertainty quantification
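
The notion of task-dependent conservatism can be made concrete with a small sketch. It assumes a parameter known only as an interval and picks whichever bound makes the demand for a given limit state worst. The `Interval` type, the helper `conservative_value`, and the numbers are illustrative assumptions, not the paper's formulation; the bending-stress formula is the standard one for a rectangular section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval bound on a parameter inferred from perceptual data."""
    lo: float
    hi: float


def conservative_value(bounds, demand_rises_with_param):
    """Pick the bound that maximises demand for the limit state at hand."""
    return bounds.hi if demand_rises_with_param else bounds.lo


def bending_stress(moment, width, thickness):
    """Peak bending stress of a rectangular section: 6M / (b * t^2)."""
    return 6.0 * moment / (width * thickness ** 2)


thickness_mm = Interval(4.0, 6.0)  # assumed estimate from a photograph

# Stress check: stress falls as thickness grows, so the thin end is conservative.
t_stress = conservative_value(thickness_mm, demand_rises_with_param=False)
print(bending_stress(moment=12000.0, width=30.0, thickness=t_stress))

# Self-weight check: mass rises with thickness, so the thick end is conservative.
t_mass = conservative_value(thickness_mm, demand_rises_with_param=True)
print(t_mass)
```

The same interval thus yields two different "conservative" values, which is the ambiguity the abstract says the framework resolves per task.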

152. ❌ When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

Authors: Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang, Juntao Li, Min Zhang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06787v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

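Scoring rationale: The paper studies the efficiency of chain-of-thought (CoT) reasoning in large reasoning models (LRMs) and proposes the DTSR framework for dynamic early exit. Highly relevant keywords: 'Large Language Models' (the paper explicitly studies LRMs), 'Chain of Thought' (CoT reasoning is the core subject), 'System 2 Thinking' (efficiency of in-depth reasoning), 'Self-Correction' (reflection and self-evaluation), and 'Speculative Decoding' (inference acceleration). Other keywords such as MoE, quantization, and RAG are not covered in the paper.

!!! tip deepseek-chat TL;DR

To address overthinking in the chain-of-thought reasoning of large reasoning models, the paper proposes DTSR (Dynamic Thought Sufficiency in Reasoning), a framework that reduces reasoning length by 28.9%-34.9% with minimal performance loss.

Abstract translation (Chinese)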

大型推理模型（LRMs）凭借其强大的推理时扩展能力，在复杂推理任务中取得了显著性能。然而，LRMs常受过度思考问题困扰，导致大量计算冗余并显著降低效率。早期退出方法旨在通过一旦生成足够证据即终止推理来缓解此问题，但现有方法大多依赖手工设计或经验性指标，这些指标不可靠且不实用。在本研究中，我们提出动态思维充分性推理（DTSR），这是一种高效推理的新框架，使模型能够动态评估其思维链（CoT）的充分性，并确定早期退出的最佳时机。受人类元认知启发，DTSR分两阶段运行：（1）反思信号监测，识别作为早期退出潜在线索的反思信号；（2）思维充分性检查，评估当前CoT是否足以推导最终答案。基于Qwen3模型的实验结果表明，DTSR在性能损失最小的情况下将推理长度减少28.9%-34.9%，有效缓解了过度思考问题。我们进一步探讨了LRMs的过度自信与自我评估范式，为早期退出推理提供了有价值的见解。

Abstract

Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been generated, yet existing approaches mostly rely on handcrafted or empirical indicators that are unreliable and impractical. In this work, we introduce Dynamic Thought Sufficiency in Reasoning (DTSR), a novel framework for efficient reasoning that enables the model to dynamically assess the sufficiency of its chain-of-thought (CoT) and determine the optimal point for early exit. Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT is sufficient to derive the final answer. Experimental results on the Qwen3 models show that DTSR reduces reasoning length by 28.9%-34.9% with minimal performance loss, effectively mitigating overthinking. We further discuss overconfidence in LRMs and self-evaluation paradigms, providing valuable insights for early-exit reasoning.

Keywords: Large Reasoning Models, Early Exit, Chain-of-Thought, Overthinking, Efficient Reasoning, Dynamic Thought Sufficiency, Self-evaluation, Inference Acceleration
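
The two-stage control flow in the abstract (watch for a reflection signal, then check whether the chain of thought so far is sufficient) can be sketched as follows. This is a schematic under stated assumptions, not DTSR itself: the cue list `REFLECTION_CUES` is invented, and `is_sufficient` is a stand-in callable for what would be a model self-assessment in the paper.

```python
REFLECTION_CUES = ("wait", "let me check", "double-check")  # assumed cue list


def has_reflection_signal(step):
    """Stage 1: flag steps that look like the model is re-examining its work."""
    lowered = step.lower()
    return any(cue in lowered for cue in REFLECTION_CUES)


def early_exit_point(steps, is_sufficient):
    """Return how many reasoning steps to keep before answering.

    Stage 2 runs only at reflection signals: if the steps produced so far
    already suffice, reasoning stops there instead of continuing.
    """
    for i, step in enumerate(steps):
        if has_reflection_signal(step) and is_sufficient(steps[:i]):
            return i
    return len(steps)


steps = [
    "Compute 12 * 7 = 84.",
    "So the answer is 84.",
    "Wait, let me check the multiplication again.",
    "12 * 7 = 84, confirmed.",
]


def toy_check(prefix):
    """Hypothetical sufficiency check: an answer has already been stated."""
    return any("answer is" in s for s in prefix)


print(early_exit_point(steps, toy_check))
```

Gating the sufficiency check on reflection signals keeps the number of extra checks small, since the check itself costs inference time.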

153. ❌ Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation

Authors: Zhiyu Cao, Peifeng Li, Qiaoming Zhu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06784v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

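Scoring rationale: The paper focuses on context rewriting for multi-party dialogue generation. It proposes the DRCR framework, which uses discourse coherence and response quality as feedback signals and adopts a dynamic self-evolution learning method. The content stays within dialogue systems, natural language processing, and context rewriting, which are traditional NLP areas, and does not touch the large-model techniques, optimization methods, or scientific applications that the keywords target. All keywords are judged unrelated to the paper's core content, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

Colloquial expressions and incomplete utterances make multi-party dialogues hard to understand. The paper proposes DRCR, a context-rewriting framework driven by discourse coherence and response quality feedback and trained with a dynamic self-evolution learning method; experiments on four multi-party dialogue datasets confirm its effectiveness.

Abstract translation (Chinese)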

先前关于多方对话生成的研究主要利用对话中固有的结构信息直接指导生成过程。然而，对话中普遍存在的口语化表达和不完整话语往往阻碍理解，并削弱对话结构表征的保真度，这一现象在多方对话中尤为显著。本研究提出了一种新颖的框架DRCR（话语连贯性与响应引导的上下文重写），通过对话上下文重写来改进多方对话生成。具体而言，DRCR采用话语连贯性和响应质量两种互补的反馈信号，为上下文重写和响应生成构建偏好数据。此外，我们提出了一种动态自进化学习方法，使重写模块和响应生成模块能够在迭代训练循环中通过相互交互持续提升能力。在四个多方对话数据集上进行的全面实验证实了DRCR的有效性。

Abstract

Previous research on multi-party dialogue generation has predominantly leveraged structural information inherent in dialogues to directly inform the generation process. However, the prevalence of colloquial expressions and incomplete utterances in dialogues often impedes comprehension and weakens the fidelity of dialogue structure representations, which is particularly pronounced in multi-party dialogues. In this work, we propose a novel framework DRCR (Discourse coherence and Response-guided Context Rewriting) to improve multi-party dialogue generation through dialogue context rewriting. Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation. Moreover, we propose a dynamic self-evolution learning method that allows the rewriter and responder to continuously enhance their capabilities through mutual interaction in an iterative training loop. Comprehensive experiments conducted on four multi-party dialogue datasets substantiate the effectiveness of DRCR.

Keywords: multi-party dialogue generation, dialogue context rewriting, discourse coherence, response quality, preference data, dynamic self-evolution learning, iterative training loop

154. ❌ Multilingual Cognitive Impairment Detection in the Era of Foundation Models

Authors: Damar Hoogland, Boshko Koloski, Jaya Caporusso, Tine Kolenik, Ana Zwitter Vitez, Senja Pollak, Christina Manouilidou, Matthew Purver Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06758v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

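Scoring rationale: The paper's core study uses large language models (LLMs/Foundation Models) for multilingual cognitive impairment detection, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). The work is an application of AI in medicine and science and is therefore partly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5), although it is not core bioinformatics or cheminformatics. The paper mainly evaluates zero-shot LLMs as classifiers and does not cover the technical principles or innovations behind the other keywords, such as MoE, SLMs, training techniques (pre-training, fine-tuning, alignment), inference optimization, agent systems, or model compression, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study compares zero-shot large language models (LLMs) with supervised methods built on engineered features for cognitive impairment detection in English, Slovene, and Korean. Supervised methods generally perform better when linguistic features are combined with embeddings, while zero-shot LLMs serve as competitive no-training baselines.

Abstract translation (Chinese)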

我们评估了基于英语、斯洛文尼亚语和韩语语音转录文本的认知障碍（CI）分类。我们比较了三种输入设置下作为直接分类器的零样本大语言模型（LLM）——仅文本转录、仅语言特征、以及两者结合——与在留一法协议下训练的监督式表格模型。表格模型基于人工设计的语言特征、文本嵌入向量，以及两种模态的早期或晚期融合进行工作。跨语言实验表明，零样本LLM提供了具有竞争力的无训练基线，但监督式表格模型通常表现更优，尤其是在结合了人工设计的语言特征与嵌入向量时。针对嵌入向量的少样本实验表明，有限监督的价值具有语言依赖性：某些语言能从额外的标注样本中显著获益，而其他语言在缺乏更丰富特征表征的情况下仍受限制。总体而言，结果表明，在小数据量的CI检测中，结构化的语言信号与基于融合的简单分类器依然是强大且可靠的信号来源。

Abstract

We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings – transcript-only, linguistic-features-only, and combined – with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable signals.

Keywords: cognitive impairment detection, large language models, zero-shot classification, multilingual analysis, linguistic features, speech transcripts, supervised tabular models, embedding fusion
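
The difference between early and late fusion under a leave-one-out protocol can be shown with a minimal sketch. Everything here is an assumption for illustration: the nearest-centroid scorer replaces whatever tabular models the authors used, and the four toy samples are invented, not drawn from the paper's data.

```python
def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score(train_x, train_y, x):
    """Score in [0, 1] for class 1; higher means closer to the class-1 centroid."""
    c0 = centroid([r for r, y in zip(train_x, train_y) if y == 0])
    c1 = centroid([r for r, y in zip(train_x, train_y) if y == 1])
    d0, d1 = dist2(x, c0), dist2(x, c1)
    return d0 / (d0 + d1) if d0 + d1 else 0.5


def loo_accuracy(views, labels, fusion):
    """Leave-one-out accuracy with 'early' or 'late' fusion of feature views."""
    n = len(labels)
    correct = 0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        y_train = [labels[j] for j in keep]
        if fusion == "early":
            # Early fusion: concatenate the views, then train one model.
            rows = [sum((v[j] for v in views), []) for j in range(n)]
            s = score([rows[j] for j in keep], y_train, rows[i])
        else:
            # Late fusion: one model per view, then average their scores.
            per_view = [score([v[j] for j in keep], y_train, v[i]) for v in views]
            s = sum(per_view) / len(per_view)
        correct += int((s >= 0.5) == bool(labels[i]))
    return correct / n


# Toy data: one engineered feature and a 2-d embedding per transcript (invented).
features = [[0.1], [0.2], [0.9], [0.8]]
embeddings = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
labels = [0, 0, 1, 1]

print(loo_accuracy([features, embeddings], labels, "early"))
print(loo_accuracy([features, embeddings], labels, "late"))
```

Leave-one-out suits very small datasets because every sample is used for testing exactly once while all remaining samples are used for training.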

155. ❌ How Long Reasoning Chains Influence LLMs’ Judgment of Answer Factuality

Authors: Minzhu Tu, Shiyu Ni, Keping Bi Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06756v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

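Scoring rationale: The paper studies how reasoning chains (Chain of Thought) affect the factuality judgments of LLMs used as evaluators, so it is highly relevant to 'Large Language Models', 'Chain of Thought', and 'Hallucination Mitigation' (10 each). It involves evaluating in-depth reasoning, which gives some relevance to 'System 2 Thinking' (8). It examines the limitations of LLM judgment, which links indirectly to 'Self-Correction' and 'Mechanistic Interpretability' (5 each). Other keywords such as MoE, quantization, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the presence of reasoning chains affects LLM judgments of answer factuality. Weak judges are easily misled by fluent but incorrect reasoning, strong judges can partly use reasoning as evidence, and even strong judges can be fooled by reasoning chains that only look high-quality.

Abstract translation (Chinese)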

大型语言模型（LLM）已被广泛采用作为人类评估的可扩展替代方案，然而此类评判者仍不完善，易受表层偏见影响。一个可能的原因是这些评判者在评估答案正确性时缺乏充分信息。随着具备推理能力的模型兴起，将生成器的推理内容暴露给评判者提供了更丰富的信息，自然成为提升评判准确性的潜在途径。然而，其对评判者行为的实际影响仍未得到充分研究。本文系统性地探究了获取推理链如何影响基于LLM的评判，涵盖事实问答（QA）和数学推理基准测试。我们发现，弱评判者极易受推理内容存在的影响，频繁接受伴随流畅推理的错误答案；而强评判者能部分地将推理作为信息证据加以利用。尽管如此，即使强评判者也会被看似高质量的推理链误导。受控实验进一步揭示，推理链的流畅性与事实性均是驱动评判者决策的关键信号。这些发现表明，在评估现代推理模型时，需要更具鲁棒性的LLM评判者，能够区分真正的推理质量与表面的流畅性。

Abstract

Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information in assessing answer correctness. With the rise of reasoning-capable models, exposing a generator’s reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.

Keywords: Large Language Models, reasoning chains, factuality judgment, evaluation accuracy, surface-level biases, mathematical reasoning, question answering, judge behavior

156. ❌ Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

Authors: Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin, Li Kang, Xiufeng Song, Rui Li, Songtao Huang, Ao Yu, Yuchen Fan, Yanxu Chen, Kaixin Xu, Xiaohong Liu, Yiran Qin, Philip Torr, Chen Zhang, Zhenfei Yin Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

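Scoring rationale: The paper compares LLM agent performance across reasoning paradigms (such as CoT, ReAct, Plan-Execute, Reflection, and ReCode) and proposes an embedding-based router for paradigm selection. It is therefore highly relevant to 'Large Language Models', 'Chain of Thought', and 'LLM Agents' (10 each), and partly relevant to 'System 2 Thinking', 'Self-Correction', and 'Tool Use' (5 each), because these paradigms involve in-depth reasoning, self-reflection, and tool use. Other keywords are not covered in the paper or are only indirect background, and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how LLM agents' performance under different reasoning paradigms is complementary, and proposes a lightweight embedding-based router that selects a paradigm per task at inference time, raising average accuracy across four models from 47.6% to 53.1%.

Abstract translation (Chinese)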

当基于大语言模型的智能体在任务上取得提升时，这种增益究竟源于模型本身，还是源于包裹在其外部的推理范式？为探究此问题，我们比较了六种推理时范式，即Direct、CoT、ReAct、Plan-Execute、Reflection与ReCode，在四种前沿大语言模型和十个基准测试上进行了约18,000次运行。研究发现，推理结构在某些任务上能显著提升性能，而在另一些任务上却会产生负面影响：在GAIA上，ReAct相比Direct提升了44个百分点，而在HumanEval上，CoT导致性能下降15个百分点。没有任何单一范式占据绝对优势，而按任务进行理想选择（oracle per-task selection）平均比最佳固定范式高出17.1个百分点。受这种互补性的启发，我们提出一种“先选择后求解”的方法：在回答每项任务前，由一个轻量级的基于嵌入的路由器选择最合适的推理范式。在四种模型上，该路由器将平均准确率从47.6%提升至53.1%，比最佳固定范式的50.3%高出2.8个百分点，并最多弥补了理想选择差距的37%。相比之下，零样本自我路由仅对GPT-5有效（67.1%），对能力较弱的模型则失效，且均落后于学习型路由器。我们的结果表明，推理范式的选择应当是由学习型路由器针对每项任务做出的决策，而非一种固定的架构选择。

Abstract

When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.

Keywords: LLM agents, reasoning paradigms, CoT, ReAct, Plan-Execute, Reflection, ReCode, paradigm routing
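
A select-then-solve router can be sketched as a nearest-neighbour lookup over task embeddings. The abstract only says the router is lightweight and embedding-based, so the 1-nearest-neighbour rule, the `ParadigmRouter` class, and the toy two-dimensional embeddings below are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ParadigmRouter:
    """Routes a task to the paradigm that worked best on its nearest neighbour."""

    def __init__(self):
        self.examples = []

    def fit(self, embeddings, best_paradigms):
        """Store (task embedding, best paradigm) pairs from past runs."""
        self.examples = list(zip(embeddings, best_paradigms))
        return self

    def route(self, embedding):
        """Return the paradigm label of the most similar stored task."""
        _, paradigm = max(self.examples, key=lambda ex: cosine(embedding, ex[0]))
        return paradigm


# Toy training set: embeddings and the paradigm that scored best (invented).
router = ParadigmRouter().fit(
    embeddings=[[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]],
    best_paradigms=["ReAct", "ReAct", "Direct"],
)
print(router.route([0.8, 0.1]))
print(router.route([0.1, 0.9]))
```

The selected label would then decide which prompting scaffold wraps the model for that task, so routing adds one embedding lookup rather than an extra model call.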

157. ❌ TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Authors: Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06734v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

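Scoring rationale: The paper's main contribution is TEC, a dataset recording human trial-and-error processes, intended to support the development of more capable AI systems. Keyword relevance: (1) 'Large Language Models' scores 5 because the paper compares human performance with LLMs and finds humans more effective, but LLMs are not the core of the paper; (2) 'Self-Correction' scores 5 because the dataset includes reflections written after error feedback, which touches on self-improvement; (3) 'LLM Agents' scores 5 because the paper refers to AI systems operating in real-world environments, which relates to agents. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper builds the TEC dataset of human trial-and-error trajectories, finds that humans are more effective than LLMs at solving problems through trial and error, and offers a data foundation for developing more capable AI systems.

Abstract translation (Chinese)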

试错是人类解决复杂问题的基本策略，也是人工智能系统在现实环境中运行的必要能力。尽管近期已有若干试错式人工智能技术被提出，但其中大多数依赖于研究者设计的简单启发式方法，性能提升有限。核心问题在于缺乏合适的数据：当前模型无法从人类实际试错过程的详细记录中学习。为填补这一空白，我们引入了一个数据标注平台及相应数据集，称为试错行为收集库。该平台记录用户在多次尝试中的完整操作轨迹，并收集其在收到错误反馈后的反思。利用此平台，我们记录了46名参与者在58项任务中的问题解决过程，最终获得5,370条试错轨迹以及跨越41,229个网页的错误反思。基于该数据集，我们观察到人类相比大语言模型实现了显著更高的准确率，这证明人类在试错过程中比大语言模型更为高效。我们相信，试错行为收集库平台与数据集为理解人类试错行为及开发更强大的人工智能系统提供了宝贵基础。平台与数据集已公开提供。

摘要 (Abstract)

Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users’ complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.

Keywords: Trial-and-error, Problem Solving, Human Trajectories, Dataset, AI Systems, LLMs, Error Reflection, Webpages
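The abstract describes each record as a sequence of trials with a reflection written after error feedback. A minimal sketch of how such a record might be represented is shown here; the class names, field names, and sample values are hypothetical and are not taken from the released TEC dataset.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trial:
    """One attempt at a task: the actions taken and what happened."""
    actions: list[str]       # e.g. pages visited or UI actions, in order
    correct: bool            # whether this attempt solved the task
    reflection: str = ""     # written after error feedback; empty on success


@dataclass
class TaskRecord:
    """All attempts by one participant on one task (hypothetical layout)."""
    participant_id: str
    task_id: str
    trials: list[Trial] = field(default_factory=list)

    def trials_to_success(self) -> Optional[int]:
        """1-based index of the first correct trial, or None if never solved."""
        for n, trial in enumerate(self.trials, start=1):
            if trial.correct:
                return n
        return None


# Invented example: a failed first attempt with a reflection, then a success.
record = TaskRecord(
    participant_id="p01",
    task_id="t07",
    trials=[
        Trial(["open search page", "submit answer"], False, "Misread the date filter."),
        Trial(["open search page", "set date filter", "submit answer"], True),
    ],
)
```

Keeping the reflection on the failed trial, rather than on the task, preserves which error each reflection responds to.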

158. ❌ Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation

Authors: Jianing Zhang, Runan Li, Honglin Pang, Ding Xia, Zhou Zhu, Qian Zhang, Chuntao Li, Xi Yang Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

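Scoring rationale: The paper proposes an agent-driven VLM framework for interpreting oracle bone script. Its core is an LLM-based agent that runs a reasoning chain (component identification, graph-based knowledge retrieval, relationship inference), which is highly relevant to 'LLM Agents', 'Chain of Thought', and 'System 2 Thinking' (10 each). The framework involves knowledge retrieval, which gives it some relevance to 'Retrieval-Augmented Generation' (5). Oracle bone script interpretation is scored as an AI for Science application (10). The paper explicitly uses an LLM, so it is highly relevant to 'Large Language Models' (10). The framework may involve tool use (5). Other keywords such as MoE, SFT, and quantization are not addressed (0).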

!!! tip deepseek-chat TL;DR

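The study proposes an agent-driven vision-language model framework that addresses the 'interpretation gap' in oracle bone script interpretation through a reasoning chain of component identification, knowledge retrieval, and relationship inference. On three benchmarks it produces more detailed and precise interpretations than baseline methods.

Abstract translation

Interpreting ancient Chinese oracle bone script is a highly challenging task that offers important clues for understanding ancient beliefs, institutions, and culture. Existing methods treat decipherment as a closed-set image recognition problem, which struggles to bridge the "interpretation gap": although individual characters are usually unique and rare, they are built from a limited set of recurring pictographic components that carry transferable semantic meaning. To exploit this structural logic, we propose an agent-driven vision-language model framework. It combines a VLM for precise visual grounding with an LLM-based agent that automatically runs a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference, in order to reach linguistically accurate interpretations. To support the framework, we also introduce OB-Radix, an expert-annotated dataset that provides structural and semantic data missing from earlier corpora. It contains 1,022 character images (covering 934 unique characters) and 1,853 fine-grained component images in 478 distinct categories, all with verified explanations. By evaluating our system on three benchmarks covering different tasks, we show that the framework yields more detailed and more precise decipherments than baseline methods.

Abstract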

Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.

Keywords: Oracle Bone Script, Vision-Language Model, LLM-based agent, component identification, knowledge retrieval, reasoning chain, OB-Radix dataset, ancient script decipherment
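The three-step reasoning chain named in the abstract (component identification, graph-based knowledge retrieval, relationship inference) can be illustrated with a toy pipeline. Everything here is a stand-in: the component ids, glosses, and relations are placeholders rather than OB-Radix data, and dictionary lookups replace the VLM and the LLM-based agent.

```python
from dataclasses import dataclass

# Placeholder knowledge graph: component ids, glosses, and relations are
# invented for illustration and are NOT taken from OB-Radix.
COMPONENT_GLOSS = {"comp_a": "gloss A", "comp_b": "gloss B"}
COMPONENT_RELATION = {("comp_a", "comp_b"): "placed above"}


@dataclass
class Reading:
    components: list[str]
    glosses: dict[str, str]
    relations: list[tuple[str, str, str]]


def identify_components(image_id: str) -> list[str]:
    """Stand-in for VLM visual grounding: a fixed lookup table."""
    detections = {"char_001": ["comp_a", "comp_b"]}
    return detections.get(image_id, [])


def retrieve_glosses(components: list[str]) -> dict[str, str]:
    """Graph-based retrieval reduced to a dictionary lookup."""
    return {c: COMPONENT_GLOSS[c] for c in components if c in COMPONENT_GLOSS}


def infer_relations(components: list[str]) -> list[tuple[str, str, str]]:
    """Collect every known relation between pairs of detected components."""
    return [
        (a, COMPONENT_RELATION[(a, b)], b)
        for a in components
        for b in components
        if (a, b) in COMPONENT_RELATION
    ]


def interpret(image_id: str) -> Reading:
    """Run the three steps in order and bundle the evidence."""
    components = identify_components(image_id)
    return Reading(components, retrieve_glosses(components), infer_relations(components))


reading = interpret("char_001")
```

Returning the intermediate evidence alongside the components keeps each step inspectable, which is the point of decomposing a character instead of classifying it as a whole.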

159. ❌ Feedback Adaptation for Retrieval-Augmented Generation

Authors: Jihwan Bang, Seunghan Yang, Kyuhong Shim, Simyung Chang, Juntae Lee, Sungha Choi Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06647v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

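Scoring rationale: The paper studies feedback adaptation in RAG systems, proposing new evaluation metrics (correction lag, post-feedback performance) and PatchRAG, a method that needs no retraining. It is highly relevant to 'Retrieval-Augmented Generation' (10) because RAG is the core object of study; relevant to 'Large Language Models' (8) because RAG systems are typically built on LLMs; and moderately relevant to 'Self-Correction' (5) because feedback adaptation involves a system correcting itself in response to feedback. The other keywords have no direct connection to the paper.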

!!! tip deepseek-chat TL;DR

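The paper studies how retrieval-augmented generation (RAG) systems can adapt effectively to feedback in interactive settings. It proposes two new metrics for evaluating feedback adaptation (correction lag and post-feedback performance) and develops PatchRAG, a method that requires no retraining and achieves immediate correction with good generalization.

Abstract translation

Retrieval-augmented generation systems are usually evaluated under static assumptions, even though in deployment they are often corrected through user or expert feedback. Existing evaluation protocols focus on overall accuracy and fail to capture how a system adapts once feedback is introduced. We propose feedback adaptation as a problem setting for RAG systems, asking how effectively and how quickly corrective feedback carries over to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which measures the delay between the provision of feedback and a change in behavior, and post-feedback performance, which measures the system's reliability on semantically related queries after feedback. Using these metrics, we find that training-based approaches face a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a lightweight inference-time instantiation that incorporates feedback without retraining and that shows immediate correction and strong post-feedback generalization under the proposed evaluation. Our results reveal feedback adaptation as an important, previously overlooked dimension of RAG system behavior in interactive settings.

Abstract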

Retrieval-Augmented Generation (RAG) systems are typically evaluated under static assumptions, despite being frequently corrected through user or expert feedback in deployment. Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced. We introduce feedback adaptation as a problem setting for RAG systems, which asks how effectively and how quickly corrective feedback propagates to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which captures the delay between feedback provision and behavioral change, and post-feedback performance, which measures reliability on semantically related queries after feedback. Using these metrics, we show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a minimal inference-time instantiation that incorporates feedback without retraining, demonstrating immediate correction and strong post-feedback generalization under the proposed evaluation. Our results highlight feedback adaptation as a previously overlooked dimension of RAG system behavior in interactive settings.

Keywords: Retrieval-Augmented Generation, RAG, feedback adaptation, correction lag, post-feedback performance, PatchRAG, evaluation metrics, interactive systems
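The two evaluation axes in the abstract can be made concrete with a small sketch. The abstract does not give formal definitions, so the ones below are assumptions: correction lag is counted as the number of post-feedback queries answered incorrectly before the first correct one, and post-feedback performance is plain accuracy on semantically related queries.

```python
from typing import Optional


def correction_lag(outcomes: list[bool]) -> Optional[int]:
    """Queries answered wrongly after feedback before the first correct answer.

    `outcomes` lists, in order, whether each post-feedback query on the
    corrected fact was answered correctly. Returns None if the system
    never adopts the correction. (Assumed definition.)
    """
    for i, ok in enumerate(outcomes):
        if ok:
            return i
    return None


def post_feedback_performance(related_outcomes: list[bool]) -> float:
    """Accuracy on semantically related queries issued after feedback."""
    if not related_outcomes:
        return 0.0
    return sum(related_outcomes) / len(related_outcomes)


# Invented outcome sequences for illustration.
lag = correction_lag([False, False, True, True])
accuracy = post_feedback_performance([True, True, False, True])
```

Under these definitions a method that incorporates feedback at inference time would have a lag of 0, while a method that waits for a retraining cycle would show a positive lag.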

160. ❌ DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

Authors: Caleb Zheng, Jyotika Singh, Fang Tu, Weiyi Sun, Sujeeth Bharadwaj, Yassine Benajiba, Sujith Ravi, Eli Shlizerman, Dan Roth Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

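Scoring rationale: The paper studies prompt compression for large language models (LLMs). Its diffusion-based framework, DiffuMask, performs parallel token-level prompt pruning to speed up in-context learning and chain-of-thought reasoning. It is therefore highly relevant (10 each) to 'Large Language Models OR LLMs OR Foundation Models', 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning', and 'In-context Learning OR Many-shot Learning', which are central to the paper. Other keywords such as MoE, SLMs, alignment, RAG, and quantization are either not mentioned in the abstract or unrelated to the topic, and score 0.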

!!! tip deepseek-chat TL;DR

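The paper proposes DiffuMask, a diffusion-based framework for fast, parallel prompt compression that shortens in-context learning and chain-of-thought prompts for LLMs while maintaining or improving reasoning accuracy.

Abstract translation

In-context learning and chain-of-thought prompting improve the reasoning ability of large language models, but they usually require longer, more expensive prompts that may contain redundant information. Pruning-based prompt compression offers a practical solution, yet existing methods rely on sequential token removal, which is computationally expensive. This paper presents DiffuMask, a diffusion-based framework that integrates hierarchical shot-level and token-level pruning signals and performs fast, parallel prompt pruning through iterative mask prediction. By masking multiple tokens at each denoising step, DiffuMask substantially accelerates compression. The framework offers tunable control over what is retained, preserving essential reasoning context while reducing prompt length by up to 80%. At the same time, accuracy is maintained or improved in in-domain, out-of-domain, and cross-model settings. The results show that DiffuMask provides a generalizable and controllable framework for prompt compression, supporting faster and more reliable in-context reasoning in LLMs.

Abstract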

In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal which is computationally intensive. We present DiffuMask, a diffusion-based framework integrating hierarchical shot-level and token-level pruning signals, that enables rapid and parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process via masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.

Keywords: Diffusion Language Model, Prompt Pruning, In-Context Learning, Chain-of-Thought, Token-level Compression, Large Language Models, Reasoning Acceleration, Parallel Mask Prediction
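The contrast the abstract draws between sequential token removal and masking several tokens per step can be sketched as follows. This is not DiffuMask: the learned mask predictor is replaced by a fixed list of importance scores, and the function name, the even split of removals across steps, and the example values are all assumptions made for illustration.

```python
def prune_in_parallel(tokens: list[str], scores: list[float],
                      keep_ratio: float = 0.2, steps: int = 4) -> list[str]:
    """Drop several low-scoring tokens per step until keep_ratio remains.

    A sequential pruner would remove one token per step; here each step
    masks a batch of tokens, so the number of steps stays fixed.
    """
    target = max(1, round(len(tokens) * keep_ratio))
    kept = list(range(len(tokens)))
    for step in range(steps):
        remaining_steps = steps - step
        # Spread the remaining removals evenly over the remaining steps.
        to_remove = -(-(len(kept) - target) // remaining_steps)  # ceiling division
        if to_remove <= 0:
            break
        drop = set(sorted(kept, key=lambda i: scores[i])[:to_remove])
        kept = [i for i in kept if i not in drop]
    return [tokens[i] for i in kept]


# Invented prompt of 10 tokens whose importance grows with position.
tokens = [f"t{i}" for i in range(10)]
scores = [float(i) for i in range(10)]
pruned = prune_in_parallel(tokens, scores, keep_ratio=0.2, steps=4)
```

With `keep_ratio=0.2` the prompt shrinks by 80%, the upper end of the reduction reported in the abstract, and it does so in 4 steps where a one-token-at-a-time pruner would need 8.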

161. ❌ The Detection–Extraction Gap: Models Know the Answer Before They Can Say It

Authors: Hanyang Wang, Mingxuan Zhu Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

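Scoring rationale: The paper studies how modern reasoning models (in particular LLMs that use chain-of-thought) keep generating text after the answer is already determined, introduces the detection–extraction gap, and develops Black-box Adaptive Early Exit (BAEE) to stop generation early. Relevant keywords: (1) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10): the paper directly studies the behavior of chain-of-thought reasoning models; (2) 'Large Language Models OR LLMs OR Foundation Models' (10): the reasoning models studied are modern LLMs; (3) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (8): it analyzes in-depth reasoning processes; (4) 'Speculative Decoding OR Inference Acceleration' (8): BAEE substantially cuts the amount of generation and so accelerates inference; (5) 'Self-Correction OR Self-Improvement OR Self-Reflection' (5): it touches on the models' potential for self-improvement; (6) 'Mechanistic Interpretability OR Explainable AI' (5): it analyzes the models' internal reasoning mechanisms. The other keywords have no direct connection to the paper.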

!!! tip deepseek-chat TL;DR

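The paper finds that modern reasoning models keep generating large amounts of chain-of-thought text after the answer is already determined. It introduces the detection–extraction gap and develops Black-box Adaptive Early Exit, which cuts generation by 70–78% while improving accuracy by 1–5 percentage points.

Abstract translation

Modern reasoning models keep generating at length after the answer is already determined. Analyzing five model configurations, two model families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer can be recovered from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection–extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, whereas forced extraction fails in 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch with a total-variation bound between the free and forced continuation distributions, which gives quantitative estimates of the suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70–78% of serial generation while improving accuracy by 1–5 percentage points across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8 percentage points; a cost-optimized variant achieves a 68–73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.

Abstract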

Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection–extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70–78% of serial generation while improving accuracy by 1–5 pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8 pp; a cost-optimized variant achieves 68–73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.

Keywords: detection-extraction gap, chain-of-thought, reasoning models, early exit, generation efficiency, model inference, answer recovery, post-commitment generation
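The idea of using free continuations for both detection and extraction can be sketched as a black-box loop. The abstract does not state BAEE's exit rule, so this is only one plausible reading under stated assumptions: at a few prefix checkpoints, sample `k` free continuations, and stop as soon as their final answers agree strongly enough. The `sample_answers` callable, the checkpoint fractions, and the agreement threshold are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional


def early_exit(
    trace: list[str],
    sample_answers: Callable[[list[str], int], list[str]],
    k: int = 5,
    agreement: float = 0.8,
    checkpoints: tuple[float, ...] = (0.1, 0.25, 0.5, 0.75, 1.0),
) -> tuple[Optional[str], int]:
    """Return (answer, tokens_used), exiting at the first confident checkpoint.

    `sample_answers(prefix, k)` must return the final answers of k free
    continuations of `prefix`; it stands in for black-box API calls.
    """
    for fraction in checkpoints:
        cut = max(1, int(len(trace) * fraction))
        answers = sample_answers(trace[:cut], k)
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / k >= agreement:
            # Detection and extraction coincide: the voted answer is returned.
            return answer, cut
    return None, len(trace)


def fake_sampler(prefix: list[str], k: int) -> list[str]:
    """Toy model: continuations disagree until the prefix has 5 tokens."""
    if len(prefix) >= 5:
        return ["42"] * k
    return ["41", "42", "43", "42", "40"][:k]


trace = [f"tok{i}" for i in range(20)]
answer, tokens_used = early_exit(trace, fake_sampler)
```

In this toy run the loop stops after 5 of 20 trace tokens. Each checkpoint costs `k` extra samples, which is the kind of API-call overhead the abstract's cost-optimized variant is said to reduce.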

162. ❌ Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs

Authors: Maotian Ma, Zheni Zeng, Zhenghao Liu, Yukun Yan Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06603v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

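Scoring rationale: The paper studies the use of LLMs in scientific domains and proposes SciDC, which improves LLM reliability through knowledge-driven decoding constraints. It directly concerns 'Large Language Models' (the core object of study), 'Hallucination Mitigation' (it targets hallucination), and 'AI for Science' (it is applied to scientific tasks). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper.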

!!! tip deepseek-chat TL;DR

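To address severe hallucination by LLMs in scientific applications, the paper proposes SciDC, a knowledge-driven decoding-constraint method that converts subject knowledge into multi-layered, standardized rules. It improves accuracy by 12% on average on scientific tasks such as industrial formulation design, clinical tumor diagnosis, and retrosynthesis planning.

Abstract translation

Large language models (LLMs) have shown strong knowledge reserves and task-solving ability, but they still suffer from severe hallucination, which hinders practical use. Although scientific theories and rules can effectively guide the behavior of human operators, LLMs do not yet make sufficient use of this highly condensed knowledge through training or prompting. To address this, we propose SciDC, an LLM generation method that integrates subject-specific knowledge with strong constraints. By using strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, we build an extensible framework that effectively constrains generation on domain tasks. Experiments on scientific tasks, including industrial formulation design, clinical tumor diagnosis, and retrosynthesis planning, consistently demonstrate the method's effectiveness, with an average accuracy improvement of 12% over vanilla generation. We further discuss the potential of LLMs to automatically and inductively summarize highly condensed knowledge, looking ahead to practical solutions for accelerating the overall research process. All code for the paper is publicly available (https://github.com/Maotian-Ma/SciDC).

Abstract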

Large language models (LLMs) have shown strong knowledge reserves and task-solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize these highly-condensed knowledge sufficiently through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrate subject-specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, we build an extensible framework to effectively constrain the model generation on domain tasks. Experiments on scientific tasks including industrial formulation design, clinical tumor diagnosis and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs in automatically inductively summarizing highly-condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper can be obtained (https://github.com/Maotian-Ma/SciDC).

Keywords: Large Language Models, Hallucination Mitigation, Scientific Knowledge, Decoding Constraints, AI for Science, Reliability Improvement, Domain Tasks, Scientific Research
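The general idea of constraining generation with explicit domain rules can be illustrated with a simplified stand-in. This sketch does not reproduce SciDC's decoding procedure: it filters whole candidate outputs against rule predicates and then picks the best-scoring survivor, which is a rule-filtered re-ranking rather than token-level constrained decoding. The rules, ingredient names, and scores are invented.

```python
from typing import Callable, Optional

Candidate = dict[str, float]            # ingredient name -> mass fraction
Rule = Callable[[Candidate], bool]      # True when the candidate is allowed


def fractions_sum_to_at_most_one(candidate: Candidate) -> bool:
    """Hypothetical formulation rule: mass fractions cannot exceed 100%."""
    return sum(candidate.values()) <= 1.0


def no_negative_fraction(candidate: Candidate) -> bool:
    """Hypothetical formulation rule: every fraction is non-negative."""
    return all(value >= 0.0 for value in candidate.values())


def constrained_choice(
    scored: list[tuple[Candidate, float]], rules: list[Rule]
) -> Optional[Candidate]:
    """Highest-scoring candidate that satisfies every rule, or None."""
    valid = [(c, s) for c, s in scored if all(rule(c) for rule in rules)]
    if not valid:
        return None
    return max(valid, key=lambda pair: pair[1])[0]


rules = [fractions_sum_to_at_most_one, no_negative_fraction]
scored_candidates = [
    ({"resin": 0.7, "solvent": 0.6}, 0.9),   # preferred by the model, breaks a rule
    ({"resin": 0.5, "solvent": 0.5}, 0.6),
    ({"resin": 0.4, "solvent": 0.3}, 0.2),
]
choice = constrained_choice(scored_candidates, rules)
```

The model's top-scoring candidate is rejected because its fractions sum to more than 1, so the second candidate is returned. Expressing rules as predicates keeps them checkable independently of the model that proposes candidates.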

163. ❌ Scoring Edit Impact in Grammatical Error Correction via Embedded Association Graphs

Authors: Qiyuan Xiao, Xiaoman Wang, Yunshi Lan Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06573v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

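Scoring rationale: The paper studies automatic scoring of edit impact in grammatical error correction (GEC) systems and proposes a framework based on an embedded association graph. It focuses on GEC evaluation, a natural language processing task, and relies on the association graph together with perplexity-based scoring. All of the given keywords concern directions such as large models, deep-learning techniques, and scientific applications of AI, whereas this paper addresses evaluation for grammatical error correction and does not involve large models, new deep-learning techniques, or applications of AI in science, so it is judged unrelated to every keyword.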

!!! tip deepseek-chat TL;DR

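The paper proposes a framework based on an embedded association graph to automatically estimate the importance of edits in grammatical error correction systems; experiments show that it outperforms baselines across multiple datasets and languages.

Abstract translation

A grammatical error correction (GEC) system corrects an erroneous sentence by producing a sequence of edits. The quality of these edits is usually evaluated against human annotations. However, a sentence may have several valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios. Recent meta-evaluation approaches rely on human judgments across multiple references, but they are difficult to scale to large datasets. This paper proposes a new task, Scoring Edit Impact in GEC, which aims to automatically estimate the importance of the edits a GEC system produces. For this task, we introduce a scoring framework based on an embedded association graph. The graph captures latent dependencies among edits and syntactically related edits and groups them into coherent clusters. We then apply perplexity-based scoring to estimate each edit's contribution to sentence fluency. Experiments on 4 GEC datasets, 4 languages, and 4 GEC systems show that our method consistently outperforms a range of baselines. Further analysis shows that the embedded association graph effectively captures structural dependencies among edits across languages.

Abstract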

A Grammatical Error Correction (GEC) system produces a sequence of edits to correct an erroneous sentence. The quality of these edits is typically evaluated against human annotations. However, a sentence may admit multiple valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios. Recent meta-evaluation approaches rely on human judgments across multiple references, but they are difficult to scale to large datasets. In this paper, we propose a new task, Scoring Edit Impact in GEC, which aims to automatically estimate the importance of edits produced by a GEC system. To address this task, we introduce a scoring framework based on an embedded association graph. The graph captures latent dependencies among edits and syntactically related edits, grouping them into coherent groups. We then perform perplexity-based scoring to estimate each edit’s contribution to sentence fluency. Experiments across 4 GEC datasets, 4 languages, and 4 GEC systems demonstrate that our method consistently outperforms a range of baselines. Further analysis shows that the embedded association graph effectively captures cross-linguistic structural dependencies among edits.

Keywords: Grammatical Error Correction, Edit Impact Scoring, Embedded Association Graph, Perplexity-based Scoring, Cross-linguistic Dependencies, Meta-evaluation, Sentence Fluency, GEC Evaluation
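The perplexity-based scoring step can be illustrated in isolation with a leave-one-out sketch: an edit's impact is taken as how much sentence perplexity rises when that edit alone is withheld. This is an assumed formulation, not the paper's; it skips the association-graph grouping entirely, and the unigram log-probabilities below are invented stand-ins for a real language model.

```python
import math

# Toy unigram log-probabilities; invented values standing in for a real
# language model. Unknown tokens get a low default score.
TOY_LOGPROB = {"he": -2.0, "goes": -3.0, "go": -6.0, "to": -1.5, "school": -4.0}
UNKNOWN_LOGPROB = -10.0

Edit = tuple[int, str]  # (token index, replacement token)


def perplexity(tokens: list[str]) -> float:
    """exp of the negative mean token log-probability."""
    total = sum(TOY_LOGPROB.get(token, UNKNOWN_LOGPROB) for token in tokens)
    return math.exp(-total / len(tokens))


def apply_edits(tokens: list[str], edits: list[Edit]) -> list[str]:
    """Apply substitution edits to a copy of the token list."""
    corrected = list(tokens)
    for index, replacement in edits:
        corrected[index] = replacement
    return corrected


def edit_impact(source: list[str], edits: list[Edit]) -> dict[Edit, float]:
    """Perplexity increase caused by withholding each edit in turn."""
    full = perplexity(apply_edits(source, edits))
    impact = {}
    for edit in edits:
        others = [e for e in edits if e != edit]
        impact[edit] = perplexity(apply_edits(source, others)) - full
    return impact


source = ["he", "go", "to", "scool"]
edits = [(1, "goes"), (3, "school")]
impact = edit_impact(source, edits)
```

Under this toy model the spelling fix matters more than the agreement fix, because leaving "scool" in place hurts fluency most. Scoring edits one at a time ignores interactions between them, which is the gap the paper's grouping of related edits is meant to address.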

164. ❌ LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Authors: Joshua Castillo, Ravi Mukkamala Journal/source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

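Scoring rationale: The paper studies the use of LLMs for information extraction and validation, an application of large models in a specific domain. It explicitly describes a Large Language Model (LLM)-assisted extraction pathway, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). It applies AI to investigative analysis, which gives it some relevance to 'AI for Science OR Bioinformatics OR Cheminformatics' (5), although it is not a core bioinformatics or cheminformatics application. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are unrelated, and score 0.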

!!! tip deepseek-chat TL;DR

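The paper presents the Guardian Parser Pack, a parsing and normalization pipeline with an optional LLM-assisted pathway, for extracting and validating missing-person intelligence from heterogeneous investigative documents. In experiments, the LLM-assisted pathway substantially improved extraction quality over the deterministic pathway (F1 of 0.8664 vs. 0.2578) and key-field completeness (96.97% vs. 93.23%), but ran more slowly.

Abstract translation

Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Differences in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that converts multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The system integrates (i) multi-engine PDF text extraction with an optical character recognition (OCR) fallback; (ii) rule-based source identification with source-specific parsers; (iii) schema-first harmonization and validation; and (iv) an optional large language model (LLM)-assisted extraction pathway that incorporates validator-guided repair and shared geocoding services. We describe the system architecture, key implementation decisions, and output design, and evaluate performance with both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 of 0.8664 vs. 0.2578); across the 517 records parsed by each pathway, it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime of 0.03 s per record vs. 3.95 s for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair acted as a built-in safeguard rather than as a contributor to the observed gains. These results support the controlled use of probabilistic AI within a schema-first, auditable pipeline in high-stakes investigative settings.

Abstract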

Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.

Keywords: Large Language Model, information extraction, schema validation, heterogeneous data, missing-person investigations, AI-driven pipeline, data normalization, OCR
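
The abstract above describes a schema-first pipeline in which LLM output is validated and, if needed, sent through validator-guided repair. The following is a minimal sketch of that control flow only, not the authors' implementation: the schema fields, `stub_extract`, `stub_repair`, and the sample record are all hypothetical stand-ins, and a real system would call an LLM where the stubs are.

```python
from typing import Callable

# Hypothetical minimal schema: field name -> expected Python type.
SCHEMA = {"name": str, "age": int, "last_seen_city": str}


def validate(record: dict) -> list[str]:
    """Return a list of schema violations (empty when the record is valid)."""
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


def extract_with_repair(
    text: str,
    extract: Callable[[str], dict],
    repair: Callable[[dict, list[str]], dict],
    max_rounds: int = 2,
) -> tuple[dict, bool]:
    """Extract a record, then let the validator drive up to max_rounds checks."""
    record = extract(text)
    for _ in range(max_rounds):
        errors = validate(record)
        if not errors:
            return record, True  # valid: no (further) repair needed
        record = repair(record, errors)
    return record, not validate(record)


def stub_extract(text: str) -> dict:
    # Stand-in for an LLM call; deliberately returns age as a string.
    return {"name": "Jane Doe", "age": "14", "last_seen_city": "Springfield"}


def stub_repair(record: dict, errors: list[str]) -> dict:
    # Stand-in for a repair prompt; here it only coerces age to int.
    fixed = dict(record)
    if any(e.startswith("age") for e in errors):
        fixed["age"] = int(fixed["age"])
    return fixed


record, ok = extract_with_repair("raw case text", stub_extract, stub_repair)
print(ok, record["age"])
```

When the first extraction already satisfies the schema, the loop returns before any repair call, which matches the abstract's observation that repair acted as a safeguard rather than a source of the measured gains.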

165. ❌ To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

Authors: Zohaib Khan, Mustafa Dogan, Ifeoma Okoh, Pouya Sadeghi, Siddhartha Shrestha, Sergius Justus Nyah, Mahmoud O. Mokhiamar, Michael J. Ryan, Tarek Naous Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

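Scoring rationale: The paper's core subject is how LLMs behave when generating and spreading misinformation, so it is highly relevant to 'Large Language Models' (10 points). It also involves misinformation detection and fact-checking, making it highly relevant to 'Hallucination Mitigation' (10 points). It discusses retrieval-augmented fact-checking, which gives it partial relevance to 'Retrieval-Augmented Generation' (5 points). Other keywords such as MoE, SLMs, alignment, and inference acceleration are not covered and score 0.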

!!! tip deepseek-chat TL;DR

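The paper studies systematic bias in how LLMs generate and spread misinformation across languages and target countries. It finds that LLMs propagate falsehoods at substantially higher rates in many lower-resource languages and for countries with a lower Human Development Index, and that existing mitigation strategies protect unevenly across languages and regions.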

Abstract translation (Chinese)

虚假信息正呈上升趋势，而大型语言模型强大的文本生成能力降低了恶意行为者制造和传播虚假信息的门槛。本研究探讨了当被要求针对不同语言和目标国家生成虚假信息时，大型语言模型的表现，并引入了GlobalLies——一个包含440个虚假信息生成提示模板和6,867个实体的多语言平行数据集，涵盖8种语言和195个国家。通过对顶尖模型生成的数十万条内容进行人工标注和大规模“LLM即评判员”评估，我们发现虚假信息的生成模式会根据所讨论的国家呈现系统性差异。在许多资源匮乏的语言中，以及人类发展指数较低的国家，大型语言模型传播谎言的比率显著更高。研究表明，现有的缓解策略提供的保护并不均衡：输入安全分类器存在跨语言差距，而检索增强的事实核查因信息可及性不均，在不同地区仍存在不一致性。我们公开发布GlobalLies数据集以供研究使用，旨在支持开发缓解策略以减少全球虚假信息的传播：https://github.com/zohaib-khan5040/globallies

Abstract

Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: https://github.com/zohaib-khan5040/globallies

Keywords: misinformation generation, LLM behavior, multilingual dataset, cross-lingual gaps, retrieval-augmented fact-checking, safety classifiers, GlobalLies dataset, human development index

166. ❌ CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

Authors: Chang Liu, Changsheng Ma, Yongfeng Tao, Bin Hu, Minqiang Yang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

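Scoring rationale: The paper's core subject is applying large language models to mental health, specifically simulating Cognitive Behavioral Therapy (CBT) counselors. It is therefore highly relevant to 'Large Language Models' (10 points), since it explicitly uses LLMs for CBT simulation. It is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), since it fine-tunes models on the released dataset. It is highly relevant to 'LLM Agents' and 'Multi-agent Systems' (10 points each), since it proposes a multi-agent framework with a Control Agent and a Therapist Agent. It is fairly relevant to 'AI for Science' (8 points): mental-health support can be viewed as a sub-area of scientific application, although the paper does not mention bioinformatics or cheminformatics. The reasoning keywords ('Chain of Thought' and 'System 2 Thinking') score 5 points each, because the Therapist Agent must reason from inferred client states, which involves multi-step or in-depth reasoning but is not the paper's core focus. The remaining keywords are unrelated to the paper and score 0.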

!!! tip deepseek-chat TL;DR

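The paper proposes CCD-CBT, a multi-agent framework for simulating CBT counselors that relies on a dynamically reconstructed Cognitive Conceptualization Diagram and information-asymmetric interaction. It also releases CCDCHAT, a synthetic multi-turn dataset; models fine-tuned on it outperform strong baselines in counseling fidelity and positive-affect enhancement.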

Abstract translation (Chinese)

大型语言模型通过模拟认知行为疗法（CBT）咨询师，展现出提供可扩展心理健康支持的潜力。然而，现有方法通常依赖静态的认知剖面和全知型的单智能体模拟，未能捕捉真实治疗中动态且信息不对称的特性。我们提出CCD-CBT这一多智能体框架，该框架从两个维度革新了CBT模拟：1）从静态认知概念化图表（Cognitive Conceptualization Diagram, CCD）转向动态重构的CCD，由专用的控制智能体（Control Agent）实时更新；2）从全知型交互转向信息不对称交互，其中治疗师智能体（Therapist Agent）必须基于推断的来访者状态进行推理。我们发布了CCDCHAT——一个基于此框架生成的合成多轮CBT对话数据集。通过临床量表和专家治疗师的评估表明，基于CCDCHAT微调的模型在咨询忠实度和积极情绪提升方面均优于强基线模型，消融实验进一步验证了动态CCD引导与不对称智能体设计的必要性。本研究为构建理论扎实、临床可信的对话智能体提供了新范式。

Abstract

Large language models show potential for scalable mental-health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy. We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient to information-asymmetric interaction, where the Therapist Agent must reason from inferred client states. We release CCDCHAT, a synthetic multi-turn CBT dataset generated under this framework. Evaluations with clinical scales and expert therapists show that models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement, with ablations confirming the necessity of dynamic CCD guidance and asymmetric agent design. Our work offers a new paradigm for building theory-grounded, clinically-plausible conversational agents.

Keywords: Large Language Models, Cognitive Behavioral Therapy, Multi-agent Framework, Cognitive Conceptualization Diagram, Fine-tuning, Mental Health Support, Conversational Agents, Clinical Evaluation

167. ❌ The Illusion of Stochasticity in LLMs

Authors: Xiangming Gu, Soham De, Michalis Titsias, Larisa Markeeva, Petar Veličković, Razvan Pascanu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

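Scoring rationale: The paper's core subject is the failure of LLMs to sample stochastically when operating as agents, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are unrelated to the paper's topic, and score 0.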

!!! tip deepseek-chat TL;DR

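The paper shows that LLMs operating as agents cannot sample reliably from distributions: they fail to map their internal probability estimates to their stochastic outputs, although strong frontier models can convert provided random seeds into target distributions.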

Abstract translation (Chinese)

本研究证明，可靠的随机采样是大型语言模型作为智能体运行时的基本需求，但当前模型尚未满足这一要求。智能体系统经常需要从分布中采样——这些分布通常基于观测数据推断得出，而这一过程需要由大型语言模型进行模拟。这揭示了一个独特的缺陷点：虽然标准的强化学习智能体依赖外部采样机制，但大型语言模型无法将其内部概率估计映射到随机输出中。通过对多种模型系列、模型规模、提示方式和概率分布进行严格实证分析，我们揭示了这一缺陷的严重程度。关键发现表明，尽管前沿大模型能够将给定的随机种子转换为目标分布，但其直接从特定分布中采样的能力存在根本性缺陷。

Abstract

In this work, we demonstrate that reliable stochastic sampling is a fundamental yet unfulfilled requirement for Large Language Models (LLMs) operating as agents. Agentic systems are frequently required to sample from distributions, often inferred from observed data, a process which needs to be emulated by the LLM. This leads to a distinct failure point: while standard RL agents rely on external sampling mechanisms, LLMs fail to map their internal probability estimates to their stochastic outputs. Through rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions, we demonstrate the extent of this failure. Crucially, we show that while powerful frontier models can convert provided random seeds to target distributions, their ability to sample directly from specific distributions is fundamentally flawed.

Keywords: Large Language Models, LLMs, agents, stochastic sampling, probability estimates, agentic systems, empirical analysis, distributions
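
The abstract contrasts sampling directly from a distribution with converting a provided random seed into a draw from the target distribution. One standard way to do the latter is inverse-CDF sampling; the sketch below illustrates that general technique and is not taken from the paper. The function name and the coin-flip distribution are illustrative assumptions.

```python
import bisect
import itertools
import random


def sample_from_uniform(u: float, probs: dict[str, float]) -> str:
    """Map a uniform number u in [0, 1) to an outcome via the inverse CDF."""
    outcomes = list(probs)
    # Cumulative probabilities, e.g. [0.7, 1.0] for a 70/30 coin.
    cdf = list(itertools.accumulate(probs[o] for o in outcomes))
    idx = bisect.bisect_right(cdf, u)
    # Clamp in case floating-point rounding leaves the last CDF entry below 1.
    return outcomes[min(idx, len(outcomes) - 1)]


# The randomness comes from an external source (here a seeded generator);
# the mapping itself is deterministic.
rng = random.Random(0)
probs = {"heads": 0.7, "tails": 0.3}
draws = [sample_from_uniform(rng.random(), probs) for _ in range(10_000)]
print(draws.count("heads") / len(draws))  # empirically close to 0.7
```

The split mirrors the abstract's remark that standard RL agents rely on external sampling mechanisms: the random number is supplied from outside, and only the deterministic mapping has to be carried out.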

168. ❌ MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Authors: Weiyue Li, Ruizhi Qian, Yi Li, Yongce Li, Yunfan Long, Jiahui Cai, Yan Luo, Mengyu Wang Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06505v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

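Scoring rationale: The paper evaluates LLMs in the biomedical domain, so it is highly relevant to both 'AI for Science' and the LLM keyword (10 points each). It focuses on evidence-to-conclusion reasoning by LLMs, which is partially related to 'Chain of Thought' and 'System 2 Thinking' (5 points each). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's technical content (0 points).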

!!! tip deepseek-chat TL;DR

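Noting that resources for testing whether LLMs can infer scientific conclusions from structured biomedical evidence remain limited, the paper introduces MedConclusion, a dataset of 5.7M PubMed structured abstracts. Its experiments find that conclusion writing is behaviorally distinct from summary writing, and that strong models remain closely clustered under current automatic metrics.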

Abstract translation (Chinese)

大型语言模型（LLMs）在推理密集型研究任务中得到了广泛探索，但用于测试其能否从结构化生物医学证据中推断科学结论的资源仍然有限。我们推出了 MedConclusion，这是一个包含 570万篇 PubMed结构化摘要的大规模数据集，用于生物医学结论生成。每个实例将摘要的非结论部分与作者撰写的原始结论配对，为从证据到结论的推理提供了自然存在的监督。MedConclusion还包含期刊层面的元数据，如生物医学类别和SJR（SCImago期刊排名），支持跨生物医学领域的子组分析。作为一项初步研究，我们在结论生成和摘要生成两种提示设置下评估了多种LLMs，并使用基于参考的指标和LLM-as-a-judge（以LLM作为评判者）对输出结果进行评分。我们发现，结论撰写在行为上与摘要撰写存在明显差异，强模型在当前自动指标下仍紧密聚集，而评判者身份会显著改变绝对分数。MedConclusion为研究从科学证据到结论的推理提供了一个可复用的数据资源。我们的代码和数据可在以下网址获取：https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion。

Abstract

Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.

Keywords: Large Language Models, Biomedical Conclusion Generation, Evidence-to-Conclusion Reasoning, Benchmark Dataset, PubMed Abstracts, LLM Evaluation, Scientific Reasoning, Biomedical AI
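
According to the abstract, each MedConclusion instance pairs the non-conclusion sections of a structured abstract with the author-written conclusion. A minimal sketch of that pairing follows. The section labels, the `make_instance` helper, the `input`/`target` field names, and the toy abstract are assumptions for illustration; the released dataset may use a different layout.

```python
from typing import Optional


def make_instance(
    sections: dict[str, str], conclusion_key: str = "CONCLUSIONS"
) -> Optional[dict[str, str]]:
    """Split a structured abstract into (evidence, conclusion)."""
    if conclusion_key not in sections:
        return None  # no conclusion section, so nothing to supervise on
    evidence = "\n".join(
        f"{label}: {text}"
        for label, text in sections.items()
        if label != conclusion_key
    )
    return {"input": evidence, "target": sections[conclusion_key]}


# Toy structured abstract (invented for illustration only).
toy = {
    "BACKGROUND": "Drug A is widely used for condition B.",
    "METHODS": "We ran a randomized trial with 200 participants.",
    "RESULTS": "Symptoms improved in the treatment arm.",
    "CONCLUSIONS": "Drug A may reduce symptoms of condition B.",
}
instance = make_instance(toy)
print(instance["input"])
print("->", instance["target"])
```

Under the conclusion prompting setting a model would be asked to produce `target` from `input`. How the summary setting is prompted is not specified in the abstract.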

169. ❌ Fine-tuning Whisper for Pashto ASR: strategies and scale

Authors: Hanif Rahman Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06507v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

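Scoring rationale: The paper focuses on automatic speech recognition (ASR), studying fine-tuning strategies for Whisper on Pashto; it does not involve large language models (LLMs) or innovations in the underlying principles of deep learning techniques. It is relevant to only two keywords: 1) 'Post-training OR Supervised Fine-tuning OR SFT' (10 points): the core of the paper is a comparison of four fine-tuning strategies (vanilla full fine-tuning and others), which falls under supervised fine-tuning; 2) 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points): it explicitly tests LoRA (rank 64) as a parameter-efficient fine-tuning method. All other keywords are unrelated (0 points): there is no LLM, MoE, scaling-law, alignment, or RAG component, and the work is not a bioinformatics application of AI for Science.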

!!! tip deepseek-chat TL;DR

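Off-the-shelf Whisper models are unusable for Pashto ASR (word error rates above 100%). The paper compares four fine-tuning strategies for whisper-base (vanilla full fine-tuning, LoRA, frozen encoder, and Urdu-to-Pashto transfer), finds that vanilla full fine-tuning gives the lowest WER on CommonVoice Pashto v20 (21.22%), and identifies whisper-small as the practical optimum at 113 hours of data.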

Abstract translation (Chinese)

普什图语虽为CommonVoice最大规模语种集合之一，却未包含在Whisper的预训练语料中，导致现成模型无法直接使用：所有规模的Whisper模型在处理普什图语音频时均输出阿拉伯文、达里文或乌尔都文字符，词错误率超过100%。本研究在CommonVoice Pashto v20数据集上比较了四种针对whisper-base的微调策略：标准全参数微调、LoRA（秩64）、冻结编码器（2/6层）以及多阶段乌尔都语到普什图语迁移学习。我们将标准微调扩展至CommonVoice Pashto v24（113小时）数据集的whisper-small和whisper-large-v3-turbo模型。标准微调在CV20上实现21.22%的词错误率，分别比LoRA、冻结编码器和乌尔都语迁移策略低33.36、14.76和44.56个百分点。冻结编码器微调在whisper-base（6层编码器）上导致性能下降：该深度下无法保持层功能分离特性，且冻结操作削减了三分之一的可训练容量。乌尔都语到普什图语迁移学习因使用未验证的中间检查点、音系失配及训练不足而失败。在CV24数据集上，whisper-small达到24.89%词错误率（参数量为whisper-base的3.3倍，词错误率较whisper-base降低2.24个百分点）；whisper-large-v3-turbo达到23.37%（进一步降低1.52个百分点）。收益递减现象表明在113小时数据规模下，whisper-small是实际最优选择。在线数据增强相比匹配训练带来7.25个百分点的词错误率改善。错误分析发现词尾后缀混淆（阳性-ay与阴性-a）以及涉及普什图语独有辅音/ts/的卷舌音替换是主要错误模式。微调后的检查点与评估脚本已在HuggingFace平台发布。

Abstract

Pashto is absent from Whisper’s pre-training corpus despite being one of CommonVoice’s largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.

Keywords: Whisper, Pashto ASR, fine-tuning, LoRA, CommonVoice, word error rate, transfer learning, speech recognition
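
The abstract reports word error rates above 100% for off-the-shelf Whisper on Pashto. That is possible because WER divides the number of word-level edits (substitutions, deletions, insertions) by the reference length, so a hypothesis that is wrong and longer than the reference can exceed 1.0. The sketch below uses the standard edit-distance definition with placeholder tokens; it is not the paper's evaluation script, and it assumes a non-empty reference.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words against an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Two reference words, four unrelated hypothesis words:
# 2 substitutions + 2 insertions = 4 edits over 2 reference words.
print(wer("a b", "x y z w"))      # 2.0, i.e. 200% WER
print(wer("a b c d", "a x c"))    # 0.5
```

A model that emits the wrong script, as described for off-the-shelf Whisper, matches essentially no reference words, so any extra output tokens push WER past 100%.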

170. ❌ Transformer See, Transformer Do: Copying as an Intermediate Step in Learning Analogical Reasoning

Authors: Philipp Hellwig, Willem Zuidema, Claire E. Stevenson, Martha Lewis Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06501v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

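Scoring rationale: The paper studies how transformers learn an analogical reasoning task; its core contribution is the finding that copying tasks, used as an intermediate step, improve analogical reasoning, together with an interpretability analysis. Relevance by keyword: 1) the paper uses transformer models, which relates to the large-model family, but its connection to frontier LLM techniques is weak (5 points); 2) it studies analogical reasoning, which involves multi-step reasoning and deliberate thinking, so 'Chain of Thought' and 'System 2 Thinking' are rated highly relevant (8 points each); 3) it performs interpretability analyses and identifies an algorithm that approximates the model's computations, so 'Mechanistic Interpretability' is rated highly relevant (8 points); 4) other keywords such as MoE, quantization, and RAG are unrelated (0 points).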

!!! tip deepseek-chat TL;DR

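The paper studies how transformers learn analogical reasoning when copying tasks are included in the training data. It finds that this makes letter-string analogies learnable, that more heterogeneous training data improves generalization to new alphabets, and it uses interpretability analyses to identify an algorithm that approximates the model's computations.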

Abstract translation (Chinese)

类比推理是人类智能的标志，它使我们能够通过将知识从一种情境迁移到另一种情境来解决新问题。然而，开发能够进行稳健类人化类比推理的人工智能系统已被证明是困难的。在本研究中，我们采用元学习组合性方法训练变换器模型，使其完成一项类比推理任务（字母串类比），并评估其泛化能力。我们发现，通过在训练数据中包含复制任务来引导模型关注最具信息量的问题要素时，字母串类比任务变得可学习。此外，当使用更多样化的数据集进行训练时，模型对新字母表的泛化能力会变得更好，我们的三层编码器-解码器模型在此方面超越了大多数前沿模型。元学习组合性方法还能使模型在一定程度上泛化到已训练变换的组合，但无法泛化到完全新颖的变换。为了理解模型的运作机制，我们识别出一种近似模拟模型计算的算法。我们通过可解释性分析验证了这一点，并表明可以根据该算法推导出的预期精确地引导模型行为。最后，我们讨论了本研究结果对更大模型泛化能力的启示，以及与人类类比推理的相似之处。

Abstract

Analogical reasoning is a hallmark of human intelligence, enabling us to solve new problems by transferring knowledge from one situation to another. Yet, developing artificial intelligence systems capable of robust human-like analogical reasoning has proven difficult. In this work, we train transformers using Meta-Learning for Compositionality (MLC) on an analogical reasoning task (letter-string analogies) and assess their generalization capabilities. We find that letter-string analogies become learnable when guiding the models to attend to the most informative problem elements induced by including copying tasks in the training data. Furthermore, generalization to new alphabets becomes better when models are trained with more heterogeneous datasets, where our 3-layer encoder-decoder model outperforms most frontier models. The MLC approach also enables some generalization to compositions of trained transformations, but not to completely novel transformations. To understand how the model operates, we identify an algorithm that approximates the model’s computations. We verify this using interpretability analyses and show that the model can be steered precisely according to expectations derived from the algorithm. Finally, we discuss implications of our findings for generalization capabilities of larger models and parallels to human analogical reasoning.

Keywords: Transformer, Analogical Reasoning, Meta-Learning for Compositionality, Generalization, Interpretability, Copying Task, Letter-string Analogies, Encoder-Decoder Model
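
For readers unfamiliar with the task, a letter-string analogy asks for the string that completes a pattern such as "abc -> abd ; ijk -> ?". The sketch below generates such items, including a copy item and an item over a made-up alphabet. The transformation set, the prompt format, the wrap-around behaviour, and the five-letter alphabet are illustrative assumptions, not the paper's data generator.

```python
import string


def successor_last(s: str, alphabet: str) -> str:
    """Replace the last letter with its successor in the given alphabet."""
    idx = alphabet.index(s[-1])
    # Wrapping around at the end of the alphabet is an assumption of this sketch.
    return s[:-1] + alphabet[(idx + 1) % len(alphabet)]


def copy(s: str, alphabet: str) -> str:
    """Identity transformation: the copying task mentioned in the abstract."""
    return s


def make_example(transform, source: str, query: str, alphabet: str):
    """Build (prompt, answer) for one analogy item."""
    prompt = f"{source} -> {transform(source, alphabet)} ; {query} -> ?"
    return prompt, transform(query, alphabet)


latin = string.ascii_lowercase
print(make_example(successor_last, "abc", "ijk", latin))
print(make_example(copy, "abc", "ijk", latin))

# A hypothetical "new alphabet": same rule, different symbol order.
novel = "xlgmq"
print(make_example(successor_last, "xlg", "lgm", novel))
```

Generalizing to a new alphabet means applying the same successor rule when the ordering of symbols differs from the Latin alphabet, which is what the third example exercises.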

171. ❌ Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Authors: Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi, Shashi Kumar, Hasindri Watawana, Manjunath K E, Kadri Hacioglu, Petr Motlicek, Andreas Stolcke Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06487v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

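Scoring rationale: The paper studies domain adaptation for LLM-based ASR; it centers on the LLM-based architecture, domain adaptation, and fine-tuning, so it is highly relevant to 'Large Language Models', 'Pre-training/Domain Adaptation', and 'Post-training/SFT' (10 points each). Other keywords such as MoE, SLMs, RAG, reasoning methods, alignment, and compression are not mentioned in the abstract and score 0.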

!!! tip deepseek-chat TL;DR

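The paper studies how small amounts of speech data can close the speech-text modality gap in LLM-based ASR and improve domain adaptation. Mixed batching with only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning on the full dataset.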

Abstract translation (Chinese)

传统的端到端自动语音识别系统依赖成对的语音-文本数据进行领域适应。近期基于大语言模型的ASR架构通过投影模块将语音编码器与大语言模型连接，实现了仅用文本数据进行适应。然而，这引入了模态鸿沟，因为LLM并未接触语音投影器产生的含噪声表征。我们探究少量语音数据能否缓解这种不匹配问题。我们比较了三种策略：纯文本适应、成对语音-文本适应以及混合批处理（MB，将两者结合）。在领域内和跨领域场景下的实验表明，即使有限的语音数据也能持续提升性能。值得注意的是，仅使用目标领域10%（不足4小时）语音的MB方法，其词错误率已达到或超越使用完整数据集进行传统ASR微调的效果，这表明少量语音数据能提供强大的模态对齐信号。

Abstract

Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain (less than 4 hours) speech achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.

Keywords: LLM-based ASR, domain adaptation, speech-text gap, modality alignment, fine-tuning, speech encoder, word error rate, mixed batching
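
The abstract describes mixed batching (MB) only as combining text-only and paired speech-text adaptation. One plausible reading is that each training batch mixes the two kinds of examples; the sketch below illustrates that reading under stated assumptions. The `mixed_batches` helper, the per-batch `speech_fraction`, and the sampling with replacement are hypothetical and not confirmed by the source. Note that `speech_fraction` is a per-batch mixing ratio, which is different from the 10% of target-domain speech mentioned in the abstract.

```python
import random
from typing import Iterator


def mixed_batches(
    paired: list,
    text_only: list,
    batch_size: int,
    speech_fraction: float,
    num_batches: int,
    seed: int = 0,
) -> Iterator[list[tuple[str, object]]]:
    """Yield batches that mix paired speech-text and text-only examples."""
    rng = random.Random(seed)
    # Keep at least one paired example per batch so speech is always present.
    n_speech = max(1, round(batch_size * speech_fraction))
    n_text = batch_size - n_speech
    for _ in range(num_batches):
        batch = [("speech+text", x) for x in rng.choices(paired, k=n_speech)]
        batch += [("text-only", x) for x in rng.choices(text_only, k=n_text)]
        rng.shuffle(batch)
        yield batch


# Placeholder example identifiers; a real pipeline would hold audio and transcripts.
paired = [f"utt{i}" for i in range(5)]
text_only = [f"sent{i}" for i in range(50)]
for batch in mixed_batches(paired, text_only, batch_size=8,
                           speech_fraction=0.25, num_batches=2):
    print([kind for kind, _ in batch])
```

Sampling the small paired pool with replacement lets every batch contain speech even when paired data is scarce, at the cost of repeating utterances more often than text-only sentences.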

172. ❌ ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Authors: Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

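Rationale: The paper studies culture-conditioned visual value grounding in multimodal large language models (MLLMs). It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points) because MLLMs extend LLMs, and to "Instruction Tuning OR Alignment OR Value Alignment" (10 points) because it directly evaluates cultural value alignment in the visual modality. The remaining keywords (MoE, SLMs, Scaling Laws, training methods, inference techniques, compression, agents, and so on) are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper introduces the ValueGround benchmark for evaluating the visual grounding of culture-conditioned value judgments in multimodal large language models. When response options are visualized, average accuracy drops from 72.8% in the text-only setting to 65.8%, even though option-image alignment accuracy reaches 92.8%.

Abstract (Chinese translation)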

文化价值观不仅通过语言表达，也通过视觉场景和日常社会实践呈现。然而，现有对语言模型中文化价值观的评估几乎完全基于纯文本，这使得当回答选项以视觉形式呈现时，模型是否能够基于文化条件做出判断尚不明确。我们提出了ValueGround基准，用于评估多模态大语言模型（MLLMs）中基于文化的视觉价值基础。该基准基于世界价值观调查（WVS）问题构建，通过最小对比度的图像对来代表对立的回答选项，同时控制无关变量。给定一个国家、一个问题和一个图像对，模型必须在无法获取原始文本回答选项的情况下，选择最符合该国价值倾向的图像。在六个MLLMs和十三个国家的测试中，尽管选项与图像对齐的准确率达到92.8%，但平均准确率从纯文本设置的72.8%下降到选项可视化时的65.8%。性能更强的模型表现出更好的鲁棒性，但所有模型仍容易出现预测反转。我们的基准为研究文化条件价值判断的跨模态迁移提供了一个受控测试平台。

Abstract

Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country’s value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

Keywords: multimodal large language models, MLLMs, cultural values, visual value grounding, benchmark, World Values Survey, cross-modal transfer, prediction reversals

173. ❌ DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Authors: Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena, Monica S. Lam Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

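Rationale: The paper's core is an LLM-based agentic system (DataSTORM) for deep research over structured databases, so it is highly relevant to "LLM Agents" (10 points) and uses LLMs directly (10 points). The system involves multi-step reasoning (CoT: 8 points) and in-depth analysis (System 2: 8 points). It requires retrieval and generation (RAG: 5 points) and may involve tool use (5 points). Other keywords such as MoE, quantization, and alignment are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

DataSTORM is an LLM-based agentic system for deep research over large structured databases. Using iterative hypothesis generation and quantitative reasoning, it achieves a 19.4% relative improvement in insight-level recall on InsightBench and outperforms ChatGPT Deep Research on a dataset built on ACLED.

Abstract (Chinese translation)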

基于大型语言模型（LLM）智能体的深度研究正成为一种强大的范式，用于多步骤信息发现、综合与分析。然而，现有方法主要关注非结构化网络数据，而在大规模结构化数据库上进行深度研究所面临的挑战仍相对缺乏探索。与基于网络的研究不同，以数据为中心的有效研究不仅需要检索和总结，更要求迭代式的假设生成、对结构化模式的定量推理，以及向连贯分析叙事的收敛。
本文提出DataSTORM，一个基于LLM的智能体系统，能够自主地在大规模结构化数据库和互联网资源中进行研究。该系统以探索性数据分析（Exploratory Data Analysis）和数据叙事（Data Storytelling）原则为基础，将结构化数据的深度研究重新定义为一种论点驱动的分析过程：从数据中发现候选论点，通过迭代式的跨源调查进行验证，并将其发展为连贯的分析叙事。我们在InsightBench上评估DataSTORM，该系统取得了新的最优结果，在洞察级召回率上实现了19.4%的相对提升，在摘要级得分上提升了7.2%。我们进一步引入了一个基于真实世界复杂数据库ACLED构建的新数据集，并证明DataSTORM在自动化指标和人工评估中均优于ChatGPT深度研究等专有系统。

Abstract

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

Keywords: LLM agents, structured databases, exploratory data analysis, data storytelling, multi-step reasoning, autonomous research, quantitative reasoning, analytical narrative

174. ❌ Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

Authors: Mario Iacobelli, Adrian Robert Minut, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Iacopo Masi, Emanuele Rodolà Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

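Rationale: The paper focuses on improving the reasoning efficiency of large language models (LLMs) through multi-objective evolutionary model merging, which shortens outputs while preserving Chain of Thought / System 2 reasoning ability. It is therefore highly relevant to "Large Language Models", "Chain of Thought", "System 2 Thinking", and "Model Merging" (10 points each). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To shorten reasoning-model outputs while keeping accuracy high, the paper proposes the Evo-L2S framework. Using multi-objective evolutionary model merging, it reduces reasoning-trace length by more than 50% on six mathematical reasoning benchmarks while preserving or improving accuracy.

Abstract (Chinese translation)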

推理模型通过利用长链思维已展现出解决复杂问题的卓越能力。然而，这种更为审慎的推理过程在推断时伴随着巨大的计算开销。长到短（Long-to-Short，L2S）推理问题旨在使用更少的标记词保持高精度，但当前无需训练（training-free）的模型融合方法依赖于标量化、固定超参数的算术方法，这些方法极其脆弱且被迫做出次优的折衷。为弥补这一不足，我们提出了Evo-L2S——一个将L2S推理构建为多目标优化挑战的新型框架。通过利用进化模型融合技术，Evo-L2S显式地优化了精度与输出长度之间的权衡，从而生成一个稳健的融合模型帕累托前沿。为使这种搜索在大型语言模型上具有计算可行性，我们提出了一种基于熵的子集采样技术，大幅降低了适应度估计的开销。在六个数学推理基准上，对1.5B、7B和14B参数规模进行的全面实验表明，Evo-L2S能将生成的推理轨迹长度减少50%以上，同时保持甚至提升原始推理模型的问题解决精度。

Abstract

Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.

Keywords: reasoning models, model merging, multi-objective optimization, evolutionary algorithm, Long-to-Short reasoning, computational efficiency, Pareto front, mathematical reasoning
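
The accuracy-versus-length trade-off described in the abstract can be made concrete with a small Pareto-front filter. This is only an illustrative sketch, not the Evo-L2S implementation: the `Candidate` type, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical, and the evolutionary search and entropy-based subset sampling from the paper are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical merged model with its two measured objectives."""
    name: str
    accuracy: float     # higher is better
    mean_tokens: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.mean_tokens <= b.mean_tokens
    strictly_better = a.accuracy > b.accuracy or a.mean_tokens < b.mean_tokens
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up illustrative numbers, not results from the paper.
population = [
    Candidate("merge_a", accuracy=0.80, mean_tokens=4000),
    Candidate("merge_b", accuracy=0.79, mean_tokens=1800),
    Candidate("merge_c", accuracy=0.70, mean_tokens=2500),  # dominated by merge_b
    Candidate("merge_d", accuracy=0.60, mean_tokens=900),
]
front = pareto_front(population)
front_names = [c.name for c in front]
```

A scalarized method would collapse the two objectives into one weighted score and return a single model. Keeping the whole front instead lets a user pick the accuracy and length balance afterwards.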

175. ❌ Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

Authors: Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

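Rationale: The paper studies dialectal Arabic machine translation and achieves controllable translation through rule-based data augmentation and fine-tuning of an mT5 model. It is only partly related to "Post-training OR Supervised Fine-tuning OR SFT" (5 points) because it uses fine-tuning. No other keyword applies: the paper involves neither innovation in large-model techniques nor scientific-domain applications, so the rest score 0.

!!! tip deepseek-chat TL;DR

To address the inadequate handling of dialectal diversity in Arabic machine translation, the paper proposes a controllable translation framework based on rule-based data augmentation and conditioned fine-tuning. It preserves dialectal specificity better than the baselines, at the cost of a lower BLEU score (8.19 vs. 13.75 for NLLB).

Abstract (Chinese translation)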

当前针对阿拉伯语的机器翻译系统往往难以处理方言多样性，常将方言输入同质化为现代标准阿拉伯语，且用户对目标方言的控制能力有限。本研究提出一种面向阿拉伯语方言翻译的语境感知可调控框架，该框架能显式建模地域与社会语言变异。我们的主要技术贡献是构建了一套基于规则的数据增强流程，将包含3000句的种子语料库扩展为平衡的57000句平行数据集，涵盖埃及、黎凡特、海湾等八个地区变体。通过对mT5-base模型进行基于轻量级元数据标签的微调，该方法实现了翻译输出中跨方言与社会语域的可控生成。
结合自动评估与定性分析，我们观察到明显的准确度与忠实度权衡：高资源基线模型（如NLLB）通过默认偏向现代标准阿拉伯语均值获得了更高的整体BLEU分数（13.75），但方言特异性表现有限；相比之下，我们的模型虽然BLEU分数较低（8.19），却能产出更贴近目标地区变体的译文。支持性定性评估（包括基于大语言模型的文化真实性分析）表明，相较于基线系统（4.80/5对比1.0/5），本模型在方言对齐方面有所提升。这些发现揭示了标准机器翻译评估指标在方言敏感任务中的局限性，并表明需要建立能更好反映阿拉伯语机器翻译中语言多样性的评估实践。

Abstract

Current Machine Translation (MT) systems for Arabic often struggle to account for dialectal diversity, frequently homogenizing dialectal inputs into Modern Standard Arabic (MSA) and offering limited user control over the target vernacular. In this work, we propose a context-aware and steerable framework for dialectal Arabic MT that explicitly models regional and sociolinguistic variation. Our primary technical contribution is a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset, covering eight regional varieties, e.g., Egyptian, Levantine, and Gulf. By fine-tuning an mT5-base model conditioned on lightweight metadata tags, our approach enables controllable generation across dialects and social registers in the translation output. Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by defaulting toward the MSA mean, while exhibiting limited dialectal specificity. In contrast, our model achieves lower BLEU scores (8.19) but produces outputs that align more closely with the intended regional varieties. Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs. 1.0/5). These findings highlight the limitations of standard MT metrics for dialect-sensitive tasks and motivate the need for evaluation practices that better reflect linguistic diversity in Arabic MT.

Keywords: Dialectal Arabic Machine Translation, Rule-Based Data Augmentation, Controllable Generation, Regional Varieties, Sociolinguistic Variation, mT5-base, BLEU Score, Cultural Authenticity
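
Conditioning a sequence-to-sequence model on "lightweight metadata tags" is commonly done by prepending control tokens to the source text. The sketch below assumes that approach. The tag syntax, the `tag_source` helper, and the register labels are hypothetical and are not taken from the paper; only the three region names come from the abstract.

```python
# Three of the eight regional varieties named in the abstract.
REGIONS = {"egyptian", "levantine", "gulf"}
# Hypothetical register labels, chosen for illustration only.
REGISTERS = {"formal", "casual"}


def tag_source(text: str, region: str, register: str) -> str:
    """Prefix a source sentence with region and register control tags."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    if register not in REGISTERS:
        raise ValueError(f"unknown register: {register!r}")
    # Assumed tag format; a real system would register these as special tokens.
    return f"<region={region}> <register={register}> {text.strip()}"


tagged = tag_source("Where is the train station?", "egyptian", "casual")
```

At training time each augmented sentence pair would carry the tags matching its target variety, and at inference time the user's region and register selection would fill the same slots.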

176. ❌ Learning to Interrupt in Language-based Multi-agent Communication

Authors: Danqing Wang, Da Yin, Ruta Desai, Lei Li, Asli Celikyilmaz, Ansong Ni Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06452v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

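Rationale: The paper's core is an interruption mechanism for LLM-based multi-agent communication. It is highly relevant to "Large Language Models", "LLM Agents", and "Multi-agent Systems" (10 points each) because it explicitly builds multi-agent systems from LLMs and studies how to optimize their communication. It is partly related to "Context Window Extension" (5 points) because one goal is to cut verbose output and so relieve context overload. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address verbose communication in LLM-based multi-agent systems, which overloads context and raises computational cost, the paper proposes HANDRAISER, an interruptible communication framework. By predicting appropriate interruption points, it cuts communication cost by 32.2% while maintaining or improving task performance.

Abstract (Chinese translation)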

采用大型语言模型（LLM）的多智能体系统已在多个领域展现出卓越能力。然而，当前智能体间的通信存在输出冗长的问题，导致上下文负载过重并增加计算成本。尽管现有方法侧重于从发言者端压缩信息，但它们难以适应不同听众并识别相关信息。人类沟通中的一个有效方式是允许倾听者打断发言并表达观点或请求澄清。受此启发，我们提出了一种可中断的通信框架，允许正在倾听的智能体打断当前发言者。通过提示实验，我们发现当前的大型语言模型常表现出过度自信，在接收足够信息前便进行打断。因此，我们提出一种学习方法，基于预估的未来收益与成本预测合适的打断时机。我们在多种多智能体场景中评估了该框架，包括双智能体文本猜词游戏、三智能体会议安排以及三智能体辩论。实验结果表明，与基线方法相比，我们的HANDRAISER框架在保持相当或更优任务性能的同时，能够降低32.2%的通信成本。这种习得的打断行为还可泛化至不同的智能体与任务。

Abstract

Multi-agent systems using large language models (LLMs) have demonstrated impressive capabilities across various domains. However, current agent communication suffers from verbose output that overloads context and increases computational costs. Although existing approaches focus on compressing the message from the speaker side, they struggle to adapt to different listeners and identify relevant information. An effective practice in human communication is to allow the listener to interrupt and express their opinion or ask for clarification. Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker. Through prompting experiments, we find that current LLMs are often overconfident and interrupt before receiving enough information. Therefore, we propose a learning method that predicts the appropriate interruption points based on the estimated future reward and cost. We evaluate our framework across various multi-agent scenarios, including 2-agent text pictionary games, 3-agent meeting scheduling, and 3-agent debate. The experimental results show that our HANDRAISER can reduce the communication cost by 32.2% compared to the baseline with comparable or superior task performance. This learned interruption behavior can also be generalized to different agents and tasks.

Keywords: multi-agent systems, large language models, agent communication, interruption mechanism, communication cost reduction, context window optimization, HANDRAISER framework, agent coordination
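
The abstract describes choosing interruption points from an estimated future reward and cost. One simple way to picture such a rule is a stopping criterion: keep listening while the estimated benefit of hearing more exceeds the cost of the next chunk. The sketch below is an assumption-laden toy, not HANDRAISER itself: `first_interruption_point`, the hand-written gain estimator, and the word-count cost model are all hypothetical, whereas the paper learns its predictor.

```python
from typing import Callable, Sequence


def first_interruption_point(
    chunks: Sequence[str],
    expected_gain: Callable[[Sequence[str]], float],
    cost_per_word: float,
) -> int:
    """Return how many chunks the listener hears before interrupting.

    The listener interrupts as soon as the estimated gain from continuing
    falls below the cost of receiving the next chunk.
    """
    for heard in range(1, len(chunks)):
        # Whitespace word count is a crude stand-in for a token count.
        next_cost = cost_per_word * len(chunks[heard].split())
        if expected_gain(chunks[:heard]) < next_cost:
            return heard
    return len(chunks)  # never interrupted


# Toy speaker message, split into four five-word chunks.
message = [
    "I am free on Monday",
    "and also on Tuesday morning",
    "but not on Wednesday afternoon",
    "and Thursday is fully booked",
]


def halving_gain(heard: Sequence[str]) -> float:
    """Toy estimator: each chunk already heard halves the remaining value."""
    return 1.0 / (2 ** len(heard))


stop_after = first_interruption_point(message, halving_gain, cost_per_word=0.04)
```

The overconfidence reported in the abstract corresponds to an estimator whose gain drops too fast, which makes the listener interrupt before it has enough information.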

177. ❌ The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

Authors: Yi Xu, Philipp Jettkant, Laura Ruis Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06427v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

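Rationale: The paper's core is the limits of latent reasoning in LLMs, which is closely tied to Chain of Thought (CoT) reasoning, so "Chain of Thought OR CoT Reasoning OR Multi-step Reasoning" is highly relevant (10 points). It concerns latent planning depth in LLMs, which relates to "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (8 points). It tests models such as GPT-4o and Qwen3-32B, so "Large Language Models OR LLMs OR Foundation Models" is highly relevant (10 points). It mentions few-shot prompting and fine-tuned models, so "In-context Learning OR Many-shot Learning" and "Post-training OR Supervised Fine-tuning OR SFT" are partly related (5 points each). Other keywords such as MoE, Scaling Laws, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies the limits of LLMs' ability to discover and execute multi-step latent planning strategies without supervision on intermediate steps. Models learn at most five latent planning steps during training, yet the discovered strategy generalizes to eight steps at test time, revealing a dissociation between discovering a strategy and executing it.

Abstract (Chinese translation)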

思维链（CoT）监控的可行性依赖于模型无法在其潜在表征中进行有效推理。然而，对于大型语言模型中此类潜在推理的极限，我们知之甚少。我们通过研究模型能否在无中间步骤监督的情况下，于单次前向传播中发现多步规划策略并潜在执行，来测试这些极限。利用可精确控制所需潜在规划步数的图路径查找任务，我们揭示了一个大规模扩展亦未解决的显著局限：从头训练的小型变换器（transformers）最多能发现需要三步潜在步骤的策略，经微调的GPT-4o和Qwen3-32B可达五步，而GPT-5.4在少量样本提示下能达到七步。尽管模型在训练中能习得的潜在规划深度上限为五步，但所发现的策略在测试时能泛化至八步潜在步骤。这揭示了仅凭最终答案监督发现潜在策略的能力，与策略一旦被发现后的执行能力之间存在分离。若类似局限在更广泛范围内成立，则可能需要明确教授或外化那些需要多步协调潜在规划的策略，这为思维链监控提供了依据。

Abstract

The viability of chain-of-thought (CoT) monitoring hinges on models being unable to reason effectively in their latent representations. Yet little is known about the limits of such latent reasoning in LLMs. We test these limits by studying whether models can discover multi-step planning strategies without supervision on intermediate steps and execute them latently, within a single forward pass. Using graph path-finding tasks that precisely control the number of required latent planning steps, we uncover a striking limitation unresolved by massive scaling: tiny transformers trained from scratch discover strategies requiring up to three latent steps, fine-tuned GPT-4o and Qwen3-32B reach five, and GPT-5.4 attains seven under few-shot prompting. Although the maximum latent planning depth models can learn during training is five, the discovered strategy generalizes up to eight latent steps at test-time. This reveals a dissociation between the ability to discover a latent strategy under final-answer supervision alone and the ability to execute it once discovered. If similar limits hold more broadly, strategies requiring multiple coordinated latent planning steps may need to be explicitly taught or externalized, lending credence to CoT monitoring.

Keywords: Large Language Models, Chain of Thought, Latent Planning, Multi-step Reasoning, Graph Path-finding, Scaling Limits, Few-shot Prompting, Generalization
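
To illustrate what a path-finding task with a controlled number of planning steps might look like, the sketch below builds a two-branch graph in which picking the correct first move requires looking `depth` edges ahead. The construction is a guess for illustration: `two_branch_task` and the node names are hypothetical, and the paper's actual task format may differ. The breadth-first search is the standard algorithm and serves only to produce the ground-truth answer.

```python
from collections import deque
from typing import Optional


def two_branch_task(depth: int) -> dict[str, list[str]]:
    """Build a graph with two chains of `depth` edges leaving node 's'.

    Only the 'a' chain ends at 'goal'; the 'b' chain ends at 'dead_end'.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    graph: dict[str, list[str]] = {}
    for prefix, leaf in (("a", "goal"), ("b", "dead_end")):
        nodes = ["s"] + [f"{prefix}{i}" for i in range(1, depth)] + [leaf]
        for src, dst in zip(nodes, nodes[1:]):
            graph.setdefault(src, []).append(dst)
    return graph


def shortest_path(graph: dict[str, list[str]], start: str, goal: str) -> Optional[list[str]]:
    """Breadth-first search returning the node sequence from start to goal."""
    parents: dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parents[current]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


path = shortest_path(two_branch_task(3), "s", "goal")
```

Under final-answer supervision, a model would see only the graph and the correct first move (here the node after `s`), so any look-ahead along the chain has to happen inside a single forward pass.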

178. ❌ Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking

Authors: Georgi Grazhdanski, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06424v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

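Rationale: The paper studies transformer-based named entity recognition and entity linking for medical symptoms, fine-tuning a RoBERTa-based model and using the pretrained SapBERT model. It is unrelated to most large-model keywords (LLMs, MoE, Scaling Laws, and so on), which concern architectures, training methods, and inference optimization, whereas this is a domain-specific application study. Only two keywords apply: 1) "Post-training OR Supervised Fine-tuning OR SFT": the paper explicitly fine-tunes RoBERTa, which counts as supervised fine-tuning, so it receives 5 points (partly related, but not the core contribution). 2) "AI for Science OR Bioinformatics OR Cheminformatics": the paper targets medical symptom recognition (the SympTEMIST task), a bioinformatics / AI-for-science application, so it receives 8 points (highly relevant, the core application area). Other keywords such as RAG, CoT, and Agents are not addressed.

!!! tip deepseek-chat TL;DR

The paper presents a transformer-based approach to named entity recognition and entity linking for medical symptoms, fine-tuning a RoBERTa-based model for recognition and using SapBERT for linking. The choice of knowledge base has the largest impact on accuracy.

Abstract (Chinese translation)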

本文提出了一种基于Transformer的方法来解决SympTEMIST命名实体识别与实体链接任务。在命名实体识别方面，我们在增强的训练集上对基于RoBERTa的（1）词元级分类器进行了微调，并引入了双向长短期记忆网络与条件随机场层。实体链接则通过使用跨语言SapBERT XLMR-Large模型（2）生成候选实体，并计算其与知识库中实体的余弦相似度来完成。实验证明，知识库的选择对模型准确率具有最显著的影响。

Abstract

This paper presents a transformer-based approach to solving the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. For NER, we fine-tune a RoBERTa-based (1) token-level classifier with BiLSTM and CRF layers on an augmented train set. Entity linking is performed by generating candidates using the cross-lingual SapBERT XLMR-Large (2), and calculating cosine similarity against a knowledge base. The choice of knowledge base proves to have the highest impact on model accuracy.

Keywords: transformer-based approach, named entity recognition, entity linking, RoBERTa, fine-tuning, SapBERT, symptom recognition, knowledge base
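
The entity-linking step in the abstract ranks knowledge-base entries by cosine similarity to the mention embedding. A minimal sketch of that ranking follows, assuming embeddings are already available as plain lists of floats. The three-dimensional vectors, the `KB:000x` identifiers, and the `link_mention` helper are invented for illustration; in the paper the vectors come from SapBERT XLMR-Large.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(mention_vec: list[float], kb: dict[str, list[float]]) -> str:
    """Return the knowledge-base id whose embedding is closest to the mention."""
    return max(kb, key=lambda code: cosine(mention_vec, kb[code]))


# Toy knowledge base with made-up ids and embeddings.
knowledge_base = {
    "KB:0001": [1.0, 0.0, 0.0],
    "KB:0002": [0.0, 1.0, 0.0],
    "KB:0003": [0.7, 0.7, 0.0],
}
best = link_mention([0.9, 0.1, 0.0], knowledge_base)
```

Because the ranking can only return entries present in `knowledge_base`, its coverage bounds linking accuracy, which is consistent with the abstract's finding that the choice of knowledge base matters most.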

179. ❌ Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries

Authors: Rebecca M. M. Hicke, Sil Hamilton, David Mimno, Ross Deans Kristensen-McLachlan Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06416v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

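Rationale: The paper evaluates LLM performance on a long-text understanding task (novel summarization) and so directly assesses LLM capabilities, which gives "Large Language Models" 10 points. Its focus on long-text processing is partly related to "Context Window Extension" (5 points). It compares human and LLM summary patterns to understand model behavior, an interpretive analysis that is partly related to "Mechanistic Interpretability" (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

By comparing novel summaries written by humans and by nine state-of-the-art LLMs, the study evaluates LLM comprehension of long-form narrative. It finds that models distribute their focus differently from humans, emphasizing the ends of texts, which points to limits in LLM narrative understanding.

Abstract (Chinese translation)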

尽管大语言模型的上下文长度不断增长，有证据表明其整合长文本信息的能力并未同步提升。我们评估了其中一项理解任务：生成小说摘要。当人类撰写者压缩故事时，其摘要内容揭示了他们认定的叙事核心要素。因此，通过比较人类与大语言模型撰写的摘要，我们可以评估模型是否复现了人类处理文本概念的模式。为衡量概念关注度，我们将150篇人工撰写的小说摘要中的句子与其对应的具体章节进行对齐。我们证明了这一对齐任务的难度，这反映出摘要生成任务本身的复杂性。随后，我们使用九种前沿大语言模型为150篇参考文本分别生成摘要并进行对齐。通过对比人类与模型生成的摘要，我们发现两者在文本风格上存在差异，且在叙事过程中的关注分布模式也不同——模型倾向于强调文本结尾部分。将人类叙事关注模式与模型注意力机制进行对比，为解释模型叙事理解能力不足提供了依据，并为未来发展指明了改进方向。我们公开本数据集以支持后续研究。

Abstract

Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment task, which indicates the complexity of summarization as a task. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.

Keywords: LLM, summarization, narrative understanding, attention mechanisms, long-form texts, human-model comparison, conceptual engagement, novel summaries
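The abstract above describes aligning summary sentences to the chapters they reference and then comparing how focus is distributed across a narrative. A minimal sketch of that comparison follows, assuming the alignment step has already produced one chapter index per summary sentence. The function name, the bin count, and the toy alignments are illustrative assumptions, not the paper's data or code.

```python
from collections import Counter

def focus_distribution(aligned_chapters, num_chapters, num_bins=5):
    """Return the share of summary sentences that fall in each narrative bin.

    aligned_chapters: 1-based chapter index referenced by each summary sentence
                      (hypothetical output of an alignment step).
    num_chapters:     total number of chapters in the novel.
    """
    counts = Counter()
    for chapter in aligned_chapters:
        # Integer arithmetic maps chapter 1..num_chapters onto bins 0..num_bins-1.
        counts[(chapter - 1) * num_bins // num_chapters] += 1
    total = len(aligned_chapters)
    return [counts[b] / total for b in range(num_bins)]

# Toy alignments for a 20-chapter novel (made-up numbers, not from the paper).
human = [1, 2, 4, 6, 9, 11, 13, 15, 18, 20]
model = [1, 3, 10, 15, 16, 17, 18, 19, 20, 20]

human_dist = focus_distribution(human, 20)
model_dist = focus_distribution(model, 20)
print("human:", human_dist)
print("model:", model_dist)
```

In this toy data the last bin holds half of the model's sentences but only a fifth of the human's, which is the kind of end-of-text emphasis the abstract reports.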

180. ❌ State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Authors: Navan Preet Singh, Anurag Garikipati, Ahmed Abulkhair, Jyani Akshay Jagdishbhai, Atul Yaduvanshi, Amarendra Chaudhary, Madalina Ciobanu, Qingqing Mao, Ritankar Das Journal/Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06421v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

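Scoring rationale: The paper's core topic is an Arabic LLM built on a sparse MoE architecture (Arabic-DeepSeek-R1), trained with a four-phase CoT distillation scheme, that outperforms GPT-5.1 on Arabic benchmarks. Highly relevant keywords: LLMs (core topic), MoE (sparse MoE backbone), CoT Reasoning (CoT distillation scheme). Moderately relevant keywords: Data Quality (bilingual data curation), Domain Adaptation (adaptation to Arabic), SFT (fine-tuning), Alignment (integration of ethical norms), PEFT (parameter-efficient adaptation), RAG (retrieval-augmented benchmark), System 2 Thinking (reasoning), Factuality (safety benchmark). Unrelated keywords: SLMs, RLHF, long context, inference acceleration, agents, and similar topics are not covered.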

!!! tip deepseek-chat TL;DR

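Using a sparse MoE architecture and an Arabic-specific CoT distillation scheme, the paper develops Arabic-DeepSeek-R1, reported as the first model to systematically outperform GPT-5.1 on Arabic benchmarks, and argues that parameter-efficient adaptation can deliver breakthrough performance in low-resource languages.

Abstract translation (Chinese)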

本文介绍了Arabic-DeepSeek-R1，这是一个应用驱动的开源阿拉伯语大语言模型（LLM），其采用稀疏混合专家（MoE）主干网络，旨在解决代表性不足语言的数字公平性差距，并在整个开放阿拉伯语大语言模型排行榜（Open Arabic LLM Leaderboard, OALL）上确立了新的最高技术水平（SOTA）。我们设计的四阶段思维链（CoT）蒸馏方案，将阿拉伯语特有的语言验证和区域伦理规范，整合到一个经过污染控制、包含3.72亿词元、阿拉伯语-英语比例为80/20的训练数据混合体中。Arabic-DeepSeek-R1在OALL包含的七项基准测试套件中取得了最高的平均分，同时在多项测试中确立或接近SOTA水平，包括在侧重语法的MadinahQA上取得主导性结果（以显著优势超越GPT-5.1和OALL榜首模型）、面向安全的AraTrust、多能力评估的AlGhafa以及检索增强的ALRAGE。我们的结果表明，稀疏MoE架构、融合文化认知并包含明确阿拉伯语语言检查的CoT蒸馏，以及策略性的双语数据策展相结合，使得一个开源适配模型能够在评估综合性语言特定任务的大多数基准上，系统性地超越专有前沿系统GPT-5.1：这是阿拉伯语大语言模型的首次此类展示。这些发现表明，当前大语言模型生态系统中阿拉伯语性能不足的主要原因在于专业化程度不够，而非架构限制；同时，对开源推理模型进行参数高效的适配，无需工业规模的预训练成本即可实现突破性的SOTA性能。Arabic-DeepSeek-R1为具有主权性和领域特定性的语言技术建立了一个经过验证且可复现的框架，证明了基于文化背景对稀疏MoE主干网络进行战略性适配，为低资源语言在标准化基准测试中实现破纪录的性能，提供了一条可行且高性价比的路径。

Abstract

This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically outperform the proprietary frontier system GPT-5.1 on the majority of benchmarks evaluating comprehensive language-specific tasks: the first such demonstration for Arabic LLMs. These findings indicate that much of Arabic’s performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations, and that parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs. Arabic-DeepSeek-R1 establishes a validated and replicable framework for sovereign and domain-specific language technologies, demonstrating that strategic, culturally-grounded adaptation of sparse MoE backbones offers a viable and cost-effective pathway to achieving record-breaking performance across standardized benchmarks for low-resource languages.

Keywords: Arabic LLM, Sparse MoE, Chain-of-Thought distillation, Parameter-efficient adaptation, Bilingual data curation, Open Arabic LLM Leaderboard, Cultural adaptation, Low-resource languages

181. ❌ Say Something Else: Rethinking Contextual Privacy as Information Sufficiency

Authors: Yunze Xiao, Wenkai Li, Xiaoyuan Wu, Ningshan Ma, Yueqi Song, Weihao Xuan Journal/Source: arXiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06409v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

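Scoring rationale: The paper studies LLM agents in privacy-preserving communication. Its core concerns LLM agents (LLM Agents) and the underlying large models (Large Language Models), so these two keywords are highly relevant (10 each). Other keywords such as MoE, quantization, inference acceleration, and AI for Science are not mentioned in the abstract and are unrelated to the paper's technical content (0).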

!!! tip deepseek-chat TL;DR

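The paper frames privacy-preserving communication by LLM agents as an information sufficiency problem, proposes a free-text pseudonymization strategy and a conversational evaluation protocol, and finds that pseudonymization gives the best privacy-utility tradeoff while single-message evaluation systematically underestimates leakage risk.

Abstract translation (Chinese)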

随着大语言模型代理日益频繁地代表用户起草信息，用户却常常过度分享敏感信息，且对何为隐私存在分歧。现有系统仅支持抑制（省略敏感信息）和泛化（用抽象概念替换信息）两种策略，且通常仅在单条孤立信息上进行评估，这使得策略空间和评估设置均不完整。我们将保护隐私的大语言模型通信形式化为一项信息充分性任务，引入自由文本假名化作为第三种策略——该策略以功能等效的替代项替换敏感属性，并提出一种对话式评估协议，用于在现实的多轮追问压力下评估策略。我们在涵盖三种权力关系类型（制度性、同侪、亲密关系）和三种敏感类别（歧视风险、社会成本、边界）的792个场景中，从隐私性（两个粒度）、隐蔽性和实用性三个维度评估了七个前沿大语言模型。假名化策略整体上实现了最强的隐私-效用权衡，而单条信息评估系统性地低估了信息泄露风险，其中泛化策略在追问下隐私性损失最高可达16.3个百分点。

Abstract

LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private. Existing systems support only suppression (omitting sensitive information) and generalization (replacing information with an abstraction), and are typically evaluated on single isolated messages, leaving both the strategy space and evaluation setting incomplete. We formalize privacy-preserving LLM communication as an Information Sufficiency (IS) task, introduce free-text pseudonymization as a third strategy that replaces sensitive attributes with functionally equivalent alternatives, and propose a conversational evaluation protocol that assesses strategies under realistic multi-turn follow-up pressure. Across 792 scenarios spanning three power-relation types (institutional, peer, intimate) and three sensitivity categories (discrimination risk, social cost, boundary), we evaluate seven frontier LLMs on privacy at two granularities, covertness, and utility. Pseudonymization yields the strongest privacy-utility tradeoff overall, and single-message evaluation systematically underestimates leakage, with generalization losing up to 16.3 percentage points of privacy under follow-up.

Keywords: LLM agents, privacy-preserving communication, information sufficiency, free-text pseudonymization, conversational evaluation, privacy-utility tradeoff, contextual privacy

182. ❌ TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Authors: Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07340v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

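Scoring rationale: TC-AE focuses on deep compression autoencoders for visual generation and uses a ViT architecture to address latent representation collapse under high compression ratios. All keywords target LLM-related techniques, whereas this paper studies image compression and generation in computer vision and has no direct connection to LLMs. The only slightly relevant keyword is 'Quantization OR Model Compression OR Low-bit Weights', because the paper involves compression (deep compression), but this is not LLM quantization or low-bit weights, so it scores 5 (somewhat related). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are unrelated to this visual generation work.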

!!! tip deepseek-chat TL;DR

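The paper proposes TC-AE, a ViT-based deep compression autoencoder architecture that decomposes token-to-latent compression into two stages and strengthens the semantic structure of image tokens to address latent representation collapse under high compression ratios, substantially improving reconstruction and generative performance.

Abstract translation (Chinese)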

我们提出TC-AE，一种基于视觉Transformer（ViT）的深度压缩自编码器架构。现有方法通常通过增加潜在表征的通道数来维持高压缩比下的重建质量，但这种策略常导致潜在表征崩溃，从而降低生成性能。TC-AE并未依赖日益复杂的架构或多阶段训练方案，而是从像素与图像潜在表征之间的关键桥梁——令牌空间的角度出发，通过两项互补的创新应对这一挑战：首先，我们在固定潜在表征预算下，通过调整ViT的补丁尺寸研究令牌数量缩放，并发现激进的令牌到潜在表征压缩是限制有效缩放的关键因素。为解决此问题，我们将令牌到潜在表征压缩分解为两个阶段，减少结构信息损失，并为生成任务实现有效的令牌数量缩放。其次，为进一步缓解潜在表征崩溃，我们通过联合自监督训练增强图像令牌的语义结构，从而获得更利于生成的潜在表征。基于这些设计，TC-AE在深度压缩条件下实现了显著提升的重建与生成性能。我们希望本研究能推动基于ViT的视觉生成令牌化器的发展。

Abstract

We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizer for visual generation.

Keywords: TC-AE, deep compression autoencoders, ViT-based architecture, token capacity, latent representation collapse, token-to-latent compression, visual generation, generative performance
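The abstract above notes that shrinking the ViT patch size under a fixed latent budget raises the token count, which makes the token-to-latent step more aggressive. The arithmetic can be sketched as follows. The image size, token width, and latent budget are hypothetical values chosen for illustration, and the ratio defined here is one simple way to quantify the compression, not necessarily the paper's measure.

```python
def token_count(image_size, patch_size):
    """Number of ViT tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    return (image_size // patch_size) ** 2

def token_to_latent_ratio(image_size, patch_size, token_dim, latent_count, latent_dim):
    """How many token values are squeezed into each latent value."""
    token_values = token_count(image_size, patch_size) * token_dim
    latent_values = latent_count * latent_dim
    return token_values / latent_values

# Hypothetical setting: 256x256 image, 768-dim tokens,
# fixed latent budget of 64 latents x 32 channels.
for patch in (32, 16, 8):
    n = token_count(256, patch)
    r = token_to_latent_ratio(256, patch, 768, 64, 32)
    print(f"patch={patch:>2}  tokens={n:>4}  token-to-latent ratio={r:.0f}x")
```

Halving the patch size quadruples the token count, so with the latent budget held fixed the ratio also quadruples at each step (24x, 96x, 384x in this setting).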

183. ❌ From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians

Authors: Diego Gomez, Antoine Guédon, Nissim Maruani, Bingchen Gong, Maks Ovsjanikov Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07337v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

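Scoring rationale: The paper focuses on surface reconstruction with 3D Gaussian Splatting and proposes Gaussian Wrapping, a method for extracting watertight meshes from Gaussian elements; it belongs to computer vision and graphics. All scoring keywords relate to large language models, the principles of deep learning techniques, or AI for science applications, whereas this paper studies geometric representation and mesh extraction for 3D reconstruction and has no direct connection to these keywords.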

!!! tip deepseek-chat TL;DR

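The paper addresses the difficulty of extracting accurate surfaces from 3D Gaussian Splatting. By introducing a learnable oriented normal per Gaussian and an adapted attenuation formulation, it provides a new way to extract high-quality watertight meshes from Gaussian representations.

Abstract translation (Chinese)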

三维高斯泼溅（3D Gaussian Splatting，简称3DGS）彻底革新了快速新视角合成技术，但其基于不透明度的建模方式使得表面提取从根本上变得困难。与基于有向距离场或占据场的隐式方法不同，3DGS缺乏全局几何场，迫使现有方法只能依赖启发式策略，例如混合深度图的TSDF融合。
受“物体即体积”框架的启发，我们为高斯泼溅推导出一个理论完备的占据场，并展示了如何利用它来提取复杂场景的高精度水密网格。我们的核心贡献在于为每个高斯元素引入一个可学习的定向法向量，并定义了一种适配的衰减公式，从而推导出空间中任意位置的法向量场和占据场的闭式表达式。我们进一步提出了一种新颖的一致性损失和专门的致密化策略，通过填补几何空洞强制高斯元素包裹整个表面，确保形成一个完整的定向基元外壳。我们修改了可微分光栅化器，使其输出深度作为我们连续模型的等值面，并引入了“原始自适应网格化”方法，以实现任意分辨率下感兴趣区域的网格生成。
此外，我们揭示了标准表面评估协议中存在的根本性偏差，并提出了两种更为严谨的替代方案。总体而言，我们的方法“高斯包裹”在DTU及Tanks and Temples数据集上确立了新的技术标杆，能够以远低于同期工作的数据量生成完整的水密网格，并成功重建出诸如出了名难以捕捉的自行车辐条等细微结构。

Abstract

3D Gaussian Splatting (3DGS) has revolutionized fast novel view synthesis, yet its opacity-based formulation makes surface extraction fundamentally difficult. Unlike implicit methods built on Signed Distance Fields or occupancy, 3DGS lacks a global geometric field, forcing existing approaches to resort to heuristics such as TSDF fusion of blended depth maps. Inspired by the Objects as Volumes framework, we derive a principled occupancy field for Gaussian Splatting and show how it can be used to extract highly accurate watertight meshes of complex scenes. Our key contribution is to introduce a learnable oriented normal at each Gaussian element and to define an adapted attenuation formulation, which leads to closed-form expressions for both the normal and occupancy fields at arbitrary locations in space. We further introduce a novel consistency loss and a dedicated densification strategy to enforce Gaussians to wrap the entire surface by closing geometric holes, ensuring a complete shell of oriented primitives. We modify the differentiable rasterizer to output depth as an isosurface of our continuous model, and introduce Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution. We additionally expose fundamental biases in standard surface evaluation protocols and propose two more rigorous alternatives. Overall, our method Gaussian Wrapping sets a new state-of-the-art on DTU and Tanks and Temples, producing complete, watertight meshes at a fraction of the size of concurrent work, recovering thin structures such as the notoriously elusive bicycle spokes.

Keywords: 3D Gaussian Splatting, surface reconstruction, oriented Gaussians, watertight meshes, occupancy field, differentiable rasterizer, geometric holes, thin structures

184. ❌ Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

Authors: Huaiyuan Qin, Muli Yang, Gabriel James Goenawan, Kai Wang, Zheng Wang, Peng Hu, Xi Peng, Hongyuan Zhu Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07306v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

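Scoring rationale: The paper focuses on dynamic data pruning, specifically improving robustness under noisy labels, and proposes AlignPrune, a new method based on loss trajectory alignment. All keywords relate directly to large models, the principles of deep learning techniques, or AI for science applications, whereas this paper studies a general-purpose data pruning technique and does not involve large models, specific training methods, inference optimization, agent systems, or scientific applications, so every keyword has a relevance of 0.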

!!! tip deepseek-chat TL;DR

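The paper proposes AlignPrune, a noise-robust dynamic data pruning module that uses a loss-trajectory-based Dynamic Alignment Score to identify noisy samples more accurately, improving the accuracy of existing methods across noise types and pruning ratios (by up to 6.3% according to the abstract).

Abstract translation (Chinese)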

现有动态数据剪枝方法在噪声标签环境下常表现不佳，因其通常依赖逐样本损失作为排序标准。这可能导致错误地保留高损失值的噪声样本，造成显著的性能下降。为解决这一问题，我们提出AlignPrune——一个抗噪声模块，旨在提升标签噪声下动态剪枝的可靠性。具体而言，AlignPrune引入了动态对齐分数（Dynamic Alignment Score, DAS），这是一种基于损失轨迹的评估标准，能够更精准地识别噪声样本，从而提升剪枝效果。作为一个简单高效的即插即用模块，AlignPrune可无缝集成到先进的动态剪枝框架中，在不改变模型架构或训练流程的情况下持续超越基线方法。我们在五种广泛使用的基准数据集上，针对不同噪声类型和剪枝比例进行了大量实验，结果证明了AlignPrune的有效性，其最高可将准确率较现有先进基线提升6.3%。本研究为噪声数据下的剪枝任务提供了可推广的解决方案，推动了对现实场景中学习机制的进一步探索。代码已开源：https://github.com/leonqin430/AlignPrune。

Abstract

Existing dynamic data pruning methods often fail under noisy-label settings, as they typically rely on per-sample loss as the ranking criterion. This could mistakenly lead to preserving noisy samples due to their high loss values, resulting in significant performance drop. To address this, we propose AlignPrune, a noise-robust module designed to enhance the reliability of dynamic pruning under label noise. Specifically, AlignPrune introduces the Dynamic Alignment Score (DAS), which is a loss-trajectory-based criterion that enables more accurate identification of noisy samples, thereby improving pruning effectiveness. As a simple yet effective plug-and-play module, AlignPrune can be seamlessly integrated into state-of-the-art dynamic pruning frameworks, consistently outperforming them without modifying either the model architecture or the training pipeline. Extensive experiments on five widely-used benchmarks across various noise types and pruning ratios demonstrate the effectiveness of AlignPrune, boosting accuracy by up to 6.3% over state-of-the-art baselines. Our results offer a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real-world scenarios. Code is available at: https://github.com/leonqin430/AlignPrune.

Keywords: dynamic data pruning, noisy-label settings, loss trajectory alignment, Dynamic Alignment Score, noise-robust, plug-and-play module, pruning effectiveness, benchmark experiments
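The abstract above contrasts ranking by per-sample loss with a criterion based on the loss trajectory, but it does not define the Dynamic Alignment Score. The sketch below therefore uses a stand-in: the Pearson correlation between each sample's loss trajectory and the mean trajectory. This is an assumption made for illustration and is not the authors' DAS. The loss values are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat trajectory carries no trend information
    return cov / math.sqrt(var_a * var_b)

def alignment_scores(trajectories):
    """Correlate each sample's loss trajectory with the mean trajectory."""
    epochs = len(next(iter(trajectories.values())))
    reference = [
        sum(t[e] for t in trajectories.values()) / len(trajectories)
        for e in range(epochs)
    ]
    return {name: pearson(t, reference) for name, t in trajectories.items()}

# Made-up per-epoch losses: three samples that are learned, one that is not.
losses = {
    "clean_easy": [1.2, 0.6, 0.3, 0.1],
    "clean_hard": [2.4, 2.0, 1.3, 0.7],
    "clean_mid":  [1.8, 1.1, 0.6, 0.3],
    "noisy":      [2.3, 2.4, 2.6, 2.7],
}

scores = alignment_scores(losses)
# A loss-only ranking would keep this sample first, since it has the highest loss.
by_last_loss = max(losses, key=lambda k: losses[k][-1])
# The trajectory-based stand-in flags this sample as the least aligned.
by_alignment = min(scores, key=scores.get)
print("highest final loss:", by_last_loss)
print("lowest alignment:  ", by_alignment)
```

In this toy data the mislabeled sample has the highest final loss, so a loss-only criterion would prioritize keeping it, while its rising trajectory correlates negatively with the mean and marks it as an outlier. The hard but clean sample keeps a high alignment score despite its elevated loss.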

185. ❌ Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling

Authors: Junqi Liu, Xinze Zhou, Wenxuan Li, Scott Ye, Arkadiusz Sitek, Xiaofeng Yang, Yucheng Tang, Daguang Xu, Kai Ding, Kang Wang, Yang Yang, Alan L. Yuille, Zongwei Zhou Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07329v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

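Scoring rationale: The paper focuses on medical image processing, specifically CT image quality enhancement, and uses deep learning techniques (a diffusion model and an autoencoder) to simulate and reverse acquisition artifacts. Its content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on). The only weakly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies AI to a scientific field (medical imaging), but it is not directly about bioinformatics or cheminformatics and does not involve large models, so it scores 5 (somewhat related). All other keywords score 0 (unrelated).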

!!! tip deepseek-chat TL;DR

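The study proposes SUMI, a simulated degradation-to-enhancement method that uses high-quality photon-counting CT as a reference to learn to reverse realistic acquisition artifacts in lower-quality routine CT, systematically distilling the benefits of an emerging imaging technology into routine CT and improving both image quality and downstream lesion detection.

Abstract translation (Chinese)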

光子计数CT（PCCT）相较于传统能量积分CT（EICT）能提供更优的图像质量，包括更高的空间分辨率和更低的噪声，但其有限的临床可用性制约了大规模研究和临床部署。为弥合这一差距，我们提出了SUMI方法——一种模拟退化到增强的学习方法，该方法以高质量PCCT作为参考，学习逆转低质量EICT中真实的采集伪影。我们的核心思路是显式建模真实的采集退化过程，将PCCT转化为临床可信的低质量对应图像，并学习逆转此过程。模拟退化过程的临床真实性已由认证放射科医师验证，从而无需大规模配对采集即可实现可靠的监督学习。基于此项技术贡献，我们实现了以下成果：（1）在1,046例PCCT数据上训练了一个潜在扩散模型，该模型使用预先在相同PCCT数据及来自145家医院的405,379例EICT数据上联合预训练的自编码器，以提取通用的CT潜在特征——我们将此特征公开发布以供其他生成式医学影像任务复用；（2）构建了一个大规模数据集，包含超过17,316例经增强至类PCCT质量的公开EICT图像，并由放射科医师逐体素标注了气道树、动脉、静脉、肺及肺叶结构；（3）实验证明该方法带来显著提升：在外部数据上，SUMI在结构相似性指数（SSIM）和峰值信噪比（PSNR）上分别超越当前最优图像翻译方法15%和20%；在读者研究中提高了放射科医师评定的临床效用；并显著提升了下游顶级病灶检测性能，将灵敏度最高提升15%，F1分数最高提升10%。我们的研究表明，通过以有限的高质量扫描作为参考，可将新兴影像技术进步系统性地转化至常规EICT中。

Abstract

Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale. As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.

Keywords: Photon-counting CT, Degradation modeling, Latent diffusion model, Image enhancement, Medical imaging, Clinical validation, Lesion detection, Autoencoder

186. ❌ Are Face Embeddings Compatible Across Deep Neural Network Models?

Authors: Fizza Rubab, Yiying Tong, Arun Ross Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07282v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

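Scoring rationale: The paper studies whether face embeddings from different deep neural network models (including foundation models) are compatible. It mainly involves foundation models and pre-training, so the two corresponding keywords are moderately relevant (5 each). The paper does not cover large language models (LLMs); the foundation models it studies are computer vision models, so the 'Large Language Models OR LLMs OR Foundation Models' keyword scores 5 rather than higher. Other keywords (such as MoE, SLMs, alignment, and reasoning) are not covered and score 0. The paper is applied AI research but does not specifically involve bioinformatics or other scientific fields, so 'AI for Science' scores 0.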

!!! tip deepseek-chat TL;DR

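The paper studies the compatibility of face embeddings across deep neural network models (including foundation models) and finds that simple affine transformations substantially improve cross-model face recognition, indicating representational convergence in how different models encode facial identity.

Abstract translation (Chinese)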

得益于可针对特定领域任务进行训练的深度神经网络（DNN）模型的空前发展，自动人脸识别在过去十年中取得了快速进展。与此同时，在广泛的视觉或视觉-语言任务上预训练的基础模型（foundation models）已在包括生物识别在内的多个领域展现出卓越的泛化能力。这引出了一个重要问题：尽管在不同数据集、损失函数和架构下训练，不同的DNN模型——无论是领域专用模型还是基础模型——是否以相似的方式编码人脸身份信息？为此，我们直接分析了不同DNN模型所生成的嵌入空间的几何结构。将人脸图像的嵌入视为点云，我们研究了简单的仿射变换是否能够将一个模型的人脸表征与另一个模型的对齐。我们的研究结果揭示了令人惊讶的跨模型兼容性：对于人脸识别（face identification）与验证（verification）任务，低容量线性映射相较于未对齐的基线，显著提升了跨模型人脸识别的性能。这种对齐模式在不同数据集间具有泛化性，并随模型家族呈现系统性变化，这表明了人脸身份编码在表征层面存在收敛性。这些发现对模型互操作性、集成设计以及生物特征模板安全具有重要意义。

Abstract

Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models, both domain-specific and foundation models, encode facial identity in similar ways, despite being trained on different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces imputed by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.

Keywords: face recognition, deep neural networks, foundation models, embedding compatibility, affine transformation, model interoperability, biometrics, representation alignment
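
The affine alignment described in the abstract can be illustrated with an ordinary least-squares fit. This is a minimal sketch, not the authors' code: the function names `fit_affine_map` and `cosine_similarity`, the embedding dimensions, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_affine_map(source, target):
    """Least-squares affine map (W, b) such that source @ W + b approximates target."""
    # Append a column of ones so the bias is estimated jointly with W.
    ones = np.ones((source.shape[0], 1))
    augmented = np.hstack([source, ones])
    solution, *_ = np.linalg.lstsq(augmented, target, rcond=None)
    return solution[:-1], solution[-1]

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two embedding matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-ins (assumption) for embeddings of the same 200 faces from two models.
emb_a = rng.normal(size=(200, 16))
true_w = rng.normal(size=(16, 8))
emb_b = emb_a @ true_w + 0.5 + 0.01 * rng.normal(size=(200, 8))

W, b = fit_affine_map(emb_a[:150], emb_b[:150])  # fit on a training split
aligned = emb_a[150:] @ W + b                    # map held-out faces into model B's space
mean_cos = cosine_similarity(aligned, emb_b[150:]).mean()
```

Fitting on one split and scoring on held-out faces mirrors the question the paper asks: whether a low-capacity map learned on some identities transfers to others.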

187. ❌ Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

Authors: Changkun Liu, Jiezhi Yang, Zeman Li, Yuan Deng, Jiancong Guo, Luca Ballan Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07279v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

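Scoring rationale: Mem3R focuses on streaming 3D reconstruction in computer vision and proposes a hybrid memory design to improve temporal consistency over long sequences. Although the work involves deep learning components (such as an MLP) and Test-Time Training, every keyword explicitly targets large language models (LLMs) and related techniques (alignment, reasoning, agents, and so on) or AI applications in specific scientific fields (such as bioinformatics). The paper has no direct connection to LLM techniques, innovations in large-model principles, or AI for Science applications, so the relevance of every keyword is 0.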

!!! tip deepseek-chat TL;DR

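The paper proposes Mem3R, a streaming 3D reconstruction model whose hybrid memory design decouples camera tracking from geometric mapping, significantly improving temporal consistency over long sequences while reducing the parameter count.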

Abstract

Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/

Keywords: Streaming 3D Reconstruction, Hybrid Memory, Test-Time Training, Camera Tracking, Geometric Mapping, Temporal Consistency, Multi-Layer Perceptron, Model Size Reduction
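
The idea of a fast-weight memory updated by Test-Time Training can be sketched as a fixed-size state that takes one gradient step per incoming observation. The sketch below swaps the paper's lightweight MLP for a single linear layer to keep the gradient explicit. The class name `FastWeightMemory`, the squared-error objective, and the learning rate are assumptions, not details from the abstract.

```python
import numpy as np

class FastWeightMemory:
    """Linear fast-weight memory (a simplification of an MLP memory)."""

    def __init__(self, dim, lr=0.5):
        self.W = np.zeros((dim, dim))  # fixed-size state, independent of sequence length
        self.lr = lr

    def read(self, key):
        return key @ self.W

    def update(self, key, value):
        """One gradient step on 0.5 * ||key @ W - value||^2; returns the pre-update loss."""
        error = self.read(key) - value
        self.W -= self.lr * np.outer(key, error)  # gradient of the loss w.r.t. W
        return 0.5 * float(error @ error)

rng = np.random.default_rng(0)
key = rng.normal(size=8)
key /= np.linalg.norm(key)  # unit-norm key keeps the step size well behaved
value = rng.normal(size=8)

memory = FastWeightMemory(dim=8)
losses = [memory.update(key, value) for _ in range(5)]
```

With a unit-norm key, each step shrinks the residual by a factor of `1 - lr`, so the loss falls geometrically while the state stays the same size.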

188. ❌ GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

Authors: Yiqian Wu, Rawal Khirodkar, Egor Zakharov, Timur Bagautdinov, Lei Xiao, Zhaoen Su, Shunsuke Saito, Xiaogang Jin, Junxuan Li Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07273v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

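Scoring rationale: GenLCA focuses on applying 3D diffusion models to full-body avatar generation and editing. Its core contributions are: (1) repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer that encodes unstructured video frames into structured 3D tokens; (2) a visibility-aware diffusion training strategy for handling partially observed data; and (3) training a flow-based diffusion model on the token dataset. The topic is 3D generative modeling in computer vision and graphics; it does not involve large language models (LLMs), innovations in deep learning principles, or scientific applications. All keywords relate to large language models, alignment, reasoning, agents, compression, and similar techniques, and have no direct connection to the paper's 3D diffusion model and avatar generation, so every keyword receives a relevance score of 0.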

!!! tip deepseek-chat TL;DR

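GenLCA is a diffusion-based generative model that uses a visibility-aware training strategy and a 3D tokenizer to generate and edit photorealistic full-body avatars from large-scale real-world video data, outperforming existing methods by a large margin.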

Abstract

We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.

Keywords: 3D diffusion model, full-body avatars, visibility-aware training, animatable 3D tokenizer, large-scale video data, photorealistic generation, avatar editing, diffusion-based generative model
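
The visibility-aware strategy in the abstract has two parts: invalid regions are replaced with a learnable token, and the loss is computed over valid regions only. The sketch below shows both with toy arrays. The function names, the mean-squared-error objective, and the token shapes are assumptions; the abstract does not specify the actual loss or token layout.

```python
import numpy as np

def apply_visibility_mask(tokens, valid, mask_token):
    """Replace tokens at invalid positions with a shared mask token."""
    return np.where(valid[:, None], tokens, mask_token[None, :])

def masked_mse(prediction, target, valid):
    """Mean squared error averaged over valid tokens only."""
    per_token = np.mean((prediction - target) ** 2, axis=1)
    return float(np.sum(per_token * valid) / max(int(valid.sum()), 1))

# Three toy 2-D tokens; the third stands for an unobserved body region.
tokens = np.array([[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]])
valid = np.array([True, True, False])
mask_token = np.zeros(2)  # would be a trained parameter in practice (assumption)

inputs = apply_visibility_mask(tokens, valid, mask_token)
prediction = np.array([[1.0, 1.0], [2.0, 4.0], [0.0, 0.0]])
loss = masked_mse(prediction, tokens, valid)
```

The large error on the third token does not affect `loss`, which is the point of restricting supervision to observed regions.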

189. ❌ BATON: A Multimodal Benchmark for Bidirectional Automation Transition Observation in Naturalistic Driving

Authors: Yuhang Wang, Yiyao Xu, Chaoyun Yang, Lingyao Li, Jingran Sun, Hao Zhou Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07263v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

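Scoring rationale: The paper studies a multimodal dataset and prediction tasks for control transitions in driving automation. It involves computer vision, sensor fusion, and driving behavior analysis, but does not touch the keyword topics of large models, deep learning principles, or AI for Science. All keywords relate directly to large-model techniques, training methods, inference optimization, alignment, compression, or scientific applications, whereas this paper concentrates on building a driving dataset and evaluating conventional machine learning methods, so every keyword has a relevance of 0.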

!!! tip deepseek-chat TL;DR

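The paper introduces BATON, a multimodal dataset for studying the prediction of control transitions in driving automation. It finds that visual input alone is insufficient, that adding vehicle signals and route context substantially improves prediction, and that takeover and handover events show different temporal dependencies.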

Abstract

Existing driving automation (DA) systems on production vehicles rely on human drivers to decide when to engage DA while requiring them to remain continuously attentive and ready to intervene. This design demands substantial situational judgment and imposes significant cognitive load, leading to steep learning curves, suboptimal user experience, and safety risks from both over-reliance and delayed takeover. Predicting when drivers hand over control to DA and when they take it back is therefore critical for designing proactive, context-aware HMI, yet existing datasets rarely capture the multimodal context, including road scene, driver state, vehicle dynamics, and route environment. To fill this gap, we introduce BATON, a large-scale naturalistic dataset capturing real-world DA usage across 127 drivers, and 136.6 hours of driving. The dataset synchronizes front-view video, in-cabin video, decoded CAN bus signals, radar-based lead-vehicle interaction, and GPS-derived route context, forming a closed-loop multimodal record around each control transition. We define three benchmark tasks: driving action understanding, handover prediction, and takeover prediction, and evaluate baselines spanning sequence models, classical classifiers, and zero-shot VLMs. Results show that visual input alone is insufficient for reliable transition prediction: front-view video captures road context but not driver state, while in-cabin video reflects driver readiness but not the external scene. Incorporating CAN and route-context signals substantially improves performance over video-only settings, indicating strong complementarity across modalities. We further find takeover events develop more gradually and benefit from longer prediction horizons, whereas handover events depend more on immediate contextual cues, revealing an asymmetry with direct implications for HMI design in assisted driving systems.

Keywords: driving automation, multimodal dataset, control transition, handover prediction, takeover prediction, naturalistic driving, HMI design, sensor fusion

190. ❌ Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

Authors: Icaro Re Depaolini, Uri Hasson Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07254v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

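Scoring rationale: The paper studies interpretability when deep neural networks predict human authenticity judgments, focusing on the robustness and consistency of attribution methods for vision models (such as Grad-CAM and LIME). Its topic is unrelated to most keywords (LLM, MoE, RLHF, RAG, and so on), because those keywords concern large language models, training techniques, and reasoning methods, whereas this paper concentrates on vision models and interpretability. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI': the paper's core is assessing the reliability of model explanation methods, which falls under explainable AI, though it is not an innovation in the large-model field, so it receives 10 points (highly relevant but not a core innovation). No other keyword is related.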

!!! tip deepseek-chat TL;DR

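The paper finds that although deep neural networks can effectively predict human judgments of image authenticity, the explanations (attribution maps) produced by different models lack consistency, so post hoc explanations from successful behavioral models should not be taken as strong evidence for cognitive mechanisms.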

Abstract

Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.

Keywords: deep neural networks, human authenticity judgments, attribution maps, Grad-CAM, LIME, model interpretability, explanation robustness, vision models
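
The within-architecture versus cross-architecture comparison in the abstract amounts to scoring how well two attribution maps for the same image agree. The sketch below uses Pearson correlation on flattened maps. The abstract does not name the agreement metric, so the metric, the function name `map_agreement`, and the synthetic maps are all assumptions.

```python
import numpy as np

def map_agreement(map_a, map_b):
    """Pearson correlation between two flattened attribution maps."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])

rng = np.random.default_rng(0)
# Synthetic 14x14 maps: a second seed of the same model is a noisy copy,
# while a different architecture is drawn independently.
base = rng.random((14, 14))
same_arch_other_seed = base + 0.05 * rng.normal(size=base.shape)
other_arch = rng.random((14, 14))

within = map_agreement(base, same_arch_other_seed)
across = map_agreement(base, other_arch)
```

In this toy setup `within` is high and `across` is near zero, which is the qualitative pattern the paper reports for real models with similar predictive accuracy.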

191. ❌ Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Authors: Yatong Lan, Rongkui Tang, Lei He Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07250v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

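Scoring rationale: The paper focuses on extrapolative novel view synthesis for autonomous driving and proposes Geo-EVS, a geometry-conditioned framework with two components: Geometry-Aware Reprojection and Artifact-Guided Latent Diffusion. The work belongs to computer vision and autonomous driving and involves 3D reconstruction, view synthesis, and point cloud processing. All scoring keywords relate to large language models, innovations in deep learning principles, or AI applications in science, and the paper has no direct connection to them, so every keyword receives a relevance score of 0.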

!!! tip deepseek-chat TL;DR

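The paper proposes Geo-EVS, a geometry-conditioned extrapolative view synthesis framework. It addresses the weak geometric support and missing dense target-view supervision that extrapolated poses cause when generating standardized virtual views from heterogeneous sensors in autonomous driving. On the Waymo dataset it improves sparse-view synthesis quality and geometric accuracy, and it also improves downstream 3D detection.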

Abstract

Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.

Keywords: Extrapolative View Synthesis, Autonomous Driving, Geometry-Conditioned, Sparse Supervision, LiDAR-Projected Sparse-Reference, 3D Detection, Novel View Synthesis, Point Cloud Reconstruction
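
Reprojecting a colored point cloud to a target pose, as Geometry-Aware Reprojection does, can be sketched with a pinhole camera and a z-buffer. This is an illustrative assumption rather than the paper's pipeline: the function name `reproject`, the camera model, and the choice to treat unhit pixels as the "missing support" mask are hypothetical.

```python
import numpy as np

def reproject(points_world, colors, rotation, translation, intrinsics, height, width):
    """Project world-frame points into a camera; return an image and a hole mask."""
    cam = points_world @ rotation.T + translation  # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                    # drop points behind the camera
    cam, colors = cam[in_front], colors[in_front]
    pix = cam @ intrinsics.T                       # homogeneous pixel coordinates
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    image = np.zeros((height, width, 3))
    depth = np.full((height, width), np.inf)
    for ui, vi, zi, ci in zip(u[inside], v[inside], cam[inside, 2], colors[inside]):
        if zi < depth[vi, ui]:                     # z-buffer: keep the nearest point
            depth[vi, ui] = zi
            image[vi, ui] = ci
    return image, ~np.isfinite(depth)              # mask is True where no point landed

intrinsics = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 4.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0], [0.2, 0.0, 2.0]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
image, hole_mask = reproject(points, colors, np.eye(3), np.zeros(3), intrinsics, 8, 8)
```

Because the same function serves observed and virtual poses, the holes a model sees at training time resemble those at inference time, which is the motivation the abstract gives for unifying the reprojection path.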

192. ❌ TurPy: a physics-based and differentiable optical turbulence simulator for algorithmic development and system optimization

Authors: Joseph L. Greene, Alfred Moore, Iris Ochoa, Emily Kwan, Patrick Marano, Christopher R. Valenta Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07248v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

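Scoring rationale: TurPy focuses on developing an optical turbulence simulator for optical system design and optimization, which belongs to computational physics and optical engineering. The keywords all relate directly to large models, deep learning principles, or AI applications, and the paper does not address those topics. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves scientific computing and simulation, but AI is not its core method, so it receives 5 points (some relevance). The other keywords are unrelated to the paper and receive 0 points.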

!!! tip deepseek-chat TL;DR

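The paper develops TurPy, a physics-based, differentiable, GPU-accelerated optical turbulence simulator for high-fidelity simulation and end-to-end optical system optimization. By optimizing a dual-domain diffractive deep neural network to recover a Gaussian beam from a weakly turbulent path, it achieves an over 20x reduction in scintillation.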

Abstract

Developing optical systems for free-space applications requires simulation tools that accurately capture turbulence-induced wavefront distortions and support gradient-based optimization. Here we introduce TurPy, a GPU-accelerated, fully differentiable wave optics turbulence simulator to bridge high fidelity simulation with end-to-end optical system design. TurPy incorporates subharmonic phase screen generation, autoregressive temporal evolution, and an automated screen placement routine balancing Fourier aliasing constraints and weak-turbulence approximations into a unified, user-ready framework. Because TurPy’s phase screen generation is parameterized through a media-specific power spectral density, the framework extends to atmospheric, oceanic, and biological propagation environments with minimal modification. We validate TurPy against established atmospheric turbulence theory by matching 2nd order Gaussian beam broadening and 4th order plane wave scintillation to closed-form models with 98% accuracy across weak to strong turbulence regimes, requiring only the medium’s refractive index structure constant and power spectral density as inputs. To demonstrate TurPy as a gradient-based training platform, we optimize a dual-domain diffractive deep neural network (D2NN) in a two-mask dual-domain architecture to recover a Gaussian beam from a weakly turbulent path and achieving over 20x reduction in scintillation relative to an uncompensated receiver in simulation. TurPy is released as an open-source package to support synthetic data generation, turbulence-informed algorithm development, and the end-to-end design of optical platforms operating in turbulent environments.

Keywords: optical turbulence simulator, differentiable wave optics, GPU-accelerated, phase screen generation, gradient-based optimization, diffractive deep neural network, atmospheric turbulence, end-to-end optical system design
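
The scintillation reduction quoted in the abstract is a ratio of scintillation indices. The sketch below uses the conventional definition, the normalized variance of received intensity, which is assumed here because the abstract does not state the exact estimator. It does not use the TurPy API, and the intensity samples are invented for illustration rather than taken from the paper.

```python
import numpy as np

def scintillation_index(intensity):
    """Normalized intensity variance: <I^2> / <I>^2 - 1."""
    intensity = np.asarray(intensity, dtype=float)
    mean = intensity.mean()
    return float(np.mean(intensity ** 2) / mean ** 2 - 1.0)

# Hypothetical received-intensity samples, before and after compensation.
uncompensated = np.array([0.2, 1.8, 0.5, 1.5])
compensated = np.array([0.9, 1.1, 0.95, 1.05])

si_before = scintillation_index(uncompensated)
si_after = scintillation_index(compensated)
reduction = si_before / si_after  # a value above 1 means less scintillation
```

A constant intensity gives an index of zero, so a compensating optic is judged by how far it flattens the fluctuations relative to the mean.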

193. ❌ PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

Authors: Ruihang Xu, Dewei Zhou, Xiaolong Shen, Fan Ma, Yi Yang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07230v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

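Scoring rationale: PhyEdit focuses on image editing in computer vision, specifically physically accurate object manipulation through 3D geometric simulation. Although the abstract mentions "interactive world models", this refers to visual world models rather than the general world models associated with large language models. The paper's core techniques are visual generative models, 3D geometric simulation, and 2D-3D supervision, which are largely unrelated to the supplied keywords (mainly large language model techniques, training methods, inference optimization, and agent systems). Only "World Models AND General World Models" receives 5 points (some relevance) because the abstract mentions "world models"; no other keyword is addressed.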

!!! tip deepseek-chat TL;DR

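The paper addresses the shortcomings of existing image editing methods in physically accurate object manipulation. It proposes the PhyEdit framework, which combines 3D geometric simulation with 2D-3D supervision and significantly improves 3D geometric accuracy and manipulation consistency.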

Abstract

Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D–3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

Keywords: image editing, object manipulation, 3D geometry, physical accuracy, visual generative models, geometric simulation, RealManip-10K dataset, ManipEval benchmark

194. ❌ VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

Authors: Jian Yu, Fei Shen, Cong Wang, Yi Xin, Si Shen, Xiaoyu Du, Jinhui Tang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07210v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

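Scoring rationale: The paper studies diffusion-based fashion image generation and is unrelated to most of the LLM-oriented keywords. It does explicitly use two relevant techniques: (1) a Mixture of Experts (MoE) mechanism for feature routing in the trait-routing attention module, scored 10; and (2) Direct Preference Optimization (DPO) for preference optimization in the MPO pipeline, scored 10. Other keywords, such as LLMs, SLMs, Scaling Laws, PEFT, and RAG, are not addressed and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes VersaVogue, a framework that uses a mixture-of-experts mechanism and direct preference optimization to address attribute entanglement and semantic interference in multi-condition controllable fashion image generation, achieving better visual fidelity and controllability on garment generation and virtual dressing tasks.

Abstract translation (Chinese)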

扩散模型在时尚图像生成领域取得了显著进展，但先前研究通常将服装生成与虚拟试穿视为独立问题，限制了其在真实时尚工作流程中的灵活性。此外，多源异构条件下的时尚图像合成仍具挑战性，现有方法多依赖简单的特征拼接或静态分层注入机制，常导致属性纠缠与语义干扰。为解决这些问题，我们提出VersaVogue——一个支持多条件可控时尚合成的统一框架，可同时处理服装生成与虚拟试穿任务，对应时尚生命周期中的设计与展示阶段。具体而言，我们设计了特征路由注意力（Trait-routing Attention, TA）模块，该模块通过专家混合机制动态地将条件特征路由至最兼容的专家组件及生成层，实现对纹理、廓形、色彩等视觉属性的解耦注入。为提升真实感与可控性，我们开发了自动化多视角偏好优化（Multi-perspective Preference Optimization, MPO）流程，该流程无需人工标注或任务特定奖励模型即可构建偏好数据。通过综合内容保真度、文本对齐度与感知质量的评估体系，MPO筛选出可靠的偏好配对数据，进而通过直接偏好优化（Direct Preference Optimization, DPO）方法训练模型。在服装生成与虚拟试穿基准上的大量实验表明，VersaVogue在视觉保真度、语义一致性与细粒度可控性方面均优于现有方法。

Abstract

Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO). Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.

Keywords: fashion image generation, diffusion models, mixture-of-experts, direct preference optimization, multi-condition controllable synthesis, virtual dressing, trait-routing attention, preference alignment
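
The DPO step mentioned in the abstract can be illustrated with the original sequence-level DPO objective. This is a minimal sketch, not VersaVogue's implementation: the function name `dpo_loss`, the scalar log-probability inputs, and `beta=0.1` are assumptions made for illustration, and a diffusion model such as the one described here would likely use a diffusion-specific variant of the loss.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    The margin compares how much the policy raised the log-probability of the
    preferred sample versus the rejected one, relative to a frozen reference.
    """
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical log-probabilities for one preference pair.
loss_at_init = dpo_loss(-2.0, -2.0, -2.0, -2.0)    # policy == reference
loss_improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy prefers the chosen sample
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss drops as the policy shifts probability toward the preferred sample.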

195. ❌ INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

Authors: InSpatio Team, Donghui Shen, Guofeng Zhang, Haomin Liu, Haoyu Ji, Hujun Bao, Hongjia Zhai, Jialin Liu, Jing Guo, Nan Wang, Siji Pan, Weihong Pan, Weijian Xie, Xianbin Liu, Xiaojun Xiang, Xiaoyu Zhang, Xinyu Chen, Yifu Wang, Yipeng Chen, Zhenzhou Fan, Zhewen Le, Zhichao Ye, Ziqiang Zhao Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07209v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

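Scoring rationale: INSPATIO-WORLD is a computer-vision 4D world simulator that generates interactive scenes in real time through spatiotemporal autoregressive modeling. Its core goal is a world model with spatial consistency and real-time interactivity, which makes it highly relevant to "World Models AND General World Models" (10 points), since the paper explicitly proposes a method for building world models. It does not, however, involve large language models (LLMs), innovations in the underlying principles of deep learning techniques, or applications of large models to other domains, so every other keyword is rated as unrelated (0 points). The paper belongs to computer vision and graphics rather than research on large models or deep learning techniques.

!!! tip deepseek-chat TL;DR

The paper tackles the challenge of building a 4D world model with spatial consistency and real-time interactivity. It proposes the INSPATIO-WORLD framework, which uses a spatiotemporal autoregressive architecture and Joint Distribution Matching Distillation to recover and generate high-fidelity dynamic interactive scenes from a single reference video, and it achieves state-of-the-art results on the WorldScore-Dynamic benchmark, ranking first among real-time interactive methods.

Abstract translation (Chinese)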

构建具有空间一致性与实时交互性的世界模型，始终是计算机视觉领域的核心挑战。当前视频生成范式常受限于空间持续性不足与视觉真实感欠缺，难以支撑复杂环境中的无缝导航。为解决这些问题，我们提出INSPATIO-WORLD——一种能够从单段参考视频中恢复并生成高保真动态交互场景的新型实时框架。该方法的核心是时空自回归（Spatiotemporal Autoregressive, STAR）架构，其通过两个紧密耦合的组件实现一致且可控的场景演化：隐式时空缓存（Implicit Spatiotemporal Cache）将参考信息与历史观测聚合为潜在世界表征，确保长时程导航中的全局一致性；显式空间约束模块（Explicit Spatial Constraint Module）强化几何结构，并将用户交互转化为精确且物理合理的相机轨迹。此外，我们提出联合分布匹配蒸馏（Joint Distribution Matching Distillation, JDMD）。该方法以真实世界数据分布作为正则化引导，有效克服了因过度依赖合成数据而导致的保真度下降问题。大量实验表明，INSPATIO-WORLD在空间一致性与交互精确度上显著优于现有先进（state-of-the-art, SOTA）模型，在WorldScore-Dynamic基准测试的实时交互方法中位列第一，并为基于单目视频重建的四维环境导航建立了实用化流程。

Abstract

Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data. Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

Keywords: world models, 4D world simulator, spatiotemporal autoregressive modeling, real-time interactivity, spatial consistency, dynamic interactive scenes, joint distribution matching distillation, monocular video reconstruction

196. ❌ BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

Authors: Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Abdelrahman Abdallah, Hyun-Soo Kang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

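Scoring rationale: BRIDGE is a system that addresses query mismatch in multimodal-to-text retrieval with a query alignment model trained via reinforcement learning (FORGE) and a reasoning-enhanced dense retriever (LENS). Its core contributions are reinforcement-learned query alignment, rated highly relevant to RLHF/RLAIF/DPO, and retrieval techniques rated under Retrieval-Augmented Generation (RAG). The paper also touches on large-model applications (LLMs), supervised fine-tuning (SFT), alignment, and chain-of-thought reasoning (CoT Reasoning), but not on other techniques such as MoE, quantization, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper addresses query mismatch in multimodal-to-text retrieval with a query alignment model trained via reinforcement learning and a reasoning-enhanced retriever. On MM-BRIGHT, BRIDGE outperforms all multimodal encoder baselines (29.7 vs. 27.6 nDCG@10), and FORGE applied on top of Nomic-Vision (33.3) also exceeds the best text-only retriever (32.2).

Abstract translation (Chinese)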

多模态检索系统在处理针对纯文本语料库的图文查询时面临困难：在MM-BRIGHT数据集上，最优的视觉-语言编码器仅达到27.6 nDCG@10，表现逊于强大的纯文本检索器。我们认为瓶颈不在于检索器而在于查询本身：原始多模态查询以系统性地降低嵌入相似度的方式，混杂了视觉描述、对话噪声和检索意图。我们提出BRIDGE，一个无需多模态编码器即可解决此不匹配问题的双组件系统。FORGE（Focused Retrieval Query Generator，聚焦检索查询生成器）是一个通过强化学习训练的查询对齐模型，能将嘈杂的多模态查询提炼为紧凑且检索优化的搜索字符串。LENS（Language-Enhanced Neural Search，语言增强神经检索器）是一个基于推理密集型检索数据微调的推理增强稠密检索器，用于处理FORGE生成的富含意图的查询。在MM-BRIGHT数据集（2,803条查询，29个领域）上评估，BRIDGE达到29.7 nDCG@10，超越了包括Nomic-Vision（27.6）在内的所有多模态编码器基线。当FORGE作为即插即用的对齐器应用于Nomic-Vision之上时，组合系统达到33.3 nDCG@10，超过了最佳纯文本检索器（32.2），证明查询对齐是多模态到文本检索的关键瓶颈。https://github.com/mm-bright/multimodal-reasoning-retrieval

Abstract

Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves 29.7 nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches 33.3 nDCG@10, exceeding the best text-only retriever (32.2) and demonstrating that query alignment is the key bottleneck in multimodal-to-text retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval

Keywords: multimodal retrieval, reinforcement learning, query alignment, dense retriever, reasoning-enhanced, text-only corpora, MM-BRIGHT, retrieval-optimized
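
The results in this entry are reported as nDCG@10. As a reminder of what the metric measures, here is a minimal sketch using linear gain and a log2 rank discount. The function names are illustrative, the ideal ranking is computed only from the relevance labels passed in (an assumption that every judged document appears in the list), and the abstract's values appear to be scaled by 100.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# Hypothetical relevance labels in retrieved order (1 = relevant, 0 = not).
perfect = ndcg_at_k([1, 1, 0, 0])
demoted = ndcg_at_k([0, 1, 0, 1])
```

A ranking that places relevant documents lower receives a smaller score, because each rank's gain is divided by a growing logarithmic discount.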

197. ❌ Multiple Domain Generalization Using Category Information Independent of Domain Differences

Authors: Reiji Saito, Kazuhiro Hotta Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07175v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

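Scoring rationale: The paper studies domain generalization for medical image segmentation (blood vessels and cell nuclei), placing it in computer vision and medical image analysis. Its core idea is to separate category information from domain-difference information and to absorb the remaining domain gap with SQ-VAE. Relevance to the scoring keywords: (1) "Pre-training OR Continual Pre-training OR Domain Adaptation" scores 5: domain generalization is related to Domain Adaptation, but the paper does not focus on pre-training or continual pre-training. (2) "AI for Science OR Bioinformatics OR Cheminformatics" scores 5: the method is applied to biomedical image segmentation (blood vessels and cell nuclei), a bioinformatics-related application. (3) All other keywords (LLMs, MoE, RLHF, and so on) score 0: the paper does not involve large language models or related techniques, and relies mainly on conventional deep learning methods and a variational autoencoder.

!!! tip deepseek-chat TL;DR

The paper proposes a domain generalization method that separates category information from domain-difference information and uses SQ-VAE to absorb the domain gap, improving accuracy on blood vessel and cell nucleus segmentation.

Abstract translation (Chinese)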

领域泛化是一种旨在使模型在应用于与训练所用数据集（源域）不同的新环境或数据集（未见域）时仍能保持高准确性的技术。通常，在特定数据集（源域）上训练的模型，当在不同数据集（目标域）上评估时，其准确性往往会显著下降。这一问题的产生源于由成像设备、染色方法等不同环境条件所导致的领域差异。因此，我们采取了两项举措来执行不依赖于领域差异的分割。我们提出了一种方法，该方法将与领域差异无关的类别信息从源域特有的信息中分离出来。通过使用与领域差异无关的信息，我们的方法能够学习分割目标（例如血管和细胞核）。尽管我们提取了领域差异的独立信息，但这并不能完全弥合训练数据与测试数据之间的领域差距。因此，我们利用随机量化变分自编码器（SQ-VAE）中的量子向量来吸收这一领域差距。在实验中，我们在血管分割和细胞核分割的数据集上评估了我们的方法。与传统方法相比，我们的方法提高了准确性。

Abstract

Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. We propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Although we extract independent information of domain differences, this cannot completely bridge the domain gap between training and test data. Therefore, we absorb the domain gap using the quantum vectors in Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our methods improved the accuracy compared to conventional methods.

Keywords: Domain Generalization, Segmentation, Medical Image Analysis, Vascular Segmentation, Cell Nucleus Segmentation, SQ-VAE, Domain Adaptation, Independent Information Extraction

198. ❌ DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

Authors: Robert Zimmermann, Thomas Norrenbrock, Bodo Rosenhahn Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07166v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

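Scoring rationale: The paper focuses on adapting a visual foundation model (DINOv2) and improving its interpretability, which places it in foundation-model applications and explainable AI. Core relevant keywords: (1) "Foundation Models" (8 points): it directly studies adaptation of the visual foundation model DINOv2; (2) "PEFT" (8 points): it proposes a lightweight adapter, DINO-QPM, which falls under parameter-efficient fine-tuning; (3) "Explainable AI" (10 points): its central goal is globally interpretable image classification, and it introduces an interpretability metric. "Pre-training" (5 points) is weakly relevant because the method builds on pre-trained foundation-model features. The remaining keywords (LLM techniques, AI for Science applications, and so on) are unrelated to the paper's topic of interpretability for vision models.

!!! tip deepseek-chat TL;DR

To address the complexity and poor interpretability of features from the visual foundation model DINOv2, the paper proposes DINO-QPM, a lightweight adapter that uses contrastive representations and a sparsity loss to achieve globally interpretable image classification, improving explanation quality while maintaining high accuracy.

Abstract translation (Chinese)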

尽管DINOv2等视觉基础模型作为特征提取器提供了最先进的性能，但其复杂的高维表征为可解释性带来了显著障碍。本研究提出DINO-QPM方法，将这些强大但纠缠的特征转化为人类可理解的、与类别无关的对比表征。DINO-QPM是一种轻量级可解释性适配器，致力于实现全局可解释的图像分类，通过改造二次规划增强模型（Quadratic Programming Enhanced Model, QPM）使其能在严格冻结的DINO骨干网络上运行。虽然视觉基础模型的分类通常依赖CLS标记，但我们有意突破这一标准范式。通过采用平均池化操作，我们将图像块嵌入直接与模型特征连接，从而在输入空间中实现DINO-QPM全局可解释特征的空间定位。此外，我们引入稀疏性损失以最小化空间分散和背景噪声，确保解释结果锚定在相关物体部件上。DINO-QPM使QPM的可解释性层级能够以适配器形式实现，同时其分类精度超越了DINOv2线性探针方法。通过提出的合理性度量（Plausibility metric）及其他可解释性指标进行评估，大量实验证明DINO-QPM在分类精度和解释质量上均优于其他适用于冻结视觉基础模型的方法。

Abstract

Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model’s features and therefore enable spatial localisation of DINO-QPM’s globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.

Keywords: visual foundation models, interpretability, DINOv2, adapter, image classification, spatial localization, explainable AI, feature extraction
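
One way to see why average pooling permits spatial localisation: a linear feature computed on mean-pooled patch embeddings is exactly the mean of that feature computed per patch, so each patch has a well-defined contribution. The sketch below illustrates only this algebraic identity. It is not the DINO-QPM architecture, and the patch values, the weight vector `w`, and the helper names are made up for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_pool(patches):
    """Average patch embeddings dimension by dimension."""
    n = len(patches)
    return [sum(p[d] for p in patches) / n for d in range(len(patches[0]))]

# Four hypothetical 2-D patch embeddings and one linear feature direction.
patches = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0]]
w = [0.5, -1.0]

# Feature on the pooled embedding ...
pooled_feature = dot(w, mean_pool(patches))
# ... equals the sum of per-patch contributions, which form a spatial map.
patch_map = [dot(w, p) / len(patches) for p in patches]
```

The entries of `patch_map` show which image patches push the feature up or down; a feature read from a CLS token offers no such direct decomposition.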

199. ❌ Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

Authors: Zhuohong Chen, Zhenxian Wu, Yunyao Yu, Hangrui Xu, Zirui Liao, Zhifang Liu, Xiangwen Deng, Pen Jiao, Haoqian Wang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07146v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

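Scoring rationale: The paper proposes a search-agent approach to KB-VQA that models problem solving as a multi-step decision process, and it is highly relevant to several keywords: (1) it is built around retrieval-augmented generation (RAG) (10 points); (2) it constructs an LLM agent system that makes decisions autonomously (10 points); (3) it involves tool use, including image retrieval and text retrieval (8 points); (4) it applies supervised fine-tuning (SFT) (8 points); (5) it involves multi-step reasoning (8 points); (6) it is built on large language models / foundation models (8 points). Other keywords, such as MoE, quantization, and alignment, are not addressed (0 points).

!!! tip deepseek-chat TL;DR

Conventional retrieval-augmented generation methods for knowledge-based visual question answering adapt poorly to diverse question types and separate retrieval from reasoning. The paper proposes a search-agent framework that models problem solving as a multi-step decision process and fine-tunes on automatically collected multi-step trajectories, achieving state-of-the-art performance on InfoSeek and E-VQA.

Abstract translation (Chinese)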

基于知识的视觉问答（Knowledge-based Visual Question Answering, KB-VQA）要求视觉语言模型理解图像并利用外部知识，尤其针对罕见实体和长尾事实。现有的大多数检索增强生成（Retrieval-Augmented Generation, RAG）方法采用固定的流程，依次进行信息检索、过滤并生成答案。这种设计难以适应多样化的问题类型。此外，它将检索与推理分离，使模型难以决定何时搜索、如何优化查询或何时停止检索，导致检索到的证据往往与问题匹配不佳。为应对这些局限，我们将KB-VQA重新定义为搜索智能体问题，并将其求解过程建模为多步决策流程。在每一步中，智能体根据当前信息状态从四种行动（回答、图像检索、文本检索、图像描述）中选择其一。我们进一步设计了一个自动化流程来收集多步轨迹，这些轨迹记录了智能体的推理过程、工具使用和中间决策。这些轨迹随后被用作微调的监督数据。在InfoSeek和E-VQA数据集上的实验表明，我们的方法取得了最先进的性能，持续优于现有基线，验证了该框架的有效性。

Abstract

Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, and Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent’s reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.

Keywords: Knowledge-based Visual Question Answering, Retrieval-Augmented Generation, Search Agent, Multi-step Decision Making, Tool Usage, Supervised Fine-tuning, State-of-the-art Performance
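
The multi-step decision procedure described in the abstract can be sketched as a loop in which a policy picks one of the four actions until it answers or runs out of steps. This is an illustrative skeleton under stated assumptions, not the authors' system: `run_search_agent`, the `policy` and `tools` interfaces, the step budget, and the scripted toy policy are all hypothetical stand-ins for a fine-tuned vision-language model and real retrieval tools.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # (kind, content) entry in the trajectory
ACTIONS = ("answer", "image_retrieval", "text_retrieval", "caption")

def run_search_agent(
    question: str,
    policy: Callable[[List[Step]], Tuple[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 6,
) -> Tuple[Optional[str], List[Step]]:
    """Run the decision loop and return (answer, recorded trajectory)."""
    trajectory: List[Step] = [("question", question)]
    for _ in range(max_steps):
        action, argument = policy(trajectory)  # decide from the current state
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        if action == "answer":
            trajectory.append(("answer", argument))
            return argument, trajectory
        # Any other action calls a tool and stores its observation.
        trajectory.append((action, tools[action](argument)))
    return None, trajectory  # step budget exhausted without an answer

# Toy stand-ins: a scripted policy and canned tool outputs.
def scripted_policy(state: List[Step]) -> Tuple[str, str]:
    seen = [kind for kind, _ in state]
    if "caption" not in seen:
        return "caption", "image.jpg"
    if "text_retrieval" not in seen:
        return "text_retrieval", state[-1][1]  # query with the caption
    return "answer", "answer based on: " + state[-1][1]

toy_tools = {
    "caption": lambda image: "a red bridge",
    "text_retrieval": lambda query: "doc about " + query,
    "image_retrieval": lambda image: "similar image id",
}

answer, trajectory = run_search_agent("Which bridge is this?", scripted_policy, toy_tools)
```

The returned `trajectory` records each action and observation in order, which is the kind of multi-step record the abstract describes using as fine-tuning supervision.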

200. ❌ An RTK-SLAM Dataset for Absolute Accuracy Evaluation in GNSS-Degraded Environments

Authors: Wei Zhang, Vincent Ress, David Skuddis, Uwe Soergel, Norbert Haala Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

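Scoring rationale: The paper studies absolute accuracy evaluation for RTK-SLAM systems and belongs to robotics, surveying, and localization. It is unrelated to every scoring keyword, all of which concern large models, the principles of deep learning techniques, and their applications. It does not involve language models, model training, inference optimization, alignment, agent systems, or AI for Science.

!!! tip deepseek-chat TL;DR

The paper shows that the standard SE(3) alignment used to evaluate RTK-SLAM systems underestimates absolute positioning error, and it presents a dataset with GNSS-independent ground truth together with an evaluation methodology. The results indicate that RTK-SLAM reaches centimeter-level absolute accuracy under open sky and maintains decimeter-level accuracy indoors, where standalone RTK degrades to errors of tens of meters.

Abstract translation (Chinese)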

RTK-SLAM系统将即时定位与地图构建（SLAM）和实时动态（RTK）GNSS定位技术相结合，有望在实现相对一致性的同时提供全局参考坐标，从而支持高效的地理参考测量。一个关键但未被充分重视的问题是，标准评估指标——绝对轨迹误差（ATE）——在计算误差前，会先在估计轨迹与参考轨迹之间拟合一个最优刚体变换。这种所谓的SE(3)对齐过程会吸收全局漂移和系统误差，使得轨迹在实际中显得比真实情况更精确，因此不适用于评估RTK-SLAM的全局精度。本文提出了一个大地测量参考数据集及评估方法，以揭示这一缺陷。其核心设计原则是：RTK接收仪仅作为系统输入使用，而真值则通过大地测量全站仪独立建立。现有数据集中普遍缺少这种分离，因为GNSS通常被用作（或部分作为）真值。该数据集使用手持式RTK-SLAM设备采集，包含两个场景。我们评估了激光雷达-惯性、视觉-惯性以及激光雷达-视觉-惯性RTK-SLAM系统，并与独立RTK进行了对比，通过报告直接的全局精度和经SE(3)对齐的相对精度，明确揭示了二者间的差距。结果表明，SE(3)对齐可能使绝对定位误差被低估高达76%。在开阔天空条件下，RTK-SLAM可实现厘米级绝对精度；在室内环境中，即使独立RTK性能下降至数十米量级，RTK-SLAM仍能保持分米级全局精度。数据集、校准文件和评估脚本已公开于https://rtk-slam-dataset.github.io/。

Abstract

RTK-SLAM systems integrate simultaneous localization and mapping (SLAM) with real-time kinematic (RTK) GNSS positioning, promising both relative consistency and globally referenced coordinates for efficient georeferenced surveying. A critical and underappreciated issue is that the standard evaluation metric, Absolute Trajectory Error (ATE), first fits an optimal rigid-body transformation between the estimated trajectory and reference before computing errors. This so-called SE(3) alignment absorbs global drift and systematic errors, making trajectories appear more accurate than they are in practice, and is unsuitable for evaluating the global accuracy of RTK-SLAM. We present a geodetically referenced dataset and evaluation methodology that expose this gap. A key design principle is that the RTK receiver is used solely as a system input, while ground truth is established independently via a geodetic total station. This separation is absent from all existing datasets, where GNSS typically serves as (part of) the ground truth. The dataset is collected with a handheld RTK-SLAM device, comprising two scenes. We evaluate LiDAR-inertial, visual-inertial, and LiDAR-visual-inertial RTK-SLAM systems alongside standalone RTK, reporting direct global accuracy and SE(3)-aligned relative accuracy to make the gap explicit. Results show that SE(3) alignment can underestimate absolute positioning error by up to 76%. RTK-SLAM achieves centimeter-level absolute accuracy in open-sky conditions and maintains decimeter-level global accuracy indoors, where standalone RTK degrades to tens of meters. The dataset, calibration files, and evaluation scripts are publicly available at https://rtk-slam-dataset.github.io/.

关键词: RTK-SLAM, Absolute accuracy evaluation, GNSS-degraded environments, Geodetic total station, SE(3) alignment, Global positioning error, LiDAR-inertial, Visual-inertial
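
To illustrate why rigid alignment can hide global error, here is a minimal sketch. It is not the dataset's evaluation script: it uses a 2D analogue of SE(3) alignment (rotation plus translation) so that it runs on the standard library alone. The toy trajectory, the 5-degree rotation, and the offset are assumed values chosen purely for illustration.

```python
import math

def rmse(est, ref):
    """Root-mean-square position error between paired 2D points."""
    sq = [(ex - rx) ** 2 + (ey - ry) ** 2
          for (ex, ey), (rx, ry) in zip(est, ref)]
    return math.sqrt(sum(sq) / len(sq))

def align_rigid_2d(est, ref):
    """Least-squares rotation + translation that maps est onto ref."""
    n = len(est)
    cex = sum(p[0] for p in est) / n
    cey = sum(p[1] for p in est) / n
    crx = sum(p[0] for p in ref) / n
    cry = sum(p[1] for p in ref) / n
    dot = cross = 0.0
    for (ex, ey), (rx, ry) in zip(est, ref):
        ax, ay = ex - cex, ey - cey   # centered estimate
        bx, by = rx - crx, ry - cry   # centered reference
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)    # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (x - cex) - s * (y - cey) + crx,
             s * (x - cex) + c * (y - cey) + cry) for x, y in est]

# Toy data (assumed): the reference is a straight line; the estimate has the
# same shape but is rotated by 5 degrees and shifted, i.e. a purely global error.
ref = [(float(i), 0.0) for i in range(4)]
a = math.radians(5.0)
est = [(x * math.cos(a) - y * math.sin(a) + 0.5,
        x * math.sin(a) + y * math.cos(a) + 0.3) for x, y in ref]

direct_error = rmse(est, ref)                        # global accuracy
aligned_error = rmse(align_rigid_2d(est, ref), ref)  # after rigid alignment
print(f"direct RMSE:  {direct_error:.3f}")
print(f"aligned RMSE: {aligned_error:.3f}")
```

In this toy case the aligned error collapses to essentially zero even though every estimated point is more than half a unit away from its true position, which is the kind of gap the abstract describes.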

201. ❌ USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

作者: Changmiao Wang, Songqi Zhang, Yongquan Zhang, Yifei Wang, Liya Liu, Nannan Li, Xingzhi Li, Jiexin Pan, Yi Jiang, Xiang Wan, Hai Wang, Ahmed Elazab 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07141v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

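Scoring rationale: The paper focuses on multimodal fusion of medical imaging (CT) and clinical data (EHR) for kidney stone classification. It uses a Transformer architecture, but as a vision Transformer (ViT) application in computer vision rather than as a large language model (LLM). None of the LLM-oriented keywords (LLM, MoE, Scaling Laws, Alignment, RAG, Agents, and so on) apply, because the paper does not address language models, large-model techniques, or general AI methods. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper applies AI to biomedicine (urology). Its novelty lies in multimodal fusion and the dynamic loss design rather than in large-model techniques, hence a relevance of 8 (related, but not central).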

!!! tip deepseek-chat TL;DR

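The study proposes USCNet, a Transformer-based multimodal fusion network that combines CT images with electronic health records to classify kidney stones preoperatively, and reports that it significantly outperforms existing methods.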

摘要翻译

肾结石疾病是泌尿外科最常见的病症之一，理解结石成分对于制定个体化治疗方案和预防复发至关重要。目前分析肾结石的方法依赖于术后标本，这阻碍了术前的快速分类。为克服这一局限，我们提出了一种称为尿路结石分割与分类网络（Urinary Stone Segmentation and Classification Network, USCNet）的新方法。该创新技术通过整合计算机断层扫描（Computed Tomography, CT）图像与电子健康记录（Electronic Health Records, EHR）中的临床数据，实现了肾结石的精准术前分类。USCNet采用基于Transformer的多模态融合框架，结合CT-EHR注意力机制和分割引导注意力模块以达成精确分类。此外，本研究引入了动态损失函数，以有效平衡分割与分类的双重目标。在自建肾结石数据集上的实验表明，USCNet在所有评估指标上均表现出卓越性能，其分类效果显著超越现有主流方法。本研究为肾结石的精准术前分类提供了具有前景的解决方案，具有重要的临床价值。源代码已公开：https://github.com/ZhangSongqi0506/KidneyStone。

摘要 (Abstract)

Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/ZhangSongqi0506/KidneyStone.

关键词: Kidney stone classification, Transformer-based multimodal fusion, CT images, Electronic Health Records (EHR), Segmentation-guided attention, Preoperative classification, Dynamic loss function, Urolithiasis

202. ❌ Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator

作者: Takahiro Mano, Reiji Saito, Kazuhiro Hotta 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07122v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

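Scoring rationale: The paper studies semi-supervised semantic segmentation in computer vision, evaluated on medical images (the Chase and COVID-19 datasets), and proposes an improved ClassMix together with a feature discriminator. The content is confined to semi-supervised learning techniques in conventional computer vision and deep learning. It does not involve large language models (LLMs), large-model techniques, AI for Science applications, or any other technique named in the scoring keywords. No keyword matches the paper's topic, so every relevance score is 0.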

!!! tip deepseek-chat TL;DR

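The paper targets two problems in semi-supervised semantic segmentation: inaccurate pseudo-labels and the feature gap between labeled and unlabeled images. It proposes supervised ClassMix plus a feature discriminator and reports an average mIoU improvement of 2.07% on the Chase and COVID-19 datasets.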

摘要翻译

在语义分割任务中，为训练数据创建像素级标注需要高昂的成本。为解决这一问题，半监督学习方法受到广泛关注，该方法利用少量标注图像与大量未标注图像共同提升模型性能。传统的半监督学习方法ClassMix通过将从未标注图像预测得到的类别标签粘贴至其他图像上来实现数据增强。然而，由于ClassMix使用从未标注图像获得的伪标签进行操作，存在处理不准确标签的风险。此外，标注图像与未标注图像之间存在数据质量差异，这可能影响特征图的表现。本研究针对这两个问题展开工作。首先，我们提出一种新方法，将标注图像中的类别标签及其对应图像区域同时粘贴至未标注图像及其伪标签图像上。其次，我们引入一种训练策略，使模型对未标注图像的预测结果更接近于对标注图像的预测。在Chase和COVID-19数据集上的实验表明，与传统半监督学习方法相比，本方法在平均交并比（mIoU）指标上平均提升了2.07%。

摘要 (Abstract)

In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.

关键词: semi-supervised learning, semantic segmentation, ClassMix, feature discriminator, medical image analysis, pseudo-labeling, mIoU improvement, COVID-19 dataset
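
The first proposal in the abstract, pasting labeled regions and their ground-truth labels onto an unlabeled image and its pseudo-label map, can be sketched as follows. This is a minimal illustration on nested lists, not the authors' implementation: the function name `paste_classes`, the tiny 2x3 images, and the choice of class 1 are all hypothetical.

```python
def paste_classes(src_img, src_lbl, dst_img, dst_lbl, classes):
    """Copy the pixels and labels of the chosen classes from a labeled
    source image onto a destination image and its (pseudo-)label map."""
    mixed_img = [row[:] for row in dst_img]
    mixed_lbl = [row[:] for row in dst_lbl]
    for r, lbl_row in enumerate(src_lbl):
        for c, cls in enumerate(lbl_row):
            if cls in classes:
                mixed_img[r][c] = src_img[r][c]   # trusted pixel
                mixed_lbl[r][c] = cls             # trusted label
    return mixed_img, mixed_lbl

# Hypothetical toy data: a labeled image with ground truth, and an
# unlabeled image with model-predicted pseudo-labels.
labeled_img = [[10, 11, 12], [13, 14, 15]]
labeled_lbl = [[0, 1, 1], [0, 0, 1]]
unlabeled_img = [[90, 91, 92], [93, 94, 95]]
pseudo_lbl = [[0, 0, 1], [1, 0, 0]]

mixed_img, mixed_lbl = paste_classes(
    labeled_img, labeled_lbl, unlabeled_img, pseudo_lbl, classes={1})
print(mixed_img)  # [[90, 11, 12], [93, 94, 15]]
print(mixed_lbl)  # [[0, 1, 1], [1, 0, 1]]
```

The pasted region carries ground-truth labels, so only the untouched part of the mixed label map still depends on pseudo-labels.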

作者: Chenhao Liu, Zelin Wen, Yan Tong, Junjie Zhu, Xinyu Tian, Yuchi Liu, Ashu Gupta, Syed M. S. Islam, Tom Gedeon, Yue Yao 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07128v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

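Scoring rationale: The paper studies de-identification of medical imaging data to enable cross-hospital data sharing for AI model training. It uses a generative filtering mechanism for images together with ID-filtered text reports, which places it among AI applications in biomedicine. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' relates to the topic, since the work falls under bioinformatics and medical AI applications, although that is not where its technical novelty lies. The remaining keywords concern large-model principles, training methods, inference optimization, agent systems, and similar topics, none of which apply. The paper contributes no innovation in large-model or deep-learning principles: it mentions 'large-scale vision-language model training' only as application context and does not study the model itself.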

!!! tip deepseek-chat TL;DR

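The paper proposes a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. A generative filtering mechanism protects privacy while retaining pathology information, and experiments indicate that models trained on the de-identified data keep competitive diagnostic accuracy while identity-related accuracy drops markedly.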

摘要翻译

大规模放射学数据对于开发稳健的医疗人工智能系统至关重要。然而，跨医院共享此类数据仍因隐私问题受到严重制约。现有的放射学去标识化研究主要集中于移除可识别信息以实现合规数据发布。然而，去标识化后的放射学数据是否仍能保留足够的效用，以支持大规模视觉-语言模型训练及跨医院迁移，目前尚未得到充分探索。本文提出了一种用于跨医院放射学数据共享的效用保持去标识化流程（UPDP）。具体而言，我们编制了一份隐私敏感术语黑名单和一份病理相关术语白名单。对于放射学图像，我们采用一种生成式过滤机制，合成原始图像的隐私过滤且病理保留的对应版本。这些合成图像对应版本与经过身份标识过滤的报告，随后可安全地在医院间共享，用于下游模型的开发与评估。在公开胸部X光基准数据集上的实验表明，我们的方法能有效移除隐私敏感信息，同时保留诊断相关的病理线索。基于去标识化数据训练的模型，与基于原始数据训练的模型相比，保持了具有竞争力的诊断准确性，同时在身份相关准确性上表现出显著下降，证实了有效的隐私保护。在跨医院场景中，我们进一步证明，去标识化数据可与本地数据结合使用，以获得更优的性能。

摘要 (Abstract)

Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.

关键词: de-identification, radiology data sharing, privacy preservation, utility-preserving, generative filtering, cross-hospital transfer, medical AI, vision-language model
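
The abstract mentions a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms but does not say how they are applied. The sketch below shows one plausible use for report text only, under loud assumptions: the term lists are invented, the `[REDACTED]` marker is arbitrary, and the rule that the whitelist takes precedence over the blacklist is a guess, not something the paper confirms.

```python
import re

# Hypothetical term lists, for illustration only.
BLACKLIST = {"john", "doe", "paget"}               # privacy-sensitive terms
WHITELIST = {"effusion", "pneumonia", "paget"}     # pathology-related terms

def filter_report(text, blacklist=BLACKLIST, whitelist=WHITELIST):
    """Redact blacklisted words unless the whitelist protects them
    (assumed precedence rule)."""
    def replace(match):
        word = match.group(0)
        key = word.lower()
        if key in whitelist:
            return word            # keep pathology vocabulary intact
        if key in blacklist:
            return "[REDACTED]"    # drop identifying vocabulary
        return word
    return re.sub(r"[A-Za-z]+", replace, text)

report = "John Doe: Paget disease, small effusion."
filtered = filter_report(report)
print(filtered)  # [REDACTED] [REDACTED]: Paget disease, small effusion.
```

A word-level filter like this would not catch identifiers such as dates or record numbers; it only illustrates how the two lists could interact.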

204. ❌ Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples

作者: Reiji Saito, Satoshi Kamiya, Kazuhiro Hotta 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07097v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
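Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0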

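Scoring rationale: The paper addresses industrial anomaly detection. It proposes new scenarios and an evaluation metric to handle the ambiguous definition of normal samples, plus the RePaste method to enhance learning. The content is confined to computer vision and anomaly detection. It does not involve large language models, large-model training methods, inference optimization, alignment, agent systems, or AI applications in science. Every scoring keyword concerns large-model technology, whereas this paper studies image anomaly detection, so every relevance score is 0.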

!!! tip deepseek-chat TL;DR

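The paper addresses the ambiguous definition of normal samples in industrial anomaly detection. It proposes new evaluation scenarios and a metric, and develops RePaste, which achieves state-of-the-art performance on the proposed metric in scenarios built on the MVTec AD benchmark.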

摘要翻译

在传统异常检测中，训练数据仅包含正常样本。然而，在实际应用场景中，正常样本的定义往往存在模糊性。例如，某些样本可能存在微小划痕或污渍，但在实际使用中仍可被接受。另一方面，当制造设备升级时，对检测精度的要求会提高。此时，正常样本中可能包含我们希望归类为异常的微小划痕、细小灰尘颗粒或异物。这类情况在工业环境中频繁发生，但迄今为止尚未得到充分讨论。为此，我们提出了新的场景定义与评估指标，以适应实际应用中规范标准的变化。此外，为应对正常样本的模糊性问题，我们提出了RePaste方法，该方法通过将前一步骤中异常得分较高的区域重新粘贴至下一步骤的输入中，以增强学习效果。在使用MVTec AD基准数据集的场景实验中，RePaste在提出的评估指标上达到了最优性能，同时保持了较高的AUROC和PRO分数。代码地址：https://github.com/ReijiSoftmaxSaito/Scenario

摘要 (Abstract)

In conventional anomaly detection, training data consist of only normal samples. However, in real-world scenarios, the definition of a normal sample is often ambiguous. For example, there are cases where a sample has small scratches or stains but is still acceptable for practical usage. On the other hand, higher precision is required when manufacturing equipment is upgraded. In such cases, normal samples may include small scratches, tiny dust particles, or a foreign object that we would prefer to classify as an anomaly. Such cases frequently occur in industrial settings, yet they have not been discussed until now. Thus, we propose novel scenarios and an evaluation metric to accommodate specification changes in real-world applications. Furthermore, to address the ambiguity of normal samples, we propose RePaste, which enhances learning by re-pasting regions with high anomaly scores from the previous step into the input for the next step. In our scenarios using the MVTec AD benchmark, RePaste achieved state-of-the-art performance with respect to the proposed evaluation metric, while maintaining high AUROC and PRO scores. Code: https://github.com/ReijiSoftmaxSaito/Scenario

关键词: anomaly detection, normal sample ambiguity, industrial applications, evaluation metrics, RePaste method, MVTec AD benchmark, state-of-the-art performance

205. ❌ Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data

作者: Mojgan Madadikhaljan, Jonathan Prexl, Isabelle Wittmann, Conrad M Albrecht, Michael Schmitt 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07092v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

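Scoring rationale: The paper proposes LIANet, a coordinate-based neural representation for modeling multi-temporal Earth observation data. Its core contribution is a continuous spatiotemporal neural field that reconstructs satellite imagery from coordinates and supports fine-tuning on downstream tasks. Relevance to the keywords: (1) 'AI for Science' is highly relevant (10), since the paper is explicitly an AI application in Earth science; (2) 'Pre-training' and 'Post-training/SFT' are partly relevant (8 each), since the paper involves pretraining the neural representation and fine-tuning it for downstream tasks; (3) the remaining keywords are essentially unrelated (0), since the paper does not involve large language models, MoE, reasoning, alignment, compression, or similar techniques and instead focuses on a neural representation specific to Earth observation.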

!!! tip deepseek-chat TL;DR

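The paper proposes LIANet, a continuous spatiotemporal neural representation that reconstructs Earth observation satellite imagery from coordinates. It shows that, after pretraining, fine-tuning LIANet on downstream tasks achieves performance competitive with training from scratch or using established geospatial foundation models.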

摘要翻译

本研究提出LIANet（Location Is All You Need Network），一种基于坐标的神经表征方法，它将特定感兴趣区域的多时相星载地球观测（Earth Observation, EO）数据建模为连续的时空神经场。仅需输入空间与时间坐标，LIANet即可重建对应的卫星影像。该神经表征经过预训练后，可适配于多种地球观测下游任务，如语义分割或像素级回归，且关键优势在于无需调用原始卫星数据。LIANet旨在为用户提供一种易用的地理空间基础模型（Geospatial Foundation Models, GFMs）替代方案，它消除了终端用户在数据获取与预处理上的负担，并支持仅基于标注数据进行微调。我们在不同尺度的目标区域上展示了LIANet的预训练过程，并证明其在下游任务中经过微调后，相较于从头训练或使用现有地理空间基础模型，能达到具有竞争力的性能。源代码与数据集已公开于https://github.com/mojganmadadi/LIANet/tree/v1.0.1。

摘要 (Abstract)

In this work, we present LIANet (Location Is All You Need Network), a coordinate-based neural representation that models multi-temporal spaceborne Earth observation (EO) data for a given region of interest as a continuous spatiotemporal neural field. Given only spatial and temporal coordinates, LIANet reconstructs the corresponding satellite imagery. Once pretrained, this neural representation can be adapted to various EO downstream tasks, such as semantic segmentation or pixel-wise regression, importantly, without requiring access to the original satellite data. LIANet intends to serve as a user-friendly alternative to Geospatial Foundation Models (GFMs) by eliminating the overhead of data access and preprocessing for end-users and enabling fine-tuning solely based on labels. We demonstrate the pretraining of LIANet across target areas of varying sizes and show that fine-tuning it for downstream tasks achieves competitive performance compared to training from scratch or using established GFMs. The source code and datasets are publicly available at https://github.com/mojganmadadi/LIANet/tree/v1.0.1.

关键词: neural representation, Earth observation, spatiotemporal, coordinate-based, pretraining, fine-tuning, satellite imagery, Geospatial Foundation Models

206. ❌ AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors

作者: Xiaoxue Zhang, Xiaoxu Zheng, Yixuan Yin, Tiao Zhao, Kaihua Tang, Michael Bi Mi, Zhan Xu, Dave Zhenyu Chen 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07053v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

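Scoring rationale: AnchorSplat is a computer vision method for 3D scene reconstruction, specifically an improvement on 3D Gaussian Splatting. The scoring keywords all concern large language models (LLMs), large-model techniques, or AI applications in science, whereas the paper's core content is 3D geometric representation and rendering efficiency. That places it in computer graphics and 3D vision, with no direct connection to the keyword list. The paper does not involve language models, model training techniques, reasoning methods, alignment, agent systems, or any concrete AI for Science application.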

!!! tip deepseek-chat TL;DR

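The paper proposes AnchorSplat, an anchor-aligned Gaussian representation framework guided by 3D geometric priors for scene-level 3D reconstruction. It reduces the number of Gaussian primitives while improving reconstruction fidelity and computational efficiency, and reports state-of-the-art performance on the ScanNet++ v2 NVS benchmark.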

摘要翻译

近期前馈式高斯重建模型普遍采用像素对齐的构建方式，将每个二维像素映射至一个三维高斯，使得高斯表示与输入图像紧密耦合。本文提出AnchorSplat——一种新颖的前馈式三维高斯泼溅框架，用于场景级重建，直接在三维空间中表征场景。AnchorSplat引入了一种基于三维几何先验（如稀疏点云、体素或RGB-D点云）引导的锚点对齐高斯表示，从而构建出与图像分辨率和视角数量无关、更具几何感知能力的可渲染三维高斯模型。这一设计显著减少了所需高斯的数量，在提升重建保真度的同时提高了计算效率。除锚点对齐设计外，我们采用高斯精炼器（Gaussian Refiner），仅通过少量前向传播即可调整中间高斯。在ScanNet++ v2 NVS基准测试上的实验证明了其领先性能，以更高的视图一致性和显著更少的高斯基元数量超越了先前方法。

摘要 (Abstract)

Recent feed-forward Gaussian reconstruction models adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, entangling Gaussian representations tightly with the input images. In this paper, we propose AnchorSplat, a novel feed-forward 3DGS framework for scene-level reconstruction that represents the scene directly in 3D space. AnchorSplat introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling more geometry-aware renderable 3D Gaussians that are independent of image resolution and number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. Beyond the anchor-aligned design, we utilize a Gaussian Refiner to adjust the intermediate Gaussians via merely a few forward passes. Experiments on the ScanNet++ v2 NVS benchmark demonstrate SOTA performance, outperforming previous methods with more view-consistent results and substantially fewer Gaussian primitives.

关键词: 3D Gaussian Splatting, scene reconstruction, geometric priors, anchor-aligned representation, feed-forward framework, computational efficiency, view consistency, Gaussian primitives

207. ❌ PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing

作者: Chengyu Fang, Chunming He, Yuelin Zhang, Chubin Chen, Chenyang Zhu, Longxiang Tang, Xiu Li 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07048v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

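Scoring rationale: The paper addresses image dehazing in computer vision. It proposes PRISM, a dehazing framework built on the physical scattering model, along with non-uniform haze synthesis and a self-distillation adaptation scheme. The content is confined to image processing, physical modeling, and deep learning for vision tasks. It has no direct connection to the scoring keywords, which all concern large language models, training techniques, reasoning methods, AI agents, and other NLP or general-AI topics. The paper uses deep learning but does not touch large models, language models, alignment training, inference acceleration, or any other keyword topic.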

!!! tip deepseek-chat TL;DR

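For real-world image dehazing, the paper proposes the PRISM framework, which jointly reconstructs the clear scene and the scattering variables under the physical scattering model, and adds non-uniform haze synthesis and a self-distillation adaptation scheme. It reports state-of-the-art performance on real-world benchmarks.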

摘要翻译

真实世界图像去雾（RID）旨在消除真实场景中由雾霾引起的退化。由于雾霾分布不均匀、多光源导致的空间变化光照以及成对真实有雾-清晰数据的稀缺性，该任务仍具挑战性。在PRISM中，我们提出了近端散射大气重建（PSAR），这是一个物理结构化的框架，能够在大气散射模型下联合重建清晰场景与散射变量，从而提升复杂区域和混合光照条件下的可靠性。为弥合合成与真实数据之间的差距，我们设计了一种在线非均匀雾霾合成流程以及针对非配对真实场景的选择性自蒸馏适应方案，使模型能够选择性地从高质量感知目标中学习，同时利用其内在的散射理解来检测残留雾霾并引导自我优化。在真实世界基准数据集上的大量实验表明，PRISM在RID任务中实现了最先进的性能。

摘要 (Abstract)

Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying illumination from multiple light sources, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattered Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, thereby improving reliability in complex regions and mixed-light conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Extensive experiments on real-world benchmarks demonstrate that PRISM achieves state-of-the-art performance on RID tasks.

关键词: image dehazing, atmospheric scattering model, non-uniform haze synthesis, self-distillation adaptation, real-world image processing, physical model reconstruction, haze removal, computer vision
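
The abstract names the atmospheric scattering model without writing it out. In its commonly used form, a hazy pixel is I = J·t + A·(1 − t), where J is the clear-scene intensity, t the transmission, and A the airlight. The sketch below applies that standard formula to a few pixels with per-pixel (non-uniform) transmission. It is not PSAR or the paper's synthesis pipeline, and all numeric values and function names are assumed for illustration.

```python
def add_haze(clear, transmission, airlight):
    """Synthesize hazy intensities: I = J * t + A * (1 - t)."""
    return [j * t + airlight * (1.0 - t)
            for j, t in zip(clear, transmission)]

def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A.
    Clamping t avoids dividing by near-zero transmission."""
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(hazy, transmission)]

# Assumed toy values: three pixels with different haze densities.
clear = [0.2, 0.5, 0.8]
transmission = [0.9, 0.5, 0.2]   # non-uniform: thin, medium, thick haze
airlight = 1.0

hazy = add_haze(clear, transmission, airlight)
recovered = remove_haze(hazy, transmission, airlight)
print([round(v, 3) for v in hazy])       # [0.28, 0.75, 0.96]
print([round(v, 3) for v in recovered])  # [0.2, 0.5, 0.8]
```

Inversion is exact here only because the true transmission and airlight are known; in real dehazing both must be estimated, which is why the abstract describes reconstructing the scene and the scattering variables jointly.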

208. ❌ Not all tokens contribute equally to diffusion learning

作者: Guoqing Zhang, Lu Shi, Wanru Xu, Linna Zhang, Sen Wang, Fangfang Wang, Yigang Cen 期刊/来源: arxiv 发布日期: 2026-04-08 arXiv链接: http://arxiv.org/abs/2604.07026v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

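Scoring rationale: The paper studies semantic guidance in conditional diffusion models for text-to-video generation and proposes the DARE framework to address distributional bias in the training data and spatial misalignment in cross-attention. The scoring keywords all target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, agents, and so on), whereas this paper focuses on diffusion models, a different generative paradigm. Both belong to deep learning, but the paper does not involve any LLM technique, architecture, training method, or application. Every keyword is therefore scored 0.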

!!! tip deepseek-chat TL;DR

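The paper addresses conditional diffusion models neglecting semantically important tokens in text-to-video generation. It proposes the DARE framework, which improves generation fidelity and semantic alignment through distribution-aware rectification and spatial ensemble.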

摘要翻译

随着条件扩散模型的快速发展，文本到视频生成领域取得了显著进展。然而，我们观察到这些模型在推理过程中常常忽略语义重要的标记，导致在无分类器引导下产生有偏差或不完整的生成结果。我们将此问题归因于两个关键因素：训练数据中标记频率的长尾分布引起的分布偏差，以及交叉注意力中的空间错位问题——语义重要的标记被信息量较低的标记所掩盖。为解决这些问题，我们提出了分布感知矫正与空间集成（DARE），这是一个从分布去偏差和空间一致性角度改进扩散模型中语义引导的统一框架。首先，我们引入分布矫正无分类器引导（DR-CFG），该方法通过动态抑制语义密度低的优势标记来规范化训练过程，鼓励模型更好地捕捉代表性不足的语义线索，并学习更平衡的条件分布。这一设计降低了模型分布过度拟合低语义密度标记的风险。其次，我们提出空间表示对齐（SRA），该方法根据标记重要性自适应地重新加权交叉注意力图并强制表示一致性，使语义重要的标记在生成过程中发挥更强的空间引导作用。该机制有效防止了低语义密度标记主导注意力分配，从而避免了高语义密度标记提供的空间与分布引导被稀释。在多个基准数据集上的大量实验表明，DARE持续提升了生成保真度与语义对齐性，较现有方法取得了显著提升。

Abstract

With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.

Keywords: conditional diffusion models, text-to-video generation, semantic guidance, distributional bias, cross-attention, classifier-free guidance, generation fidelity, semantic alignment
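The abstract above builds on classifier-free guidance (CFG). As background, the standard CFG combination step can be sketched as follows. This is the generic formula only, not the paper's DR-CFG variant, whose details the abstract does not give; the function name and the toy vectors are illustrative assumptions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions.

    Standard CFG: eps = eps_uncond + w * (eps_cond - eps_uncond).
    Inputs are plain lists of floats standing in for model outputs.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy predictions (hypothetical values, not from any model).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
guided = classifier_free_guidance(uncond, cond, guidance_scale=2.0)
```

With a guidance scale of 1 the result equals the conditional prediction, and with 0 it equals the unconditional one; larger scales push the sample further toward the text condition.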

209. ❌ Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training

Authors: Saúl Alonso-Monsalve, Fabio Cufino, Umut Kose, Anna Mascellani, André Rubbia Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

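Scoring rationale: The paper analyses particle-physics detector data using a sparse vision Transformer (ViT) framework with self-supervised pre-training and joint fine-tuning. It belongs to AI for Science (high-energy physics) and is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Its core method involves 'Pre-training OR Continual Pre-training OR Domain Adaptation' (self-supervised pre-training) and 'Post-training OR Supervised Fine-tuning OR SFT' (joint fine-tuning), both central to the paper (10 points each). The paper reports interpretability analyses, which gives it some relevance to 'Mechanistic Interpretability OR Explainable AI' (5 points). The remaining keywords mainly concern large language model (LLM) techniques, reasoning, alignment, compression, and so on. The paper applies a vision Transformer to scientific data and does not involve LLMs or these specific techniques, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a self-supervised pre-training framework based on a sparse vision Transformer for analysing data from heterogeneous neutrino detectors in high-energy physics. Pre-training consistently improves performance on tasks such as particle identification, momentum regression, and vertex reconstruction, and also improves data efficiency and transferability.

Abstract (Chinese translation)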

基于加速器的中微子物理学正进入一个能量前沿领域，其中相互作用达到TeV量级并产生异常密集、重叠的探测器信号。在此领域中，传统重建方法对事件解释变得不切实际，尤其是在标记数据稀缺且分析涉及多样化下游目标时。我们提出了一种稀疏视觉Transformer（ViT）框架，用于从异构探测器数据中学习可复用的表征。自监督预训练结合了掩码自编码器重建与面向层级结构、鬼影识别和粒子识别的体素级关系目标，随后将所得的共享编码器在分类和回归任务上进行联合微调。在LHC上拟议的FASERCal概念模拟事件中评估表明，与从头训练相比，预训练持续提升了中微子味识别、粲夸克识别、动量回归和顶点重建的性能，而关系目标的加入在拓扑结构最复杂的通道中带来了进一步增益。可解释性分析进一步显示，预训练产生了更具结构化的潜在空间，而对探测器子系统的消融实验则恢复了异构输入在物理上合理的通道依赖性作用。数据效率研究表明，仅需约$10^3$个标记事件，预训练编码器在味分类性能上已与随机初始化模型使用高一个数量级数据训练的结果相当。所学表征还能有效迁移到涵盖不同探测器技术和能量尺度的公开基准测试中，达到或超过已发表的基线水平。这些结果支持了在多模态探测器数据上进行自监督预训练，作为实现中微子及粒子探测器分析可复用表征的可扩展路径。

Abstract

Accelerator-based neutrino physics is entering an energy-frontier regime in which interactions reach the TeV scale and produce exceptionally dense, overlapping detector signatures. In this regime, event interpretation becomes impractical for conventional reconstruction approaches, particularly when labelled data are scarce and the analysis spans diverse downstream objectives. We present a sparse ViT framework for learning reusable representations from heterogeneous detector data. Self-supervised pre-training combines masked autoencoder reconstruction with relational voxel-level objectives for hierarchy, ghost and particle identification, and the resulting shared encoder is then jointly fine-tuned across classification and regression tasks. Evaluated on simulated events from the proposed FASERCal concept at the LHC, we find that pre-training consistently improves neutrino flavour and charm-quark identification, momentum regression, and vertex reconstruction over training from scratch, with the addition of relational objectives yielding further gains in the most topologically complex channels. Interpretability analyses further show that pre-training yields a more structured latent space, while detector-subsystem ablations recover physically plausible channel-dependent roles for the heterogeneous inputs. A data-efficiency study shows that, with roughly $10^3$ labelled events, the pre-trained encoder already matches the flavour-classification performance of a randomly initialised model trained on an order of magnitude more data. The learned representations also transfer effectively to publicly available benchmarks spanning different detector technologies and energy scales, matching or exceeding published baselines. These results support self-supervised pre-training on multimodal detector data as a scalable route towards reusable representations for neutrino and particle-detector analysis.

Keywords: self-supervised pre-training, sparse ViT, neutrino detectors, particle physics, representation learning, fine-tuning, AI for science, heterogeneous detector data
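The abstract above mentions masked autoencoder reconstruction as one pre-training objective. A minimal sketch of the masking step, assuming tokens are indexed voxels or patches: the 0.75 mask ratio, the function name, and the seed are hypothetical choices, not values reported by the paper.

```python
import random


def split_masked_visible(num_tokens, mask_ratio, seed=0):
    """Randomly split token indices into visible and masked sets.

    In a masked autoencoder the encoder sees only the visible tokens and
    the decoder is trained to reconstruct the masked ones.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = sorted(rng.sample(range(num_tokens), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_tokens) if i not in masked_set]
    return visible, masked


# Hypothetical example: 8 tokens, 75% of them hidden from the encoder.
visible, masked = split_masked_visible(num_tokens=8, mask_ratio=0.75)
```

The reconstruction loss would then be computed only on the masked indices; the relational voxel-level objectives described in the abstract are not covered by this sketch.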

210. ❌ ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation

Authors: Qingze He, Fagui Liu, Dengke Zhang, Qingmao Wei, Quan Tang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07021v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

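Scoring rationale: The paper studies weakly supervised semantic segmentation and proposes ModuSeg, a training-free framework that improves segmentation by decoupling object discovery from semantic retrieval. It explicitly uses foundation models to build an offline feature bank, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). It recasts segmentation as a non-parametric feature retrieval process, which relates loosely to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (5 points), although this is not RAG in the usual sense. The other keywords (MoE, SLMs, scaling laws, training methods such as pre-training, SFT, and RLHF, inference optimization, agent systems, model compression, and so on) are not addressed and score 0. The paper belongs to computer vision and does not involve bioinformatics, cheminformatics, or other scientific applications, so 'AI for Science' also scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes ModuSeg, a training-free weakly supervised semantic segmentation framework. It decouples object discovery from semantic retrieval and uses foundation models to build a feature bank for non-parametric retrieval, achieving competitive performance on standard benchmark datasets.

Abstract (Chinese translation)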

弱监督语义分割旨在利用图像级标签实现像素级预测。现有方法通常将语义识别与目标定位相耦合，这往往导致模型仅关注稀疏的判别性区域。尽管基础模型展现出巨大潜力，许多方法仍遵循紧密耦合的优化范式，难以有效缓解伪标签噪声，且常依赖耗时的多阶段重训练或不稳定的端到端联合优化。为应对上述挑战，本文提出ModuSeg——一种以显式解耦目标发现与语义分配为核心的无训练弱监督语义分割框架。具体而言，我们整合通用掩码生成器以提取具有可靠边界的几何建议，同时利用语义基础模型构建离线特征库，将分割转化为非参数化的特征检索过程。此外，我们提出语义边界纯化与软掩码特征聚合策略，以有效缓解边界模糊性与量化误差，从而提取高质量的类别原型。大量实验表明，所提出的解耦架构在无需参数微调的情况下能更好地保留精细边界，并在标准基准数据集上取得了极具竞争力的性能。代码发布于https://github.com/Autumnair007/ModuSeg。

Abstract

Weakly supervised semantic segmentation aims to achieve pixel-level predictions using image-level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo-label noise and often relying on time-consuming multi-stage retraining or unstable end-to-end joint optimization. To address the above challenges, we present ModuSeg, a training-free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non-parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft-masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high-quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at https://github.com/Autumnair007/ModuSeg.

Keywords: weakly supervised semantic segmentation, training-free, foundation models, object discovery, semantic retrieval, feature bank, non-parametric retrieval, boundary purification
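The abstract above describes segmentation as non-parametric retrieval against an offline feature bank. One common way to realise such retrieval is nearest-prototype matching by cosine similarity, sketched below. The bank layout (one prototype vector per class), the function names, and the toy vectors are assumptions for illustration; the paper's actual retrieval procedure may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_label(query, feature_bank):
    """Return the class whose prototype is most similar to the query.

    feature_bank maps a class name to a prototype feature vector.
    No parameters are trained: the decision is a pure lookup.
    """
    return max(feature_bank, key=lambda label: cosine_similarity(query, feature_bank[label]))


# Hypothetical 3-dimensional prototypes standing in for foundation-model features.
bank = {"cat": [1.0, 0.1, 0.0], "dog": [0.0, 1.0, 0.2]}
label = retrieve_label([0.9, 0.2, 0.0], bank)
```

In a full pipeline the query would be a feature pooled over one mask proposal, so each proposal receives the label of its closest prototype.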

211. ❌ Synthetic Dataset Generation for Partially Observed Indoor Objects

Authors: Jelle Vermandere, Maarten Bassier, Maarten Vergauwen Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07010v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

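Scoring rationale: The paper focuses on computer vision and 3D scene reconstruction. It proposes a virtual scanning framework for generating synthetic 3D scan datasets and introduces the V-Scan dataset. The content covers 3D scan simulation, point cloud generation, and indoor scene synthesis, and does not touch any topic covered by the keywords, such as large language models, deep learning techniques, or AI for Science. Because the paper is purely a 3D vision and dataset generation work, every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the high cost of acquiring real scan datasets for 3D scene reconstruction and object completion. It develops a virtual scanning framework in Unity together with a procedural indoor scene generation pipeline, and uses them to create the V-Scan dataset, which contains synthetic indoor scans, partial point clouds, and complete ground-truth geometry.

Abstract (Chinese translation)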

基于学习的3D场景重建与物体补全方法需要包含局部扫描数据与完整真实几何配对的庞大数据集。然而，通过真实世界扫描系统获取此类数据集成本高昂且耗时，尤其是在需要被遮挡区域的精确真实数据时。本研究提出了一种在Unity中实现的虚拟扫描框架，用于生成逼真的合成3D扫描数据集。该系统通过可配置参数（如扫描分辨率、测量范围和距离相关噪声）模拟真实扫描仪的工作特性。该框架不直接对网格表面采样，而是从虚拟视点进行基于射线的扫描，从而实现对传感器可见性与遮挡效应的真实建模。此外，系统利用扫描位置捕获的全景图像为生成的点云赋予色彩。为支持可扩展的数据集创建，扫描仪与程序化室内场景生成管线集成，可自动生成多样化的房间布局与家具配置。基于此系统，我们推出了V-Scan数据集，其中包含合成室内扫描数据、物体级局部点云、基于体素的遮挡网格以及完整的真实几何数据。该数据集为基于学习的场景重建与物体补全方法的训练与评估提供了有价值的监督信息。

Abstract

Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the V-Scan dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.

Keywords: synthetic dataset generation, 3D scene reconstruction, virtual scanning framework, point clouds, indoor objects, procedural scene generation, object completion, V-Scan dataset

212. ❌ IQ-LUT: interpolated and quantized LUT for efficient image super-resolution

Authors: Yuxuan Zhang, Zhikai Dong, Xinning Chai, Xiangyun Zhou, Yi Xu, Zhengxue Cheng, Li Song Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07000v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

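Scoring rationale: The paper focuses on optimizing lookup-table (LUT) methods for image super-resolution. It is highly relevant only to 'Quantization OR Model Compression OR Low-bit Weights' (10 points), because its core contributions include non-uniform quantization to reduce storage cost. The other keywords concern large models, deep learning techniques, or scientific AI applications and are unrelated to the paper's computer vision and image processing topic, so they score 0.

!!! tip deepseek-chat TL;DR

The paper tackles the storage bottleneck in LUT-based image super-resolution caused by exponential growth of the index space. It combines interpolation, quantization, and residual learning to reduce storage cost while improving super-resolution quality.

Abstract (Chinese translation)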

查找表（LUT）方法在加速图像超分辨率推理方面展现出巨大潜力。然而，通过扩大感受野和增加位深来追求更高图像质量，会引发LUT索引空间的指数级增长，造成存储瓶颈，从而限制其在资源受限设备上的部署。我们提出了IQ-LUT方法，该方法在缩减LUT大小的同时，提升了超分辨率质量。首先，我们将插值与量化技术集成到单输入多输出的ECNN中，显著压缩了索引空间，从而大幅降低了整体LUT的存储需求。其次，残差学习的引入降低了对LUT位深的依赖，这有助于提升训练稳定性，并优先重建细粒度细节以获得更优的视觉质量。最后，在知识蒸馏的指导下，我们的非均匀量化过程优化了量化层级，从而在减少存储的同时有效补偿了量化损失。大量基准测试表明，我们的方法在显著降低存储成本（相比ECNN最高可达50倍）的同时，实现了更优的超分辨率质量。

Abstract

Lookup table (LUT) methods demonstrate considerable potential in accelerating image super-resolution inference. However, pursuing higher image quality through larger receptive fields and bit-depth triggers exponential growth in the LUT’s index space, creating a storage bottleneck that limits deployment on resource-constrained devices. We introduce IQ-LUT, which achieves a reduction in LUT size while simultaneously enhancing super-resolution quality. First, we integrate interpolation and quantization into the single-input, multiple-output ECNN, which dramatically reduces the index space and thereby the overall LUT size. Second, the integration of residual learning mitigates the dependence on LUT bit-depth, which facilitates training stability and prioritizes the reconstruction of fine-grained details for superior visual quality. Finally, guided by knowledge distillation, our non-uniform quantization process optimizes the quantization levels, thereby reducing storage while also compensating for quantization loss. Extensive benchmarking demonstrates our approach substantially reduces storage costs (by up to 50x compared to ECNN) while achieving superior super-resolution quality.

Keywords: Lookup table, Image super-resolution, Quantization, Storage reduction, Interpolation, Knowledge distillation, ECNN, Residual learning
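The abstract above notes that the LUT index space grows exponentially with receptive field and bit-depth. The arithmetic behind that claim can be sketched as follows. The configuration used here (4 input pixels, 16 output values per entry, 1 byte per value, and 17 sampled levels after coarse quantization) is an illustrative assumption and is not taken from the paper.

```python
def lut_entries(levels_per_pixel, receptive_field):
    """Number of LUT entries when every input pixel indexes one axis."""
    return levels_per_pixel ** receptive_field


def lut_size_bytes(levels_per_pixel, receptive_field, outputs_per_entry, bytes_per_output=1):
    """Total storage for a dense LUT under the assumed layout."""
    return lut_entries(levels_per_pixel, receptive_field) * outputs_per_entry * bytes_per_output


# Full 8-bit indexing over 4 pixels: 256**4 entries, far too large to store.
full = lut_size_bytes(levels_per_pixel=256, receptive_field=4, outputs_per_entry=16)

# Coarse sampling of the index space: 17 levels per pixel, with values
# between sampled levels recovered by interpolation at lookup time.
sampled = lut_size_bytes(levels_per_pixel=17, receptive_field=4, outputs_per_entry=16)

print(f"full LUT:    {full / 2**30:.1f} GiB")
print(f"sampled LUT: {sampled / 2**20:.2f} MiB")
```

Each extra pixel in the receptive field multiplies the entry count by the number of levels, which is why reducing the levels per pixel through quantization and interpolation has such a large effect on storage.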

213. ❌ Canopy Tree Height Estimation Using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing

Authors: Karsten Schrödter, Jan Pauls, Fabian Gieseke Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

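Scoring rationale: The paper focuses on tree height estimation in remote sensing. It adds uncertainty quantification to existing satellite-based estimation models by modifying their prediction heads and training them with quantile regression. It does not address large models, training methods, inference optimization, agent systems, or the other techniques named in the keywords, so those keywords score 0. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because remote sensing for ecological monitoring is a scientific application in the broad sense; it receives 5 points (partial relevance).

!!! tip deepseek-chat TL;DR

The study applies quantile regression to satellite-based tree height estimation models to provide statistically calibrated uncertainty estimates, and shows that the models are less confident under complex terrain and heterogeneous vegetation.

Abstract (Chinese translation)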

精确的树木高度估算对于生态监测与生物量评估至关重要。本研究将分位数回归应用于现有基于卫星数据的树高估算模型，以纳入不确定性量化。当前大多数树高估算方法依赖于点预测，这限制了其在风险敏感场景中的适用性。本研究表明，通过对给定预测头进行微小修改，现有模型可通过分位数回归适配为提供统计校准的不确定性估计。此外，我们论证了模型结果如何与遥感领域的已知挑战（如地形复杂性、植被异质性）相关联，表明模型在更具挑战性的条件下置信度较低。

Abstract

Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches for tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that, with minor modifications of a given prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.

Keywords: tree height estimation, quantile regression, uncertainty quantification, remote sensing, satellite data, ecological monitoring, biomass assessment, statistical calibration
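The abstract above relies on quantile regression for calibrated uncertainty. Quantile regression is usually trained with the pinball (quantile) loss, and calibration can be checked by measuring how often the truth falls inside a predicted interval. The sketch below shows both, assuming the standard textbook definitions; the function names and the toy heights are hypothetical and do not come from the paper.

```python
def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss for quantile level tau in (0, 1).

    Under-predictions are weighted by tau and over-predictions by
    (1 - tau), so minimising it drives y_pred toward the tau-quantile.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


def empirical_coverage(y_true, lower, upper):
    """Fraction of targets that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


# Hypothetical reference heights in metres with predicted 5% and 95% quantiles.
heights = [12.0, 18.5, 25.0, 31.0]
q05 = [10.0, 15.0, 26.0, 27.0]
q95 = [15.0, 22.0, 30.0, 35.0]
coverage = empirical_coverage(heights, q05, q95)
```

For a well-calibrated model the coverage of the 5% to 95% interval should approach 0.90 on held-out data; a wider interval in difficult terrain is how lower confidence would show up.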

214. ❌ Compression as an Adversarial Amplifier Through Decision Space Reduction

Authors: Lewis Evans, Harkrishan Jandu, Zihan Ye, Yang Lu, Shreyank N Gowda Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06954v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

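Scoring rationale: The paper studies how image compression affects the adversarial robustness of deep image classifiers, which places it in computer vision and adversarial machine learning. The keywords concern large models, deep learning techniques, or scientific AI applications, whereas this paper focuses on image classification and compression and has no direct connection to them. The compression studied here is image compression rather than model compression, which is consistent with the 0 relevance recorded for 'Quantization OR Model Compression OR Low-bit Weights'.

!!! tip deepseek-chat TL;DR

The paper studies how image compression amplifies the effectiveness of adversarial attacks through decision space reduction, and finds that compression-aware attacks are substantially more effective than pixel-space attacks.

Abstract (Chinese translation)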

图像压缩是现代视觉处理流程中无处不在的组成部分，通常由社交媒体平台和资源受限系统在推理前应用。尽管其应用广泛，压缩对抗鲁棒性的影响仍鲜为人知。我们研究了一种先前未被探索的对抗场景，其中攻击直接在压缩表示中实施，并证明压缩可以充当深度图像分类器的对抗性放大器。在相同的标称扰动预算下，具备压缩感知的攻击比其像素空间对应方法更为有效。我们将此效应归因于决策空间的缩减，即压缩诱导了一种不可逆的、丢失信息的变换，这种变换压缩了分类边界并增加了对扰动的敏感性。在标准基准和架构上进行的大量实验支持了我们的分析，并揭示了在包含压缩环节的部署设置中存在一个关键漏洞。代码将予以公开。

Abstract

Image compression is a ubiquitous component of modern visual pipelines, routinely applied by social media platforms and resource-constrained systems prior to inference. Despite its prevalence, the impact of compression on adversarial robustness remains poorly understood. We study a previously unexplored adversarial setting in which attacks are applied directly in compressed representations, and show that compression can act as an adversarial amplifier for deep image classifiers. Under identical nominal perturbation budgets, compression-aware attacks are substantially more effective than their pixel-space counterparts. We attribute this effect to decision space reduction, whereby compression induces a non-invertible, information-losing transformation that contracts classification margins and increases sensitivity to perturbations. Extensive experiments across standard benchmarks and architectures support our analysis and reveal a critical vulnerability in compression-in-the-loop deployment settings. Code will be released.

Keywords: image compression, adversarial robustness, deep image classifiers, decision space reduction, compression-aware attacks, classification margins, vulnerability, compression-in-the-loop

215. ❌ Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Authors: Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06961v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

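Scoring rationale: The paper studies fairness in facial landmark detection, a computer vision task, focusing on age, gender, and race bias. Its content centres on computer vision, fairness auditing, statistical methodology, and human-robot interaction, and does not involve large language models, new deep learning techniques, or AI applications in science. All keywords relate to large language models, deep learning techniques, or scientific AI applications and are unrelated to the paper's computer vision topic.

!!! tip deepseek-chat TL;DR

The paper systematically audits age, gender, and race bias in facial landmark detection models. It finds that confounders such as image resolution and head pose matter more than demographic attributes. After controlling for them, gender and race disparities vanish, but a statistically significant performance bias remains for older individuals.

Abstract (Chinese translation)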

人机交互中的公平性关键取决于机器人感知模型解释人类行为的可靠性。尽管人口统计学偏倚已在高层级面部分析任务中得到广泛研究，但其在面部关键点检测中的存在仍未得到探索。本文对该任务中的人口统计学偏倚进行了系统性审计，分析了年龄、性别和种族偏倚。为此，我们引入了一种受控统计方法，以分离人口统计学效应与混杂视觉因素。对标准代表性模型的评估表明，混杂视觉因素——特别是头部姿态和图像分辨率——的影响远超人口统计学属性。值得注意的是，在控制这些混杂因素后，我们发现跨性别和种族的性能差异基本消失。然而，我们识别出统计学上显著的年龄相关效应，在年长个体中观察到更高的偏倚。这表明公平性问题甚至可能在低层级视觉组件中出现，并可能通过人机交互流程传播，对弱势群体造成不成比例的影响。我们认为，审计并修正此类偏倚是构建可信且公平的机器人感知系统的必要步骤。

Abstract

Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender and race biases. To this end we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher biases observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

Keywords: facial landmark detection, demographic bias, fairness audit, human-robot interaction, statistical methodology, age bias, confounding factors, robot perception

216. ❌ MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Authors: Xiaoxiao Ma, Jiachen Lei, Tianfei Ren, Jie Huang, Siming Fu, Aiming Hao, Jiahong Wu, Xiangxiang Chu, Feng Zhao
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

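Score rationale: The paper focuses on image generation. It studies the stability of reinforcement-learning training for hybrid autoregressive-diffusion models (MAR) and proposes the MAR-GRPO framework. All scoring keywords relate to large language models (LLMs), covering technical principles, training methods, inference optimization, and applications. The paper does not involve LLMs at all; it applies RL optimization to image generation models (autoregressive and diffusion) in computer vision, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address unstable reinforcement-learning training of hybrid autoregressive-diffusion image generation models, the paper proposes the MAR-GRPO framework. Using multi-trajectory expectation and a consistency-aware token selection strategy, it consistently improves training stability, visual quality, and spatial structure understanding.

Abstract (Chinese translation)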

强化学习（RL）已成功应用于自回归（AR）模型与扩散模型。然而，将RL扩展至混合AR-扩散框架仍面临挑战，主要源于交错的推理过程与带噪声的对数概率估计。本研究聚焦于掩码自回归模型（MAR），发现扩散头在训练动态中起着关键作用，其常引入带噪声的梯度，导致训练不稳定与早期性能饱和。为解决此问题，我们提出一种针对MAR的稳定化RL框架。我们引入了多轨迹期望（MTE），通过对多条扩散轨迹取平均来估计优化方向，从而降低扩散过程引发的梯度噪声。为避免过度平滑，我们进一步从多条轨迹中估计词元级不确定性，并仅对不确定性最高的前k%词元应用多轨迹优化。此外，我们提出一种一致性感知的词元选择策略，用于筛除与最终生成内容对齐度较低的AR词元。在多个基准测试上的广泛实验表明，相较于基线GRPO及RL训练前的模型，我们的方法在视觉质量、训练稳定性及空间结构理解方面均取得持续提升。代码发布于：https://github.com/AMAP-ML/mar-grpo。

Abstract

Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.

Keywords: reinforcement learning, autoregressive models, diffusion models, hybrid AR-diffusion, training stability, gradient noise, multi-trajectory expectation, token-wise uncertainty
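The multi-trajectory expectation and top-k% token selection described in the abstract can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: it assumes per-token log-probability estimates are available for several diffusion trajectories, uses the variance across trajectories as the uncertainty measure, and averages the log-probabilities themselves. The names `select_uncertain_tokens` and `mte_logps` are hypothetical.

```python
from statistics import mean, pvariance


def select_uncertain_tokens(logps, top_frac):
    """Return indices of the most uncertain tokens.

    logps[k][t] is the log-probability estimate of token t under trajectory k.
    Uncertainty is taken to be the variance across trajectories (an assumption).
    """
    num_tokens = len(logps[0])
    variances = [pvariance([traj[t] for traj in logps]) for t in range(num_tokens)]
    k = max(1, round(top_frac * num_tokens))
    ranked = sorted(range(num_tokens), key=lambda t: variances[t], reverse=True)
    return set(ranked[:k])


def mte_logps(logps, top_frac=0.25):
    """Average over trajectories only for the uncertain tokens.

    All other tokens keep the estimate from the first trajectory, which is
    how this sketch avoids smoothing every token.
    """
    uncertain = select_uncertain_tokens(logps, top_frac)
    smoothed = []
    for t in range(len(logps[0])):
        if t in uncertain:
            smoothed.append(mean(traj[t] for traj in logps))
        else:
            smoothed.append(logps[0][t])
    return smoothed, uncertain


# Three trajectories, four tokens; only token 1 disagrees across trajectories.
example = [
    [-1.0, -2.0, -0.5, -3.0],
    [-1.0, -4.0, -0.5, -3.0],
    [-1.0, -3.0, -0.5, -3.0],
]
smoothed, uncertain = mte_logps(example, top_frac=0.25)
```

In this toy input only token 1 is selected, so it is the only position whose estimate is replaced by the cross-trajectory average.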

217. ❌ NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results

Authors: Wenbin Zou, Tianyi Li, Kejun Wu, Huiping Zhuang, Zongwei Wu, Zhuyun Zhou, Radu Timofte, Kim-Hui Yap, Lap-Pui Chau, Yi Wang, Shiqi Zhou, Xiaodi Shi, Yuxiang Chen, Yilian Zhong, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Zhitao Wang, Lifa Ha, Hengyu Man, Xiaopeng Fan, Priyansh Singh, Sidharth, Krrish Dev, Soham Kakkar, Vinit Jakhetiya, Ovais Iqbal Shah, Wei Zhou, Linfeng Li, Qi Xu, Zhenyang Liu, Kepeng Xu, Tong Qiao, Jiachen Tu, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yaokun Shi
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06945v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

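Score rationale: This paper is a report on a computer vision challenge for bitstream-corrupted video restoration. It mainly concerns traditional computer vision tasks such as video processing, image restoration, and compression artifact removal. The content does not involve large language models, innovations in deep learning principles, AI for Science, or the other keyword areas, so all keywords are unrelated to the paper's topic.

!!! tip deepseek-chat TL;DR

The paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration, which aims to recover visually coherent videos from corrupted bitstreams, and summarizes the dataset, evaluation protocol, participating methods, and main technical trends.

Abstract (Chinese translation)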

本文报告了NTIRE 2026比特流损坏视频修复挑战赛（BSCVR）。该挑战旨在推进从损坏比特流中恢复视觉连贯视频的研究，此类比特流解码后常产生严重的时空伪影与内容失真。基于近期比特流损坏视频恢复的研究进展，本挑战为在真实损坏场景下评估修复方法提供了统一基准。我们介绍了所用数据集、评估流程及参赛方法，并总结了最终结果与主要技术趋势。本次挑战凸显了这一新兴任务的难度，并为未来在实际比特流损坏条件下的鲁棒性视频修复研究提供了有益见解。

Abstract

This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.

Keywords: video restoration, bitstream corruption, spatial-temporal artifacts, content distortion, benchmark evaluation, robust video restoration, NTIRE challenge, corrupted video recovery

218. ❌ Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Authors: Zhiheng Li, Zongyang Ma, Yuntong Pan, Ziqi Zhang, Xiaolei Lv, Bo Li, Jun Gao, Jianing Zhang, Chunfeng Yuan, Bing Li, Weiming Hu
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

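Score rationale: The paper studies adversarial attacks on multimodal large language models (MLLMs) used for content moderation. Its core concerns the security vulnerabilities and defenses of large models, so it is highly relevant to "Large Language Models" (10 points). It explores adversarial training via supervised fine-tuning (SFT) as a mitigation, which gives some relevance to "Post-training" (5 points). It tests chain-of-thought (CoT) reasoning as a mitigation strategy, which gives some relevance to "Chain of Thought" (5 points). The attack concerns the model's failure to detect harmful content, which relates to "Factuality" (5 points). Other keywords, such as MoE, SLMs, Scaling Laws, and RLHF, are not covered by the paper and score 0.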
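The arithmetic behind the score table can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes each row's score is simply weight × relevance (which matches the rows shown, where a weight of 0.0 yields a score of 0.0) and that the pass mark follows the formula in the report's configuration section (5.0 + 0.8 × total weight). The function names are hypothetical, and the per-keyword cap from the configuration is not modeled.

```python
def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: a keyword's score is its weight times its relevance (0-10)."""
    return weight * relevance


def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report's configuration: base + factor * total weight."""
    return base + factor * total_weight


# (weight, relevance) for the four rows of this paper with nonzero relevance.
rows = [(0.0, 10.0), (0.0, 5.0), (0.0, 5.0), (0.0, 5.0)]
total = sum(keyword_score(w, r) for w, r in rows)

# 27 keywords with a configured weight of 1.0 each give a total weight of 27.0.
threshold = passing_score(27.0)
passed = total >= threshold
```

Under these assumptions the total is 0.0 against a pass mark of about 26.6, which agrees with the ❌ shown for this entry.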

!!! tip deepseek-chat TL;DR

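The paper reveals a new adversarial threat to multimodal large language models used for content moderation: adversarial smuggling attacks, which evade detection by encoding harmful content in visual formats that humans can read but AI cannot. On the benchmark it constructs, attack success rates against current state-of-the-art models exceed 90%. It also gives a preliminary exploration of chain-of-thought reasoning and adversarial training as mitigation strategies.

Abstract (Chinese translation)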

多模态大语言模型正日益被部署为自动化内容审核工具。在此背景下，我们发现了一种关键威胁：对抗性走私攻击。与旨在导致错误分类的对抗性扰动和旨在生成有害输出的对抗性越狱不同，对抗性走私攻击利用了人类与AI之间的能力差距。它将有害内容编码为人类可读但AI无法识别的视觉格式，从而规避自动检测，实现有害内容的传播。我们将走私攻击分为两种路径：(1) 感知盲区，干扰文本识别；(2) 推理阻断，即使在文本识别成功的情况下，仍抑制语义理解。为评估此威胁，我们构建了SmuggleBench，这是首个包含1700个对抗性走私攻击实例的综合基准。在SmuggleBench上的评估表明，无论是专有模型（如GPT-5）还是开源模型（如Qwen3-VL）等最先进模型均易受此威胁影响，攻击成功率超过90%。通过从感知和推理角度分析其脆弱性，我们确定了三个根本原因：视觉编码器的能力局限、OCR（光学字符识别）的鲁棒性差距，以及领域特定对抗样本的稀缺性。我们对缓解策略进行了初步探索，研究了测试时扩展（通过思维链）和对抗性训练（通过监督微调）在缓解此威胁方面的潜力。我们的代码公开于https://github.com/zhihengli-casia/smugglebench。

Abstract

Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

Keywords: Multimodal Large Language Models, Adversarial Smuggling Attacks, Content Moderation, Perceptual Blindness, Reasoning Blockade, Attack Success Rate, Chain of Thought, Supervised Fine-tuning
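As a rough illustration of the Attack Success Rate (ASR) metric mentioned in the abstract, the sketch below assumes ASR is the fraction of attack instances that the moderator fails to flag. The abstract does not spell out the exact definition, so both this definition and the helper name `attack_success_rate` are assumptions.

```python
def attack_success_rate(flagged):
    """Fraction of attack instances that evade detection.

    flagged[i] is True when the moderator detected attack instance i.
    The definition used here is an assumption, not taken from the paper.
    """
    if not flagged:
        raise ValueError("need at least one attack instance")
    evaded = sum(1 for was_flagged in flagged if not was_flagged)
    return evaded / len(flagged)


# Toy run: 1 of 10 smuggled samples is caught, so 9 of 10 attacks succeed.
toy_results = [True] + [False] * 9
asr = attack_success_rate(toy_results)
```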

219. ❌ Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

Authors: Jintao Chen, Chengyu Bai, Junjun hu, Xinda Xue, Mu Xu
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06939v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

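Score rationale: The paper focuses on autoregressive video synthesis and proposes the Grounded Forcing framework to address long-term consistency. Its core contributions are three mechanisms: Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache. These mechanisms mainly concern cache management, positional encoding, and semantic inheritance in video generation, and are unrelated to most of the large-model keywords. The only relevant keyword is "KV Cache Compression OR Linear Attention OR FlashAttention": the paper explicitly proposes the Dual Memory KV Cache mechanism, a new KV cache design for addressing semantic forgetting, so it receives 5 points (some relevance). The other keywords are not covered by the paper.

!!! tip deepseek-chat TL;DR

To address three challenges in autoregressive video synthesis (semantic forgetting, visual drift, and controllability loss), the paper proposes the Grounded Forcing framework. Through three mechanisms, Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache, it significantly improves semantic consistency and visual stability in long video generation.

Abstract (Chinese translation)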

自回归视频生成为无限时长生成提供了前景广阔的路径，但其根本上受到三个相互交织的挑战所阻碍：因上下文限制导致的语义遗忘、由位置外推引起的视觉漂移，以及在交互式指令切换过程中的可控性丧失。现有方法通常孤立地处理这些问题，限制了长期连贯性。我们提出了“锚定强制”（Grounded Forcing）这一新颖框架，它通过三个互锁机制，桥接了时间无关的语义与邻近动态。首先，针对语义遗忘问题，我们提出了一种双记忆KV缓存（Dual Memory KV Cache），将局部时序动态与全局语义锚点解耦，从而确保长期语义连贯性与身份稳定性。其次，为抑制视觉漂移，我们设计了双参考RoPE注入（Dual-Reference RoPE Injection），将位置嵌入限制在训练流形内，同时使全局语义具有时间不变性。第三，为解决可控性问题，我们开发了非对称邻近重缓存（Asymmetric Proximity Recache），通过邻近加权的缓存更新，在提示词转换期间实现平滑的语义继承。这些组件协同运作，将生成过程锚定于稳定的语义核心，同时适应灵活的局部动态。大量实验表明，“锚定强制”框架显著提升了长程一致性与视觉稳定性，为交互式长视频合成奠定了坚实基础。

Abstract

Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.

Keywords: Autoregressive Video Synthesis, Long-term Coherence, KV Cache, Positional Embeddings, Semantic Forgetting, Visual Drift, Controllability, Grounded Forcing
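The idea of decoupling global semantic anchors from local temporal dynamics in a KV cache can be illustrated with a simple data structure. The sketch below is an assumption-laden toy, not the paper's Dual Memory KV Cache: it assumes the first few entries act as permanent anchors and everything after them lives in a fixed-size sliding window. The class name `DualMemoryCache` and its parameters are hypothetical.

```python
from collections import deque


class DualMemoryCache:
    """Toy cache with a permanent anchor store and a rolling local window."""

    def __init__(self, num_anchors, window):
        self.num_anchors = num_anchors
        self.anchors = []                  # global entries, never evicted
        self.local = deque(maxlen=window)  # recent entries, oldest dropped first

    def append(self, key, value):
        # Fill the anchor store first (an assumption about how anchors are chosen).
        if len(self.anchors) < self.num_anchors:
            self.anchors.append((key, value))
        else:
            self.local.append((key, value))

    def context(self):
        """Entries visible to attention: all anchors followed by the local window."""
        return self.anchors + list(self.local)


cache = DualMemoryCache(num_anchors=2, window=3)
for i in range(10):
    cache.append(f"k{i}", f"v{i}")
visible_keys = [key for key, _ in cache.context()]
```

After ten appends the visible context holds the two anchors plus the three most recent entries, so its size stays bounded regardless of how long generation runs.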

220. ❌ POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Authors: Jiyun Won, Heemin Yang, Woohyeok Kim, Jungseul Ok, Sunghyun Cho
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

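Score rationale: POS-ISP focuses on sequence-level optimization of image signal processing (ISP) pipelines, using a reinforcement learning framework to jointly optimize the module sequence and its parameters. All scoring keywords relate to large language models, deep learning principles, or AI applications in science, whereas this paper studies ISP optimization in computer vision and has no direct connection to them. The paper does not involve large models, deep learning principles, or AI applications in science, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes POS-ISP, a sequence-level reinforcement learning framework for optimizing image signal processing pipelines, which improves task performance and reduces computational cost through global sequence prediction.

Abstract (Chinese translation)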

近期研究通过组合预定义模块并使其适应特定任务目标，探索了针对不同任务优化图像信号处理（ISP）流水线的方法。然而，联合优化模块序列与参数仍具挑战性。现有方法依赖于神经架构搜索（NAS）或分步强化学习（RL），但NAS存在训练-推理不匹配问题，而分步RL由于需进行阶段式决策，易导致训练不稳定且计算开销高昂。我们提出POS-ISP——一种序列级强化学习框架，将模块化ISP优化构建为全局序列预测问题。该方法通过单次前向传播预测完整的模块序列及其参数，并利用终端任务奖励优化流水线，从而无需中间监督和冗余执行。在多个下游任务上的实验表明，POS-ISP在提升任务性能的同时降低了计算成本，凸显了序列级优化作为任务感知ISP的一种稳定高效范式。项目页面详见：https://w1jyun.github.io/POS-ISP

Abstract

Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at https://w1jyun.github.io/POS-ISP

Keywords: image signal processing, ISP pipeline optimization, sequence-level reinforcement learning, modular ISP, task-aware ISP, computational cost reduction, neural architecture search, terminal task reward
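Optimizing a whole module sequence against a single terminal reward can be sketched with a plain REINFORCE-style gradient. The abstract does not name the estimator the authors use, so REINFORCE is swapped in here as an assumption, and the parameter prediction is left out. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's module logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_level_grad(logits_per_step, sequence, reward, baseline=0.0):
    """Gradient of (reward - baseline) * log p(sequence) w.r.t. the logits.

    Every step of the sequence shares the same terminal reward, so no
    intermediate supervision is needed.
    """
    advantage = reward - baseline
    grads = []
    for logits, chosen in zip(logits_per_step, sequence):
        probs = softmax(logits)
        grads.append([
            advantage * ((1.0 if i == chosen else 0.0) - p)
            for i, p in enumerate(probs)
        ])
    return grads


# Two pipeline slots, three candidate modules each; the sampled sequence is
# module 2 followed by module 0, and the downstream task returned reward 1.0.
grads = sequence_level_grad(
    [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]],
    sequence=[2, 0],
    reward=1.0,
    baseline=0.4,
)
```

With a positive advantage, the logits of the chosen modules are pushed up and the remaining logits in each slot are pushed down by the same total amount.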

221. ❌ Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

Authors: Tom Devynck, Bilal Faye, Djamel Bouchaffra, Nadjib Lazaar, Hanane Azzag, Mustapha Lebbah
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06893v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

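Score rationale: The paper focuses on computer vision and proposes a new framework, ERSM, to improve the robustness and interpretability of convolutional neural networks. Its core content covers vision models, feature selection, spatial masking, energy minimization, sparsity, and interpretability. All keywords relate directly to large language models (LLMs) or to applications of large models in various domains, whereas this paper studies purely visual models and is unrelated to LLMs. The only relevant keyword is "Mechanistic Interpretability OR Explainable AI", because the paper explicitly mentions "highly interpretable spatial masks" and "interpretability"; this is not the paper's core contribution (which is the ERSM framework), so it receives 5 points (some relevance). The other keywords concern LLMs, language models, alignment, reasoning, agents, AI for Science, and similar topics that are unrelated to the paper, so they score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Energy-Regularized Spatial Masking (ERSM), a new framework that reformulates feature selection as a differentiable energy minimization problem. It lets convolutional neural networks autonomously discover an optimal information-density equilibrium, yielding emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks while preserving classification accuracy.

Abstract (Chinese translation)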

深度卷积神经网络通过穷举处理密集空间特征图取得了卓越性能，但这种暴力策略引入了显著的计算冗余，并加剧了对虚假背景相关性的依赖。因此，现代视觉模型仍然脆弱且难以解释。我们提出能量正则化空间掩码（Energy-Regularized Spatial Masking, ERSM），这是一个将特征选择重新表述为可微分能量最小化问题的新颖框架。通过在标准卷积主干中嵌入轻量级的能量掩码层，每个视觉标记被分配一个由两种竞争力量构成的标量能量：内在的一元重要性成本与成对空间一致性惩罚。与先前强制刚性稀疏预算或依赖启发式重要性分数的剪枝方法不同，ERSM允许网络针对每个输入自主发现最优的信息密度平衡点。我们在卷积架构上验证了ERSM，证明其能够产生涌现稀疏性、提升对结构化遮挡的鲁棒性以及生成高度可解释的空间掩码，同时保持分类准确性。此外，我们发现在基于删除的鲁棒性测试中，所学得的能量排序显著优于基于幅度的剪枝方法，这揭示了ERSM作为一种内在去噪机制，能够在无需像素级监督的情况下分离出语义对象区域。

Abstract

Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.

Keywords: Energy-Regularized Spatial Masking, ERSM, convolutional neural networks, feature selection, energy minimization, sparsity, robustness, interpretability
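The abstract describes an energy built from a unary importance cost and a pairwise spatial coherence penalty. The sketch below shows one common way such an energy can be written for a 2D mask; the exact functional form is not given in the abstract, so the linear unary term, the squared-difference pairwise term over 4-neighbours, and the weight `lam` are all assumptions, and `mask_energy` is a hypothetical name.

```python
def mask_energy(mask, unary, lam=1.0):
    """Total energy of a soft spatial mask on an H x W grid.

    unary[i][j] is the cost of keeping location (i, j); the pairwise term
    penalizes neighbouring locations whose mask values disagree.
    """
    height, width = len(mask), len(mask[0])
    unary_term = sum(
        unary[i][j] * mask[i][j] for i in range(height) for j in range(width)
    )
    pairwise_term = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:  # vertical neighbour
                pairwise_term += (mask[i][j] - mask[i + 1][j]) ** 2
            if j + 1 < width:   # horizontal neighbour
                pairwise_term += (mask[i][j] - mask[i][j + 1]) ** 2
    return unary_term + lam * pairwise_term


costs = [[0.5, 0.5], [0.5, 0.5]]
coherent = mask_energy([[1.0, 1.0], [0.0, 0.0]], costs)      # one clean boundary
fragmented = mask_energy([[1.0, 0.0], [0.0, 1.0]], costs)    # checkerboard
```

Both masks keep the same number of locations, so the unary terms match; the spatially coherent mask has the lower energy because fewer neighbouring pairs disagree.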

222. ❌ Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

Authors: Sambit Tarai, Ashish Chauhan, Elin Lundström, Johan Öfverstedt, Therese Sjöholm, Veronica Sanchez Rodriguez, Håkan Ahlström, Joel Kullberg
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06885v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

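Score rationale: The paper uses deep learning (ResNet-50) for medical image analysis to predict overall survival in patients with non-small cell lung cancer, an application of AI in biomedicine. It does not involve any large language model (LLM) techniques, such as pre-training, fine-tuning, inference optimization, or agents. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is an application of AI in biomedicine (cancer prognosis); this is not its core contribution, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study develops a deep learning framework based on FDG-PET/CT images and clinical data to predict time-dependent overall survival in patients with non-small cell lung cancer. The model that incorporates temporal data improves AUC by 4.3% over the baseline.

Abstract (Chinese translation)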

目的：基于医学影像的临床结局（如总生存期）自动化预测在改善患者预后和个性化治疗规划方面具有巨大潜力。我们开发了一种深度回归框架，以组织层面的FDG-PET/CT投影图像作为输入，并结合代表标量时间范围（以天为单位）的时序输入，用于预测非小细胞肺癌（Non-Small Cell Lung Cancer, NSCLC）患者的总生存期。
方法：该框架采用ResNet-50主干网络处理输入图像并生成相应的图像嵌入向量。随后，这些嵌入向量与时间数据结合，生成随时间变化的总生存期概率函数，从而基于时间参数化预测结果。整个框架使用U-CAN队列（n = 556）进行开发，并在测试集（n = 292）上与基线方法进行比较评估。基线方法采用ResNet-50架构，仅处理图像输入，并在预设时间点（如2年或5年）提供总生存期预测。
结果：将时序数据与图像嵌入向量相结合的方法在预测总生存期方面显示出优势，其AUC较基线方法提升4.3%。所提出的模型在使用临床+IDP特征时表现出色，而影像模型与临床+IDP模型的集成取得了最佳整体性能（AUC=0.788），凸显了多模态输入的互补价值。该方法还能将患者风险分层为不同类别（高风险与低风险）。显著性分析生成的热图显示，肿瘤区域是预测的关键结构。
结论：我们的方法提供了一个自动化框架，能够预测随时间变化的总生存期，并证明了结合影像与表格数据可改善生存预测的潜力。

Abstract

Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC).
Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year.
Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction.
Conclusion: Our method provided an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.

Keywords: Non-Small Cell Lung Cancer, FDG-PET/CT, survival prediction, deep regression, ResNet-50, temporal data, multimodal inputs, risk stratification
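Combining an image embedding with a scalar time horizon to get a survival probability as a function of time can be sketched as below. This is a toy stand-in rather than the paper's model: it assumes a single linear layer with a sigmoid output in place of the deep regression head, and the weights, the normalization constant `max_days`, and the function name are all hypothetical.

```python
import math


def survival_probability(embedding, horizon_days, weights, time_weight, bias,
                         max_days=3650.0):
    """Probability of being alive at `horizon_days`, given an image embedding.

    The embedding and the normalized time horizon feed one linear layer
    followed by a sigmoid (an illustrative simplification).
    """
    normalized_time = horizon_days / max_days
    logit = sum(w * x for w, x in zip(weights, embedding))
    logit += time_weight * normalized_time + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up embedding and parameters; the negative time weight makes the
# predicted survival probability fall as the horizon grows.
embedding = [0.2, -0.1]
p_2_years = survival_probability(embedding, 730, [1.0, 0.5], -4.0, 1.0)
p_5_years = survival_probability(embedding, 1825, [1.0, 0.5], -4.0, 1.0)
```

Because the horizon is an input rather than a fixed training target, the same toy model can be queried at any time point instead of only at pre-specified intervals.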

223. ❌ SCT-MOT: Enhancing Air-to-Air Multiple UAVs Tracking with Swarm-Coupled Motion and Trajectory Guidance

Authors: Zhaochen Chu, Tao Song, Ren Jin, Shaoming He, Defu Lin, Siqing Cheng
Source: arXiv
Published: 2026-04-08
arXiv link: http://arxiv.org/abs/2604.06883v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

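Scoring rationale: The paper addresses multi-object tracking (MOT) of UAV swarms and proposes a tracking framework that combines swarm motion modeling with trajectory-guided feature fusion. It is a computer-vision and robotics application and is unrelated to every scoring keyword, all of which concern large models, deep learning techniques, and their scientific applications. It does not cover large models, language models, training methods, inference techniques, alignment, compression, agent systems, or AI for Science.

!!! tip deepseek-chat TL;DR

To address trajectory fragmentation and identity switches caused by complex nonlinear group motion and weak visual cues in air-to-air UAV swarm tracking, the paper proposes the SCT-MOT framework. Using a swarm motion-aware trajectory prediction module and a trajectory-guided spatio-temporal feature fusion module, it achieves better tracking accuracy and robustness than existing methods on multiple datasets.

Abstract (Chinese translation)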

针对无人机集群的空对空跟踪，由于群体运动呈现复杂的非线性特征且小型目标视觉线索微弱，常导致检测失败、轨迹断裂与身份切换等问题。现有方法虽尝试通过轨迹预测提升性能，但通常独立建模各目标运动，忽视了集群层面的运动关联性。同时，运动预测与外观表征的有限整合削弱了视觉模糊杂乱场景下跟踪所需的时空一致性，导致难以维持连贯轨迹与可靠关联。为解决这些挑战，我们提出SCT-MOT跟踪框架，该框架集成了集群耦合运动建模与轨迹引导特征融合机制。首先，我们设计了集群运动感知轨迹预测模块，该模块从集群层面联合建模历史轨迹与姿态感知外观特征，从而更精准地预测非线性耦合的群体轨迹。其次，我们构建了轨迹引导时空特征融合模块，将预测位置与历史视觉线索对齐，并深度融合至当前帧特征中，增强弱目标的时间一致性与空间判别力。在AIRMOT、MOT-FLY和UAVSwarm三个公开空对空集群无人机跟踪数据集上的大量实验表明：当集成至同一多目标跟踪框架时，SMTP模块相比当前最先进的轨迹预测模块EqMotion实现了更精确的轨迹预测，并将IDF1指标提升了1.21%。总体而言，在复杂集群场景下，我们的SCT-MOT在多项评估指标上均持续优于现有先进跟踪器，展现出更优的准确性与鲁棒性。

摘要 (Abstract)

Air-to-air tracking of swarm UAVs presents significant challenges due to the complex nonlinear group motion and weak visual cues for small objects, which often cause detection failures, trajectory fragmentation, and identity switches. Although existing methods have attempted to improve performance by incorporating trajectory prediction, they model each object independently, neglecting the swarm-level motion dependencies. Their limited integration between motion prediction and appearance representation also weakens the spatio-temporal consistency required for tracking in visually ambiguous and cluttered environments, making it difficult to maintain coherent trajectories and reliable associations. To address these challenges, we propose SCT-MOT, a tracking framework that integrates Swarm-Coupled motion modeling and Trajectory-guided feature fusion. First, we develop a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, enabling more accurate forecasting of the nonlinear, coupled group trajectories. Second, we design a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current frame features, enhancing temporal consistency and spatial discriminability for weak objects. Extensive experiments on three public air-to-air swarm UAV tracking datasets, including AIRMOT, MOT-FLY, and UAVSwarm, demonstrate that SMTP achieves more accurate trajectory forecasts and yields a 1.21% IDF1 improvement over the state-of-the-art trajectory prediction module EqMotion when integrated into the same MOT framework. Overall, our SCT-MOT consistently achieves superior accuracy and robustness compared to state-of-the-art trackers across multiple metrics under complex swarm scenarios.

Keywords: Swarm UAV Tracking, Multiple Object Tracking, Trajectory Prediction, Spatio-Temporal Feature Fusion, Air-to-Air Tracking, Swarm-Coupled Motion, Trajectory Guidance, Visual Tracking
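IDF1, the metric cited in the abstract, is the harmonic mean of identification precision and recall: IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The sketch below computes it from identity-level counts. The counts are made up for illustration and are not results from the paper.

```python
def idf1(idtp, idfp, idfn):
    """Identification F1 from identity-level true/false positives and false negatives."""
    denom = 2 * idtp + idfp + idfn
    # Guard against an empty sequence with no detections and no ground truth.
    return 2 * idtp / denom if denom else 0.0


# Illustrative counts only: 800 correctly identified detections,
# 100 identity false positives, 100 identity false negatives.
example = idf1(idtp=800, idfp=100, idfn=100)
```

Identity switches lower IDTP and raise both IDFP and IDFN, which is why IDF1 is sensitive to the fragmentation problem the paper targets.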

224. ❌ Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI

Authors: Fangmao Ju, Yuzhu He, Zhiwen Xue, Chunfeng Lian, Jianhua Ma Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06849v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
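The rows above are consistent with a per-keyword score equal to weight × relevance, which would explain why a relevance of 10.0/10 still contributes 0.0 when the listed weight is 0.0. The sketch below works through that arithmetic. The multiplication rule and the function names are assumptions; the per-keyword cap of 15 and the passing-score formula (5.0 + 0.8 × total weight) come from the report's configuration section.

```python
def keyword_score(weight, relevance, max_per_keyword=15.0):
    """Assumed rule: weight times relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, max_per_keyword)


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold as stated in the report configuration."""
    return base + factor * total_weight


# (weight, relevance) pairs for the three non-zero relevance rows of this table.
rows = [(0.0, 5.0), (0.0, 5.0), (0.0, 10.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = passing_score(27.0)  # 27 keywords with weight 1.0 each
```

Under these assumptions the paper totals 0.0 against a threshold of 26.6, matching the score line for this entry.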

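Scoring rationale: The paper studies accelerated medical imaging (MRI) and proposes PASS, an intelligent framework that combines a vision-language model (VLM) with a deep unrolling network. Its core is an application of AI to science (medical imaging), so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The paper uses a pretrained VLM, which involves the concept of pre-training, though only as general deep learning model pre-training, so "Pre-training OR Continual Pre-training OR Domain Adaptation" receives 5 points. Its emphasis on an interpretable, physics-aware network is partly related to "Mechanistic Interpretability OR Explainable AI" (5 points). The remaining keywords concern techniques specific to large language models (LLMs), such as MoE, SFT, RAG, and reasoning methods, whereas the paper uses a VLM for medical imaging rather than a general-purpose LLM, so they receive 0 points.

!!! tip deepseek-chat TL;DR

To address long MRI acquisition times, the paper proposes PASS, a deep unrolling network framework guided by a vision-language model. It enables personalized, task-oriented fast MRI and significantly improves image quality and downstream diagnostic performance.

Abstract (Chinese translation)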

磁共振成像（MRI）是医学与医疗保健领域的基石技术，但其数据采集时间较长。传统加速MRI方法以通用图像质量为优化目标，缺乏针对特定临床任务的适应性。为此，我们提出PASS（个性化、异常感知采样与重建框架），这是一种智能MRI框架，其利用视觉语言模型（VLM）引导深度展开网络，实现面向任务的快速成像。PASS通过三个核心贡献动态个性化成像流程：（1）基于物理MRI模型构建的深度展开重建网络；（2）生成患者特异性$k$空间轨迹的采样模块；（3）从预训练VLM中提取的异常感知先验信息，该先验引导采样与重建聚焦于临床相关区域。通过将VLM的高层临床推理与可解释、物理感知的网络相结合，PASS在多种解剖结构、对比度、异常类型及加速因子下均实现了更优的图像质量。这一提升直接转化为下游诊断任务的性能改进，包括细粒度异常检测、定位与诊断。

摘要 (Abstract)

Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific $k$-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.

Keywords: Vision-Language Model, Deep Unrolling Network, Personalized MRI, Accelerated MRI, Anomaly-aware Prior, Task-oriented Imaging, Physics-based Reconstruction, Clinical Reasoning
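The physics-based unrolling in contribution (1) of the abstract can be pictured with the generic single-coil formulation below. It is a textbook sketch rather than the PASS network itself: the symbols $M$ (binary $k$-space sampling mask), $F$ (Fourier transform), $y$ (measured $k$-space data), $R$ (regularizer), $\eta_k$ (step size), and the learned operator $\mathrm{prox}_{\theta_k}$ are assumptions introduced here.

```latex
% Regularized reconstruction from undersampled k-space data
\min_{x}\; \tfrac{1}{2}\,\lVert M F x - y \rVert_2^2 + \lambda\, R(x)

% One unrolled iteration: a data-consistency gradient step followed by a learned proximal step
x^{(k+1)} = \mathrm{prox}_{\theta_k}\!\Big( x^{(k)} - \eta_k\, F^{H} M^{\top} \big( M F x^{(k)} - y \big) \Big),
\qquad k = 0, \dots, K-1
```

In this picture a patient-specific sampling module would change $M$, while an anomaly-aware prior could condition the learned proximal step. How PASS actually injects the prior is not specified in the abstract.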

Authors: Dewei Zhou, You Li, Zongxin Yang, Yi Yang Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06870v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

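Scoring rationale: The paper focuses on image editing in computer vision and proposes RefineAnything, a multimodal diffusion-based region refinement method that restores local detail while keeping the background unchanged. Its core techniques (diffusion models, a region crop-and-paste-back strategy, and a boundary consistency loss) belong to image generation and editing. All scoring keywords target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, inference acceleration, and agents) or specific scientific applications (such as bioinformatics). The paper involves no language models, training or fine-tuning methods, inference techniques, agent systems, or scientific AI applications, so it is unrelated to every keyword and scores 0 on all of them.

!!! tip deepseek-chat TL;DR

The paper proposes RefineAnything, a multimodal diffusion-based region refinement method. Using a Focus-and-Refine strategy and a boundary consistency loss, it restores local image detail while strictly preserving the background, and it performs strongly on the RefineEval benchmark.

Abstract (Chinese translation)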

我们提出区域特异性图像精细化作为一个专门的问题设定：给定输入图像和用户指定区域（例如涂鸦掩码或边界框），目标是在严格保持所有未编辑像素不变的同时恢复细粒度细节。尽管图像生成领域进展迅速，现代模型仍常遭受局部细节坍塌问题（如扭曲的文本、标识和纤细结构）。现有的指令驱动编辑模型侧重于粗粒度语义编辑，往往忽略细微局部缺陷或无意中改变背景，尤其当感兴趣区域仅占固定分辨率输入的极小部分时。我们提出RefineAnything，这是一个基于多模态扩散的精细化模型，同时支持基于参考和无参考的精细化。基于一个反直觉的观察——在固定VAE输入分辨率下，裁剪并重采样能显著改善局部重建效果，我们提出聚焦-精细化策略（Focus-and-Refine）。这种区域聚焦的精细化-粘贴回策略通过将分辨率预算重新分配给目标区域，提升了精细化效果与效率，同时采用混合掩码粘贴回机制确保严格的背景保持。我们进一步引入边界感知的边界一致性损失函数，以减少接缝伪影并提升粘贴回的自然度。为支持这一新设定，我们构建了Refine-30K数据集（包含2万个基于参考样本和1万个无参考样本），并提出RefineEval基准测试，用于同步评估编辑区域保真度与背景一致性。在RefineEval测试中，RefineAnything相较于竞争基线模型实现了显著提升，并达到近乎完美的背景保持效果，为高精度局部精细化建立了实用解决方案。项目页面：https://limuloo.github.io/RefineAnything/。

摘要 (Abstract)

We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.

Keywords: image refinement, region-specific editing, diffusion models, local detail restoration, background preservation, Focus-and-Refine, boundary consistency loss, multimodal refinement
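The crop, refine, and paste-back flow described in the abstract can be sketched on plain nested lists. This is a minimal illustration under stated assumptions: the helper names and the `(top, left, bottom, right)` box layout are hypothetical, the resize step and the diffusion model are omitted, and the refined crop is simply passed in.

```python
def crop(image, box):
    """Cut the target region out of a 2D image given (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def paste_back(image, refined, mask, box):
    """Blend a refined crop into a copy of the image.

    mask values lie in [0, 1]; pixels with mask 0 and all pixels outside
    the box are copied unchanged, so the background is preserved exactly.
    """
    top, left, bottom, right = box
    out = [row[:] for row in image]
    for i in range(bottom - top):
        for j in range(right - left):
            m = mask[i][j]
            if m > 0.0:
                original = image[top + i][left + j]
                out[top + i][left + j] = m * refined[i][j] + (1.0 - m) * original
    return out


# Toy 4x4 image; the "refiner" here is a stand-in that brightens the crop.
image = [[0.0] * 4 for _ in range(4)]
box = (1, 1, 3, 3)
refined = [[p + 1.0 for p in row] for row in crop(image, box)]
mask = [[1.0, 0.5], [0.0, 1.0]]
result = paste_back(image, refined, mask, box)
```

Skipping pixels where the mask is zero, rather than blending them with weight zero, is what makes the background bit-identical instead of merely close.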

226. ❌ CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery

Authors: Jiajun Yang, Keyan Chen, Zhengxia Zou, Zhenwei Shi Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

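Scoring rationale: The paper, CloudMamba, focuses on cloud detection in remote sensing imagery and proposes a dual-scale Mamba network built on a CNN-Mamba hybrid architecture, together with an uncertainty-guided two-stage detection strategy. At its core it is an image segmentation task in computer vision that uses deep learning (CNN and Mamba architectures), but it does not touch on large language models (LLMs), model training techniques (such as pre-training, fine-tuning, and alignment), inference optimization, agent systems, or model interpretability. The only related keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because remote sensing image analysis can be seen as an AI application in science (earth science, environmental monitoring). Since the paper does not mention bioinformatics or cheminformatics, it receives 5 points (partly related). The other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes CloudMamba, a deep learning framework that uses an uncertainty-guided two-stage strategy and a dual-scale Mamba network to handle ambiguous thin-cloud regions, fragmented clouds, and boundary details in remote sensing cloud detection. It achieves higher segmentation accuracy and efficiency on public datasets.

Abstract (Chinese translation)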

遥感影像中的云检测是一项基础、关键且极具挑战性的问题。现有的基于深度学习的云检测方法通常将其表述为单阶段逐像素二值分割任务，通过一次前向传播完成。然而，此类单阶段方法在薄云区域存在模糊性和不确定性，且难以精确处理破碎云和边界细节。本文提出了一种新颖的深度学习框架，称为CloudMamba。为解决薄云区域的模糊性问题，我们引入了一种不确定性引导的两阶段云检测策略。提出了一个嵌入式不确定性估计模块，用于自动量化薄云分割的置信度，并引入第二阶段精细化分割以提升低置信度困难区域的准确性。为了更好地处理破碎云和细粒度边界细节，我们设计了一种基于CNN-Mamba混合架构的双尺度Mamba网络。与具有二次计算复杂度的基于Transformer的模型相比，所提方法在保持线性计算复杂度的同时，能有效捕捉云的大尺度结构特征和小尺度边界细节，从而实现整体云形态的准确勾勒和边界的精确分割。在GF1_WHU和Levir_CS公开数据集上进行的大量实验表明，所提方法在多个分割精度指标上均优于现有方法，同时具备高效率和过程透明性。我们的代码发布于https://github.com/jayoungo/CloudMamba。

摘要 (Abstract)

Cloud detection in remote sensing imagery is a fundamental, critical, and highly challenging problem. Existing deep learning-based cloud detection methods generally formulate it as a single-stage pixel-wise binary segmentation task with one forward pass. However, such single-stage approaches exhibit ambiguity and uncertainty in thin-cloud regions and struggle to accurately handle fragmented clouds and boundary details. In this paper, we propose a novel deep learning framework termed CloudMamba. To address the ambiguity in thin-cloud regions, we introduce an uncertainty-guided two-stage cloud detection strategy. An embedded uncertainty estimation module is proposed to automatically quantify the confidence of thin-cloud segmentation, and a second-stage refinement segmentation is introduced to improve the accuracy in low-confidence hard regions. To better handle fragmented clouds and fine-grained boundary details, we design a dual-scale Mamba network based on a CNN-Mamba hybrid architecture. Compared with Transformer-based models with quadratic computational complexity, the proposed method maintains linear computational complexity while effectively capturing both large-scale structural characteristics and small-scale boundary details of clouds, enabling accurate delineation of overall cloud morphology and precise boundary segmentation. Extensive experiments conducted on the GF1_WHU and Levir_CS public datasets demonstrate that the proposed method outperforms existing approaches across multiple segmentation accuracy metrics, while offering high efficiency and process transparency. Our code is available at https://github.com/jayoungo/CloudMamba.

Keywords: Cloud Detection, Remote Sensing Imagery, Mamba Network, Uncertainty Estimation, Two-stage Segmentation, CNN-Mamba Hybrid, Dual-scale Architecture, Pixel-wise Segmentation
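The two-stage idea in the abstract (estimate confidence, then refine only the low-confidence pixels) can be sketched as follows. The paper describes an embedded, learned uncertainty module; this sketch swaps in plain binary entropy as the uncertainty measure, and the threshold value and function names are assumptions.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Entropy in bits of a Bernoulli(p) prediction; 1.0 at p = 0.5, near 0 at p = 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def low_confidence_mask(prob_map, threshold=0.8):
    """Mark pixels whose first-stage cloud probability is too uncertain."""
    return [[binary_entropy(p) > threshold for p in row] for row in prob_map]


def two_stage(prob_map, refine_fn, threshold=0.8):
    """Apply the second-stage refiner only where the first stage is uncertain."""
    mask = low_confidence_mask(prob_map, threshold)
    return [
        [refine_fn(p) if m else p for p, m in zip(prow, mrow)]
        for prow, mrow in zip(prob_map, mask)
    ]


# Toy first-stage output; the refiner is a stand-in that hardens the decision.
first_stage = [[0.95, 0.55], [0.50, 0.02]]
refined = two_stage(first_stage, lambda p: 1.0 if p >= 0.5 else 0.0)
```

Confident pixels pass through untouched, so the second stage spends its capacity only on ambiguous regions such as thin cloud.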

Authors: Donghyeon Kwon, Taegyu Park, Suha Kwak Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06825v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

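Scoring rationale: The paper studies semi-supervised learning for LiDAR semantic segmentation, focusing on pseudo-label refinement and training strategy, and belongs to computer vision and autonomous driving. The scoring keywords all concern large models, deep learning techniques, or AI applications in science, but the paper involves no large-model techniques (such as LLMs, MoE, and RLHF), model optimization techniques (such as quantization and inference acceleration), or scientific AI applications (such as bioinformatics). It is unrelated to every keyword, so all keywords score 0.

!!! tip deepseek-chat TL;DR

To address error propagation and confirmation bias caused by noisy pseudo-labels in LiDAR semantic segmentation, the paper proposes the RePL framework, which identifies and corrects pseudo-label errors through masked reconstruction. It achieves state-of-the-art performance on the nuScenes-lidarseg and SemanticKITTI datasets.

Abstract (Chinese translation)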

激光雷达语义分割的半监督学习方法常因噪声伪标签导致的误差传播与确认偏误问题而受限。为应对这一长期存在的挑战，本文提出RePL框架，该框架通过掩码重建机制识别并修正伪标签中的潜在错误，从而显著提升伪标签质量，并配合专门设计的训练策略。我们进一步通过理论分析证明了伪标签优化过程有效的条件，并在实验中验证该条件较为宽松且RePL框架明确满足该条件。在nuScenes-lidarseg和SemanticKITTI数据集上的大量实验表明，RePL能大幅提升伪标签质量，进而在激光雷达语义分割任务中取得了最先进的性能。

摘要 (Abstract)

Semi-supervised learning for LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo-labels. To tackle this chronic issue, we introduce RePL, a novel framework that enhances pseudo-label quality by identifying and correcting potential errors in pseudo-labels through masked reconstruction, along with a dedicated training strategy. We also provide a theoretical analysis demonstrating the condition under which the pseudo-label refinement is beneficial, and empirically confirm that the condition is mild and clearly met by RePL. Extensive evaluations on the nuScenes-lidarseg and SemanticKITTI datasets show that RePL substantially improves pseudo-label quality and, as a result, achieves the state of the art in LiDAR semantic segmentation.

Keywords: LiDAR semantic segmentation, semi-supervised learning, pseudo-label refinement, masked reconstruction, error propagation, confirmation bias, state-of-the-art

228. ❌ VGGT-SLAM++

Authors: Avilasha Mandal, Rajesh Kumar, Sudarshan Sunil Harithas, Chetan Arora Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06830v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

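Scoring rationale: The paper, VGGT-SLAM++, focuses on visual SLAM (simultaneous localization and mapping) in computer vision and robotics. Its core is improving the use of the Transformer-based visual geometry model (VGGT) in SLAM, involving visual odometry, digital elevation maps, and graph optimization. The scoring keywords all concern large models, deep learning techniques, and their scientific applications, but the paper does not address language models, training or fine-tuning methods, inference optimization, alignment, agent systems, or model compression, and it does not mention specific AI for Science fields such as bioinformatics or cheminformatics. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes VGGT-SLAM++, an improved visual SLAM system that combines the VGGT Transformer with a Sim(3) solution, DEM-based graph construction, and a spatially corrective back-end to achieve accurate large-scale mapping. It reaches state-of-the-art accuracy on standard benchmarks, substantially reducing short-term drift while maintaining global consistency.

Abstract (Chinese translation)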

我们推出VGGT-SLAM++，这是一个完整的视觉SLAM系统，它充分利用了视觉几何基础变换器（Visual Geometry Grounded Transformer，简称VGGT）输出的丰富几何信息。该系统包含一个融合VGGT前馈变换器与Sim(3)解算的视觉里程计（前端）、一个基于数字高程图（Digital Elevation Map，简称DEM）的图构建模块，以及一个后端，共同实现了内存占用可控的大规模精确建图。此前基于变换器的SLAM流程（如VGGT-SLAM）主要依赖稀疏回环检测或全局Sim(3)流形约束（这可能导致短时位姿漂移），而VGGT-SLAM++通过空间校正后端恢复了高频次的局部光束法平差（Local Bundle Adjustment，简称LBA）。对于每个VGGT子地图，我们构建一个稠密的平面规范DEM，将其划分为多个图块，并计算其DINOv2嵌入特征，从而将子地图整合到共视图（covisibility graph）中。在共视窗口内，通过视觉位置识别（Visual Place Recognition，简称VPR）模块检索空间邻近区域，触发频繁的局部优化以稳定轨迹。在标准SLAM基准测试中，VGGT-SLAM++实现了最先进的精度，显著减少了短期漂移，加速了图收敛，并通过紧凑的DEM图块和亚线性检索保持了全局一致性。

摘要 (Abstract)

We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end that jointly enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.

Keywords: visual SLAM, VGGT, transformer, digital elevation map, bundle adjustment, visual place recognition, pose drift, covisibility graph
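The Sim(3) solution mentioned in the abstract refers to similarity transforms, which map a point p to s·R·p + t with a scale s, a rotation R, and a translation t. The sketch below shows how two such transforms are applied and composed, as when chaining submap alignments. The tuple representation and function names are assumptions for illustration and are not taken from VGGT-SLAM++.

```python
def sim3_apply(s, R, t, p):
    """Apply a similarity transform: scale s, 3x3 rotation R, translation t, point p."""
    return [s * sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]


def sim3_compose(a, b):
    """Return the transform equivalent to applying b first, then a.

    a(b(p)) = sa*Ra*(sb*Rb*p + tb) + ta = (sa*sb)*(Ra*Rb)*p + (sa*Ra*tb + ta)
    """
    sa, Ra, ta = a
    sb, Rb, tb = b
    R = [[sum(Ra[i][k] * Rb[k][j] for k in range(3)) for j in range(3)] for i in range(3)]
    t = sim3_apply(sa, Ra, ta, tb)
    return (sa * sb, R, t)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

a = (2.0, ROT_Z_90, [1.0, 0.0, 0.0])
b = (0.5, IDENTITY, [0.0, 1.0, 0.0])
p = [1.0, 0.0, 0.0]

chained = sim3_apply(*a, sim3_apply(*b, p))
composed = sim3_apply(*sim3_compose(a, b), p)
```

The scale term is what separates Sim(3) from a rigid SE(3) pose: monocular reconstructions have no absolute scale, so aligning submaps needs it.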

229. ❌ Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Authors: Subin Park, Jung Uk Kim Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06824v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

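Scoring rationale: The paper's core is using the reasoning ability of multimodal large language models (MLLMs) for training-free sound source localization, so it is highly relevant to "Large Language Models" (10 points). The proposed GAR framework follows a meta-cognitive process of generation, analysis, and refinement, which relates to the reasoning keywords "Chain of Thought", "System 2 Thinking", and "Self-Correction" (8 points each). The paper is an application of AI in a scientific domain and is partly related to "AI for Science" (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes GAR, a training-free sound source localization framework based on the meta-cognitive reasoning of multimodal large language models. Through a three-stage generate-analyze-refine pipeline, it achieves competitive performance on single-source and multi-source benchmarks.

Abstract (Chinese translation)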

声源定位任务旨在通过利用音频与视觉模态间的关联性识别发声物体的位置。现有大多数声源定位方法依赖于基于对比学习的特征匹配，但缺乏显式推理与验证机制，限制了其在复杂声学场景中的有效性。受人类元认知过程的启发，我们提出一种无需训练的声源定位框架，该框架利用多模态大语言模型固有的推理能力。我们设计的生成-分析-优化流程包含三个阶段：生成阶段产生初始边界框与音频分类；分析阶段通过开放集角色标注与锚点投票量化视听一致性；优化阶段采用自适应门控机制以避免不必要的调整。在单声源与多声源基准测试上的大量实验表明，该方法具有竞争力的性能。源代码公开于 https://github.com/VisualAIKHU/GAR-SSL。

摘要 (Abstract)

Sound source localization task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.

Keywords: Sound Source Localization, Multimodal Large Language Models, Training-Free Framework, Meta-Reasoning, Audio-Visual Consistency, Generation-Analysis-Refinement, MLLMs, SSL
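The control flow of a generate, analyze, refine loop with a gate on refinement can be sketched as below. Everything here is a hypothetical stand-in: the abstract does not specify how anchor voting or adaptive gating are computed, so the agreement fraction, the fixed gate value, and the function names are assumptions chosen only to show where the gate sits.

```python
def anchor_vote(tags, audio_label):
    """Hypothetical consistency score: fraction of visual tags agreeing with the audio class."""
    if not tags:
        return 0.0
    return sum(1 for tag in tags if tag == audio_label) / len(tags)


def gar_step(box, tags, audio_label, refine_fn, gate=0.5):
    """Keep the generated box when audio and vision agree; otherwise refine it.

    Returns (box, consistency, was_refined).
    """
    consistency = anchor_vote(tags, audio_label)
    if consistency >= gate:
        return box, consistency, False  # gate closed: skip unnecessary adjustment
    return refine_fn(box), consistency, True


def shrink(box):
    """Stand-in refiner that tightens an (x1, y1, x2, y2) box by one pixel per side."""
    x1, y1, x2, y2 = box
    return (x1 + 1, y1 + 1, x2 - 1, y2 - 1)


kept = gar_step((10, 10, 50, 50), ["dog", "dog", "sofa"], "dog", shrink)
adjusted = gar_step((10, 10, 50, 50), ["sofa", "lamp", "dog"], "dog", shrink)
```

Gating the refinement means a box that is already consistent with the audio is never perturbed, which is the stated purpose of the Refinement stage.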

230. ❌ Enhanced Self-Supervised Multi-Image Super-Resolution for Camera Array Images

Authors: Yating Chen, Feng Huang, Xianyu Wu, Jing Wu, Ying Shen Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

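Scoring rationale: The paper addresses multi-image super-resolution in computer vision, proposing a self-supervised learning framework that combines Multi-to-Single and Multi-to-Multi with a dual Transformer network. Its content centers entirely on image processing, self-supervised learning, and Transformer architectures, and touches none of the keywords on large language models, their underlying techniques, or scientific applications. All scoring keywords concern LLM technology, alignment methods, reasoning mechanisms, agent systems, and similar topics, none of which bears directly on this paper's computer-vision focus.

!!! tip deepseek-chat TL;DR

For multi-image super-resolution of camera-array images, the paper proposes a self-supervised learning framework that combines Multi-to-Single and Multi-to-Multi together with a dual Transformer network, improving the visual quality and texture detail of restored images.

Abstract (Chinese translation)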

传统的多图像超分辨率方法（如连拍超分辨率与视频超分辨率）依赖于单一相机采集的序列帧，因此面临复杂的图像退化与严重遮挡问题，增加了图像精准复原的难度。相比之下，多孔径相机阵列成像通过空间分布的视角进行采集，其采样偏移构成稳定的盘状分布，从而增强了观测数据的非冗余性。现有的多图像超分辨率算法未能充分利用这些独特特性。有监督的多图像超分辨率方法容易对训练数据中的退化模式过拟合，而当前的自监督学习方法在恢复细粒度细节方面仍存在困难。为解决这些问题，本文深入研究了“多图像到单图像”与“多图像到多图像”两类自监督学习方法的优势、局限及适用边界。我们提出了“多图像到单图像引导的多图像到多图像自监督学习框架”，该框架结合了两种范式的优点，能够生成视觉效果好、保真度高且纹理细节丰富的图像。这一框架为深度融合神经网络与经典基于物理的变分方法提供了新范式。为提升多图像超分辨率网络从混叠伪影中恢复高频细节的能力，本文提出了一种适用于自监督学习的新型相机阵列超分辨率网络——双Transformer结构。在合成数据集与真实数据集上的实验验证了所提方法的优越性。

Abstract

Conventional multi-image super-resolution (MISR) methods, such as burst and video SR, rely on sequential frames from a single camera. Consequently, they suffer from complex image degradation and severe occlusion, increasing the difficulty of accurate image restoration. In contrast, multi-aperture camera-array imaging captures spatially distributed views with sampling offsets forming a stable disk-like distribution, which enhances the non-redundancy of observed data. Existing MISR algorithms fail to fully exploit these unique properties. Supervised MISR methods tend to overfit the degradation patterns in training data, and current self-supervised learning (SSL) techniques struggle to recover fine-grained details. To address these issues, this paper thoroughly investigates the strengths, limitations and applicability boundaries of multi-image-to-single-image (Multi-to-Single) and multi-image-to-multi-image (Multi-to-Multi) SSL methods. We propose the Multi-to-Single-Guided Multi-to-Multi SSL framework that combines the advantages of Multi-to-Single and Multi-to-Multi to generate visually appealing and high-fidelity images rich in texture details. The Multi-to-Single-Guided Multi-to-Multi SSL framework provides a new paradigm for integrating deep neural network with classical physics-based variational methods. To enhance the ability of MISR network to recover high-frequency details from aliased artifacts, this paper proposes a novel camera-array SR network called dual Transformer suitable for SSL. Experiments on synthetic and real-world datasets demonstrate the superiority of the proposed method.

Keywords: multi-image super-resolution, camera array, self-supervised learning, Transformer, Multi-to-Single, Multi-to-Multi, image restoration, high-fidelity images

231. ❌ Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Authors: Bohao Xing, Deng Li, Rong Gao, Xin Liu, Heikki Kälviäinen Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06783v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

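Scoring rationale: The paper is a computer-vision work proposing a Transformer-based dual-path network (the OG-ReG Transformer) for video understanding, aiming to balance computational efficiency against modeling of spatiotemporal correlations. Its content is unrelated to every scoring keyword, all of which revolve around large language models, their underlying techniques, and their scientific applications: it involves no large models, language models, alignment, reasoning, agents, compression, or scientific applications.

!!! tip deepseek-chat TL;DR

The paper proposes a dual-path Transformer network inspired by the human visual system (the OG-ReG Transformer) for video understanding. A Glance path extracts coarse-grained spatiotemporal information and a Gaze path supplements it with local details; the model reports state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48.

Abstract (Chinese translation)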

近年来，Transformer在各种视觉任务中取得了显著进展。为平衡视频任务中的计算量与效率，现有研究多依赖于因子化或基于窗口的自注意力机制。然而，这些方法割裂了视频中感兴趣区域间的时空关联，限制了模型捕捉运动信息与长程依赖的能力。本文认为，与人类视觉系统类似，时空信息的重要性在不同时间尺度上存在差异，注意力通过扫视与凝视行为在时间维度上稀疏分配。对时间与空间给予同等考量是否对视频任务的成功至关重要？基于这一认知，我们提出一种双路径网络——全局扫视与精细凝视（OG-ReG）Transformer。扫视路径提取粗粒度的整体时空信息，而凝视路径通过提供局部细节对扫视路径进行补充。我们的模型在Kinetics-400、Something-Something v2和Diving-48数据集上取得了最先进的结果，证明了其卓越性能。代码将在https://github.com/linuxsino/OG-ReG公开。

Abstract

Recently, Transformer has made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models’ ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on the Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.

Keywords: Transformer, Video Understanding, Spatiotemporal Attention, Human Visual System, Dual-path Network, Action Recognition, Computational Efficiency, State-of-the-art Performance

232. ❌ Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuchen Zhou, Xiaobo Xia, Yuanyu Wan, Lijun Zhang, Tat-Seng Chua Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06777v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
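
The rows above pair relevance ratings of up to 10.0/10 with a score of 0.0. The report does not state how a row's score is derived, so the following is only a minimal sketch under an assumed rule: score = weight × (relevance / 10) × per-keyword cap. The cap of 15 and the passing formula 5.0 + 0.8 × total weight come from the report's scoring settings; the function names and the multiplicative rule are hypothetical.

```python
# Sketch of an ASSUMED scoring rule; the report does not publish its formula.
MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report settings


def keyword_score(weight: float, relevance: float) -> float:
    """Assumed rule: weight x (relevance / 10) x per-keyword cap."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


def passing_score(total_weight: float) -> float:
    """Passing threshold as stated in the report settings."""
    return 5.0 + 0.8 * total_weight


# Non-zero relevance values copied from the table above (all other rows are 0.0).
relevances = [10.0, 10.0, 5.0, 5.0, 10.0, 10.0, 5.0]

# Weights as printed in the table (0.0) versus as listed in the configuration (1.0).
total_as_listed = sum(keyword_score(0.0, r) for r in relevances)
total_if_weight_one = sum(keyword_score(1.0, r) for r in relevances)

print(total_as_listed, total_if_weight_one, round(passing_score(27.0), 1))
```

Under this assumed rule a weight of 0.0 zeroes every row, which matches the 0.0 total shown. With the configured weight of 1.0 the same relevance values would sum to 82.5, above the 26.6 threshold, so the 0.0 weights in the table are worth checking against the configuration.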

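Scoring rationale: The paper studies reinforcement-learning optimization of Multimodal Large Language Models (MLLMs) on visual reasoning tasks and is highly relevant to four keywords: (1) 'Large Language Models' (it explicitly studies MLLMs); (2) 'Chain of Thought' (it works within the model's Multimodal Chain-of-Thought, MCoT); (3) 'LLM Agents' (it studies agentic reasoning trajectories and agentic policy optimization); and (4) 'Tool Use' (it involves invoking visual tools). It is moderately related to 'System 2 Thinking' and 'Self-Correction', since it studies multi-step reasoning and reducing execution errors, and partially related to 'Hallucination Mitigation', since it addresses the error accumulation caused by the reasoning-action discrepancy. Other keywords, such as MoE, quantization, and AI for science, are not covered.

!!! tip deepseek-chat TL;DR

To address the mismatch between textual reasoning and executed visual actions in Multimodal Large Language Models during visual reasoning, the paper proposes Multimodal Agentic Policy Optimization (MAPO), whose advantage estimate couples semantic alignment with the task reward; the authors report superior performance on multiple visual reasoning benchmarks.

Abstract (Chinese translation)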

多模态大语言模型（MLLMs）的最新进展鼓励模型在多轮推理过程中通过主动调用视觉工具来“借助图像思考”。目前常见的强化学习（RL）实践依赖于基于结果的奖励，却忽略了文本合理性往往掩盖了执行失败这一事实，这意味着模型在其代理推理轨迹中可能表现出直观的文本推理，同时却执行着不精确或不相关的视觉动作。这种推理与行动之间的差异引入了噪声，并在多轮推理过程中不断累积，严重削弱了模型的多模态推理能力，并可能导致训练崩溃。本文提出了多模态代理策略优化（Multimodal Agentic Policy Optimization, MAPO），旨在弥合模型在其多模态思维链（Multimodal Chain-of-Thought, MCoT）中产生的文本推理与视觉动作之间的差距。具体而言，MAPO要求模型对通过工具使用获得的视觉内容生成明确的文本描述。随后，我们采用一种新颖的优势估计方法，将上述描述与实际观察之间的语义对齐与任务奖励相结合。本文提供了理论分析以论证MAPO的基本原理，该方法本质上降低了梯度的方差，大量实验也证明我们的方法在多个视觉推理基准测试中取得了卓越的性能。

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model’s multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.
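
The abstract says the advantage estimate couples description-observation alignment with the task reward but gives no formula. The sketch below shows one way such a coupling could look, not MAPO itself: it assumes a multiplicative coupling and a GRPO-style group normalization, both of which are swapped-in assumptions, and all names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def coupled_advantages(task_rewards, alignment_scores, eps=1e-8):
    """Group-normalized advantages over alignment-scaled rewards (assumed form)."""
    # Assumption: a rollout's reward is discounted by how well its textual
    # descriptions matched what the visual tools actually returned.
    coupled = [r * a for r, a in zip(task_rewards, alignment_scores)]
    mu = mean(coupled)
    sigma = pstdev(coupled)
    return [(c - mu) / (sigma + eps) for c in coupled]


# Four hypothetical rollouts sampled for the same prompt.
task_rewards = [1.0, 1.0, 0.0, 0.0]      # 1.0 = correct final answer
alignment_scores = [1.0, 0.2, 0.9, 0.1]  # description-vs-observation agreement in [0, 1]

advantages = coupled_advantages(task_rewards, alignment_scores)
print([round(a, 3) for a in advantages])
```

In this toy group the second rollout answers correctly but misdescribes its tool output, so it ends up with a negative advantage, whereas an outcome-only reward would have treated it the same as the first rollout.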

Keywords: Multimodal Large Language Models, Multimodal Chain-of-Thought, Agentic Reasoning, Visual Tool Usage, Reinforcement Learning, Policy Optimization, Reasoning-Action Gap, Visual Reasoning Benchmarks

233. ❌ FlowInOne:Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Authors: Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

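Scoring rationale: The paper focuses on the visual generative model FlowInOne, a purely visual flow matching framework that unifies multimodal inputs as visual prompts to form an image-in, image-out generation pipeline. Although it applies large models to generation, its core is visual generation technology (flow matching, visual prompts, image generation) rather than language-model technology. The scoring keywords all target the principles, training methods, inference optimization, alignment, and application paradigms of language models (LLMs) and have no direct connection to this paper's approach, so every keyword is rated 0 for relevance.

!!! tip deepseek-chat TL;DR

The paper challenges the text-dominated paradigm of multimodal generation with FlowInOne, a framework that unifies all inputs (text, layouts, editing instructions) as visual prompts and performs image-in, image-out generation with a single flow matching model, reporting state-of-the-art performance on text-to-image generation, layout-guided editing, and visual instruction following.

Abstract (Chinese translation)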

长期以来，多模态生成领域一直由文本驱动的流程主导，其中语言支配视觉，却无法在其中进行推理或创造。我们通过质疑是否所有模态——包括文本描述、空间布局和编辑指令——都能统一为单一的视觉表征，来挑战这一范式。我们提出了FlowInOne框架，该框架将多模态生成重新定义为纯粹的视觉流，将所有输入转化为视觉提示，并构建了一个由单一流匹配模型控制的简洁“图像输入-图像输出”流程。这种以视觉为中心的表述方式，自然消除了跨模态对齐瓶颈、噪声调度以及任务特定的架构分支，将文本到图像生成、布局引导编辑和视觉指令跟随统一在一个连贯的范式之下。为此，我们引入了VisPrompt-5M，这是一个包含500万个视觉提示对的大规模数据集，涵盖了物理感知的力动力学和轨迹预测等多种任务；同时我们还构建了VP-Bench，这是一个精心策划的基准测试，用于评估指令遵循度、空间精度、视觉真实性和内容一致性。大量实验表明，FlowInOne在所有统一的生成任务中均实现了最先进的性能，超越了开源模型和具有竞争力的商业系统，为完全以视觉为中心的生成建模奠定了新的基础，在这种建模中，感知与创造共存于一个连续的视觉空间之内。

Abstract

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.

Keywords: FlowInOne, multimodal generation, flow matching, visual prompts, image-in image-out, vision-centric, VisPrompt-5M, VP-Bench

234. ❌ LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

Authors: Pedro Quesado, Erkut Akdag, Yasaman Kashefbahrami, Willem Menu, Egor Bondarev Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06740v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

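Scoring rationale: The paper studies real-time novel view synthesis in computer vision, covering multi-view video processing, 3D scene reconstruction, camera pose prediction, and diffusion-transformer interpolation. It concentrates entirely on visual representation learning and real-time system optimization and involves no large language models, none of the LLM techniques the keywords target, and no AI-for-science application. Because every scoring keyword concerns LLMs, their underlying techniques, or AI for Science, none relates directly to this paper's computer-vision topic, and all are rated 0 for relevance.

!!! tip deepseek-chat TL;DR

The paper proposes LiveStre4m, a feed-forward model for live-streaming novel view synthesis from unposed multi-view video. A multi-view vision transformer and a diffusion-transformer interpolation module yield real-time, temporally consistent novel-view video, with an average reconstruction time of 0.07 s per frame against roughly 2.67 s for optimization-based methods.

Abstract (Chinese translation)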

从未标定多视角视频进行实时直播式新视角合成在众多应用中仍是一个开放挑战。现有动态场景表征方法通常需要真实相机参数并涉及耗时的优化过程（约2.67秒），这使其不适用于直播场景。为解决该问题，我们提出一种新颖的视点视频直播方法（LiveStre4m），这是一种基于前馈网络的实时新视角合成模型，可直接处理未标定的稀疏多视角输入。LiveStre4m采用多视角视觉transformer进行关键帧三维场景重建，并结合扩散transformer插值模块确保时间一致性与稳定流传输。此外，我们提出相机姿态预测模块，可直接从RGB图像高效估计相机外参与内参，消除了对已知相机标定信息的依赖。该方法仅需两个同步的未标定输入流即可实现时间一致的实时新视角视频流传输。LiveStre4m在1024×768分辨率下达到平均每帧0.07秒的重建速度，在运行时间上比基于优化的动态场景表征方法快数个数量级。这些结果表明LiveStre4m使实时新视角合成流传输在实际应用中成为可能，标志着可部署的实时新视角合成系统迈出重要一步。代码发布于：https://github.com/pedro-quesado/LiveStre4m

Abstract

Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations (≈2.67 s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real-time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of 0.07 s per frame at 1024 × 768 resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: https://github.com/pedro-quesado/LiveStre4m
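
The two runtime figures quoted in the abstract can be put side by side with a few lines of arithmetic. This is only a worked calculation on the quoted numbers; it assumes the 2.67 s optimization time is comparable to the per-frame figure, which the abstract does not state explicitly.

```python
import math

optimization_seconds = 2.67  # quoted for optimization-based methods
feed_forward_seconds = 0.07  # quoted per frame for LiveStre4m at 1024 x 768

speedup = optimization_seconds / feed_forward_seconds
frames_per_second = 1.0 / feed_forward_seconds
orders_of_magnitude = math.log10(speedup)

print(f"speedup ~{speedup:.1f}x, ~{frames_per_second:.1f} frames/s, "
      f"10^{orders_of_magnitude:.2f}")
```

Under that assumption the two figures on their own give a ratio of about 38x, and 0.07 s per frame corresponds to roughly 14 frames per second.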

Keywords: novel view synthesis, live streaming, multi-view video, feed-forward model, camera pose prediction, diffusion-transformer, real-time reconstruction, unposed inputs

235. ❌ DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting

Authors: Hantang Li, Qiang Zhu, Xiandong Meng, Debin Zhao, Xiaopeng Fan Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

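Scoring rationale: The paper "DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting" addresses 3D reconstruction in computer vision, specifically the overfitting and artifacts of 3D Gaussian Splatting (3DGS) under sparse views. Its core contribution is a dual-domain observation and calibration framework (DOC-GS) that models and corrects the reliability of Gaussian primitives through optimization-domain and observation-domain methods. The scoring keywords all relate to large language models (LLMs), their underlying techniques, or AI applications in science, whereas this paper studies a specific reconstruction technique in 3D vision and involves none of them. All keywords are therefore rated 0 for relevance.

!!! tip deepseek-chat TL;DR

For the overfitting and artifacts that insufficient geometric supervision causes in sparse-view 3D Gaussian Splatting, the paper proposes a dual-domain observation and calibration framework (DOC-GS). It combines a Continuous Depth-Guided Dropout strategy in the optimization domain with Dark Channel Prior evidence accumulation in the observation domain to identify and prune unreliable Gaussian primitives, improving reconstruction reliability and quality.

Abstract (Chinese translation)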

基于三维高斯泼溅（3DGS）的稀疏视角重建由于几何监督不足，本质上是一个不适定问题，常导致严重的过拟合以及结构畸变和半透明雾状伪影的出现。现有方法试图通过基于随机丢弃的正则化缓解此问题，但这些方法大多具有启发性，且缺乏对伪影形成机制的统一理解。本文从一个新视角重新审视稀疏视角3DGS重建，将核心挑战归结为高斯基元可靠性的不可观测性。不可靠的高斯分布在优化过程中约束不足，并在渲染图像中累积形成雾状退化。基于这一观察，我们提出了一个统一的双域观测与校准框架（DOC-GS），通过优化域归纳偏置与观测域证据的协同作用，对高斯可靠性进行建模与校正。具体而言，在优化域中，我们通过每个基元在训练过程中受约束的程度来表征高斯可靠性，并通过连续深度引导随机丢弃策略（CDGD）实例化该信号——其中丢弃概率作为基元可靠性的显式代理。这施加了平滑的深度感知归纳偏置，以抑制弱约束高斯分布并提升优化稳定性。在观测域中，我们建立了漂浮伪影与大气散射之间的联系，并利用暗通道先验（Dark Channel Prior, DCP）作为结构一致性线索来识别并累积异常区域。基于跨视角聚合的证据，我们进一步设计了可靠性驱动的几何剪枝策略，以移除低置信度的高斯分布。

Abstract

Sparse-view reconstruction with 3D Gaussian Splatting (3DGS) is fundamentally ill-posed due to insufficient geometric supervision, often leading to severe overfitting and the emergence of structural distortions and translucent haze-like artifacts. While existing approaches attempt to alleviate this issue via dropout-based regularization, they are largely heuristic and lack a unified understanding of artifact formation. In this paper, we revisit sparse-view 3DGS reconstruction from a new perspective and identify the core challenge as the unobservability of Gaussian primitive reliability. Unreliable Gaussians are insufficiently constrained during optimization and accumulate as haze-like degradations in rendered images. Motivated by this observation, we propose a unified Dual-domain Observation and Calibration (DOC-GS) framework that models and corrects Gaussian reliability through the synergy of optimization-domain inductive bias and observation-domain evidence. Specifically, in the optimization domain, we characterize Gaussian reliability by the degree to which each primitive is constrained during training, and instantiate this signal via a Continuous Depth-Guided Dropout (CDGD) strategy, where the dropout probability serves as an explicit proxy for primitive reliability. This imposes a smooth depth-aware inductive bias to suppress weakly constrained Gaussians and improve optimization stability. In the observation domain, we establish a connection between floater artifacts and atmospheric scattering, and leverage the Dark Channel Prior (DCP) as a structural consistency cue to identify and accumulate anomalous regions. Based on cross-view aggregated evidence, we further design a reliability-driven geometric pruning strategy to remove low-confidence Gaussians.
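
The Dark Channel Prior that the abstract uses as a haze cue can be illustrated with the classical dark-channel computation: a per-pixel minimum over the color channels followed by a minimum over a local patch. The sketch below covers only that classical prior on a toy image, not the DOC-GS pipeline; the patch size, the toy colors, and the function names are assumptions.

```python
def dark_channel(image, patch=3):
    """image: rows of (r, g, b) tuples with values in [0, 1]; returns a 2D list."""
    height, width = len(image), len(image[0])
    radius = patch // 2
    # Step 1: minimum over the three color channels at every pixel.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum over a square patch (clipped at the image border).
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            dark[y][x] = min(min_rgb[j][i] for j in rows for i in cols)
    return dark


def add_haze(image, transmission=0.5, airlight=1.0):
    """Atmospheric scattering model: I = J * t + A * (1 - t)."""
    return [[tuple(c * transmission + airlight * (1.0 - transmission) for c in pixel)
             for pixel in row] for row in image]


# Toy 4 x 4 image of one saturated color, and a hazy copy of it.
clear = [[(0.9, 0.0, 0.2)] * 4 for _ in range(4)]
hazy = add_haze(clear)

clear_dark = dark_channel(clear)
hazy_dark = dark_channel(hazy)
print(clear_dark[0][0], hazy_dark[0][0])
```

The haze-free toy image has a dark channel of zero everywhere, while the hazy copy is lifted toward the airlight. That gap is the kind of signal a renderer-side check could accumulate across views to flag haze-like floaters.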

Keywords: 3D Gaussian Splatting, Sparse-view Reconstruction, Overfitting, Artifact Removal, Dual-domain Framework, Reliability Calibration, Dark Channel Prior, Geometric Pruning

236. ❌ Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

Authors: Jiahua Chen, Qihong Tang, Weinong Wang, Qi Fan Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06725v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

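Scoring rationale: The paper focuses on improving the 3D spatial understanding of Multimodal Large Language Models (MLLMs). Its core contribution is a training-free Visual Chain-of-Thought mechanism that strengthens spatial reasoning through 3D reconstruction and active viewpoint exploration. The highly relevant keywords are: (1) Large Language Models (10/10), as the paper directly studies MLLMs; (2) Chain of Thought (10/10), as it proposes a Visual Chain-of-Thought mechanism; (3) System 2 Thinking (8/10), as it involves multi-step reasoning and perspective simulation; (4) LLM Agents (8/10), as the framework explores autonomously; and (5) Tool Use (8/10), as it relies on an external knowledge base and 3D reconstruction tools. Other keywords, such as MoE, SLMs, Scaling Laws, and training methods, are unrelated to the paper and rated 0.

!!! tip deepseek-chat TL;DR

To address the weakness of Multimodal Large Language Models in complex 3D spatial reasoning, the paper proposes a training-free framework built on a Visual Chain-of-Thought and active 3D scene exploration; the authors report improved spatial understanding and results that surpass existing models on 3DSRBench and Rel3D.

Abstract (Chinese translation)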

尽管多模态大语言模型已取得显著进展，但由于依赖二维视觉先验，其在复杂三维空间推理方面仍面临挑战。现有方法通常通过在有限三维数据集上进行计算成本高昂的后训练，或借助缺乏显式几何理解与视角灵活性的僵化工具调用机制来缓解这一局限。为应对这些挑战，我们提出一种免训练的框架，引入基于显式三维重建的视觉思维链机制。该流程首先通过多模态大语言模型引导的多粒度关键词提取与掩码生成，从单张图像重建高保真三维网格；随后，框架借助外部知识库迭代计算最优相机外参并合成新视角，从而模拟人类视角采择能力。大量实验表明，所提方法显著提升了空间理解能力。具体而言，在3DSRBench和Rel3D等主流基准测试中，该框架的表现优于专用空间模型及通用多模态大语言模型（包括GPT-5.2和Gemini-2.5-Flash）。

Abstract

Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.
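
To make "camera extrinsic parameters" concrete, the sketch below builds a world-to-camera rotation and translation with a standard look-at construction and evaluates it for a few candidate viewpoints around a reconstructed object. It does not reproduce how the paper selects optimal views; the camera convention (x right, y down, z forward), the circular candidate set, and all names are assumptions.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def look_at_extrinsics(eye, target, world_up=(0.0, 1.0, 0.0)):
    """World-to-camera rotation R (as rows) and translation t.

    Camera axes: x right, y down, z forward. Degenerate if the viewing
    direction is parallel to world_up.
    """
    forward = normalize([t - e for t, e in zip(target, eye)])
    right = normalize(cross(forward, world_up))
    down = cross(forward, right)
    R = [right, down, forward]
    t = [-sum(r * e for r, e in zip(row, eye)) for row in R]
    return R, t


def to_camera(point, R, t):
    """Apply x_cam = R @ x_world + t."""
    return [sum(r * p for r, p in zip(row, point)) + ti for row, ti in zip(R, t)]


# Hypothetical candidate viewpoints on a circle of radius 5 around the object.
target = [0.0, 0.0, 0.0]
views = []
for k in range(4):
    angle = 2.0 * math.pi * k / 4
    eye = [5.0 * math.sin(angle), 0.0, 5.0 * math.cos(angle)]
    views.append(look_at_extrinsics(eye, target))
```

For every candidate the object center lands on the optical axis at depth 5 in camera coordinates, which is a quick sanity check before rendering a novel view from those extrinsics.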

Keywords: Multimodal Large Language Models, 3D spatial reasoning, Visual Chain-of-Thought, 3D reconstruction, active scene exploration, multi-perspective reasoning, training-free framework, viewpoint synthesis

237. ❌ Exploring 6D Object Pose Estimation with Deformation

Authors: Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06720v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

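Scoring rationale: The paper focuses on 6D object pose estimation in computer vision, specifically on dataset creation and benchmarking for deformed objects. It covers computer vision techniques such as 3D scanning, RGB-D data, pose annotation, and SLAM systems, and has no direct connection to any of the supplied keywords on large models and deep learning principles (LLMs, MoE, RLHF, RAG, and so on) or to the AI for Science keywords. It does not mention language models, model training, inference optimization, alignment techniques, or scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper introduces the DeSOPE dataset to address the performance degradation that object deformation causes in 6D object pose estimation, and shows experimentally that existing methods degrade markedly when objects are deformed.

Abstract (Chinese translation)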

我们提出DeSOPE——一个面向六自由度变形物体的大规模数据集。大多数六维物体姿态估计方法假设物体是刚性或关节式的，然而在实际应用中，由于磨损、碰撞或形变，物体会偏离其标准形状，这一假设往往失效。为建模此类情况，我们引入了DeSOPE数据集，该数据集包含26个常见物体类别的高保真三维扫描，每个类别均采集了一个标准形态和三种变形构型，并提供了与标准网格精确配准的三维注册数据。此外，数据集还包含一个RGB-D数据集，涵盖多样化场景下的13.3万帧图像，以及通过半自动流程生成的66.5万组姿态标注。我们首先为每个实例标注二维掩码，随后使用物体姿态估计方法计算初始姿态，通过物体级SLAM系统进行优化，最后经人工核验生成最终标注。我们对多种物体姿态估计方法进行了评估，发现其性能随形变程度增加而显著下降，这表明在实际应用中稳健处理此类形变至关重要。项目页面与数据集可通过https://desope-6d.github.io/访问。

Abstract

We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications. The project page and dataset are available at https://desope-6d.github.io/.

Keywords: 6D object pose estimation, deformed objects, dataset, RGB-D, 3D registration, SLAM, pose annotation, benchmark

238. ❌ Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency

Authors: Ke Jin, Jiming Chen, Qi Ye Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06713v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

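Scoring rationale: The paper studies semi-dense image matching in computer vision and proposes improvements targeting scale differences and local consistency. Its content is focused entirely on conventional computer vision techniques (feature matching, optical flow estimation, and so on) and involves no large language models, no innovations in deep learning principles, and no applications of AI to science. All scoring keywords concern large models, deep learning techniques, or AI for science, none of which relate to this paper's image matching research.

!!! tip deepseek-chat TL;DR

The paper addresses scale differences and local consistency in semi-dense image matching with an entropy-inspired scale-adaptive matching module and a flow-based local consistency refinement, which significantly improve matching performance.

Abstract (Chinese translation)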

近期半稠密图像匹配方法取得了显著成功，但两个长期存在的问题仍制约其性能。在粗匹配阶段，现有方法采用的互最近邻匹配层存在过度排除问题，导致其难以处理图像间存在尺度差异的情况。为此，我们系统性地重新审视了匹配机制，并发现分数矩阵中隐含的信息可用于指示尺度比例。基于这一观察，我们提出了一种尺度感知匹配模块，该模块在引入可忽略计算开销的同时展现出卓越的有效性。在精匹配阶段，我们指出现有方法忽视了最终匹配结果的局部一致性，这削弱了其鲁棒性。为此，我们不再独立预测每个源像素的对应关系，而是将精匹配阶段重新定义为级联光流优化问题，并引入一种新颖的梯度损失函数以增强流场的局部一致性。大量实验表明，结合上述改进的全新匹配流程，在下游任务中实现了鲁棒且精确的匹配性能。

Abstract

Recent semi-dense image matching methods have achieved remarkable success, but two long-standing issues still impair their performance. At the coarse stage, the over-exclusion issue of their mutual nearest neighbor (MNN) matching layer makes them struggle to handle cases with scale difference between images. To this end, we comprehensively revisit the matching mechanism and make a key observation that the hint concealed in the score matrix can be exploited to indicate the scale ratio. Based on this, we propose a scale-aware matching module which is exceptionally effective but introduces negligible overhead. At the fine stage, we point out that existing methods neglect the local consistency of final matches, which undermines their robustness. To this end, rather than independently predicting the correspondence for each source pixel, we reformulate the fine stage as a cascaded flow refinement problem and introduce a novel gradient loss to encourage local consistency of the flow field. Extensive experiments demonstrate that our novel matching pipeline, with these proposed modifications, achieves robust and accurate matching performance on downstream tasks.

Keywords: local feature matching, scale adaptability, local consistency, mutual nearest neighbor, flow refinement, semi-dense matching, image matching, computer vision
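
The over-exclusion issue that the abstract attributes to the mutual nearest neighbor (MNN) layer can be illustrated with a minimal sketch of plain MNN matching on a score matrix. This is the generic baseline technique only, not the paper's scale-aware module; the function name and the toy scores are assumptions made for illustration.

```python
from typing import List, Tuple


def mutual_nearest_neighbors(scores: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (row, col) pairs that are each other's best match.

    scores[i][j] is the similarity between coarse feature i of image A
    and coarse feature j of image B (hypothetical toy values here).
    """
    if not scores or not scores[0]:
        return []
    n_rows, n_cols = len(scores), len(scores[0])
    # Best column for every row, and best row for every column.
    best_col = [max(range(n_cols), key=lambda j: scores[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: scores[i][j]) for j in range(n_cols)]
    # Keep a pair only when the preference is mutual.
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]


# Toy case with a scale difference: row 0 of image A overlaps
# both column 0 and column 1 of image B.
toy_scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.3, 0.7],
]
matches = mutual_nearest_neighbors(toy_scores)
print(matches)
```

In the toy matrix, column 1 scores 0.8 against row 0 yet is dropped, because row 0 can keep only one partner. That one-to-one constraint is the kind of exclusion the abstract says hurts matching under scale change.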

239. ❌ RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

Authors: Hui Li, Peien Ding, Jun Li, Guoqi Ma, Zhanyu Liu, Ge Xu, Junfeng Yao, Jinsong Su Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

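Scoring rationale: The paper proposes the RASR framework for fake news video detection; its core contribution is retrieval-augmented semantic reasoning. Highly relevant keywords: 1) 'Retrieval-Augmented Generation' (10): the framework's name contains 'Retrieval-Augmented', and its core mechanism retrieves associative evidence to strengthen reasoning; 2) 'Large Language Models' (10): it uses an expert multimodal large language model to generate in-depth analysis reports; 3) 'Chain of Thought' and 'System 2 Thinking' (8 each): the framework emphasizes semantic reasoning and in-depth analysis, which fits multi-step reasoning and deliberate thinking; 4) 'Hallucination Mitigation' (8): fake news detection is fundamentally about improving factuality and reducing hallucination; 5) 'Pre-training' and 'Explainable AI' (5 each): it involves domain knowledge transfer and interpretable analysis reports. The remaining keywords are unrelated to the paper or not mentioned.

!!! tip deepseek-chat TL;DR

The study proposes RASR, a retrieval-augmented semantic reasoning framework that retrieves historical associative evidence and applies domain-knowledge-guided multimodal LLM reasoning, significantly improving the accuracy and cross-domain generalization of fake news video detection.

Abstract (Chinese translation)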

多模态虚假新闻视频检测是维护网络信息可信度的关键研究方向。现有研究主要通过构建多模态特征融合表征或利用预训练语言模型分析视频-文本一致性来验证内容真实性。然而，这些方法仍面临以下局限：(1) 缺乏跨实例的全局语义关联，难以有效利用历史关联证据验证当前视频；(2) 跨领域语义差异阻碍通用知识迁移，缺乏领域特定专家知识的指导。为此，我们提出了一种新颖的检索增强语义推理框架。首先，跨实例语义解析与检索器将视频解构为高层语义基元，并从动态记忆库中检索相关关联证据。随后，领域引导多模态推理模块引入领域先验知识，驱动专家级多模态大语言模型生成具有领域感知的深度分析报告。最后，多视图特征解耦与融合模块通过自适应门控机制整合多维特征，实现鲁棒的真实性判定。在FakeSV和FakeTT数据集上的大量实验表明，该框架显著优于现有先进基线方法，实现了优异的跨领域泛化能力，并将整体检测准确率最高提升0.93%。

Abstract

Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or utilizing pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) lacking cross-instance global semantic correlations, making it difficult to effectively utilize historical associative evidence to verify the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, lacking the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves the overall detection accuracy by up to 0.93%.

Keywords: fake news detection, multimodal analysis, retrieval-augmented generation, semantic reasoning, large language models, cross-domain generalization, video-text consistency, domain adaptation

240. ❌ 4D Vessel Reconstruction for Benchtop Thrombectomy Analysis

Authors: Ethan Nguyen, Javier Carmona, Arisa Matsuzaki, Naoki Kaneko, Katsushi Arisaka Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06671v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

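Scoring rationale: The paper belongs to medical imaging and biomechanical analysis. It uses multi-view cameras, 4D Gaussian Splatting, and a synthetic Blender pipeline for vessel reconstruction and stress analysis, and is unrelated to the vast majority of the large model and deep learning keywords. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves biomedical engineering and computational analysis and thus falls under AI applied to science. However, the paper does not explicitly use AI or machine learning for modeling or prediction and relies mainly on conventional computer vision and graphics techniques, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study develops a low-cost, multi-view 4D vessel reconstruction workflow for benchtop thrombectomy testing that provides standardized, time-resolved surface kinematics together with relative displacement and stress-proxy measurements, supporting comparisons between procedural conditions and methods validation.

Abstract (Chinese translation)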

引言：机械取栓术可能导致血管变形及手术相关损伤。台式模型广泛用于器械测试，但具有时间分辨率、全场三维血管运动测量能力的方法仍较为有限。方法：我们开发了一种基于九台相机的低成本多视角工作流程，用于硅胶大脑中动脉模型（2160p，20 fps）的台式取栓研究。多视角视频经过校准、分割，并通过四维高斯溅射法进行重建。重建后的点云被转换为固定连接性的边缘图，用于感兴趣区域位移追踪及基于相对表面的应力代指标计算。应力代指标值通过采用Neo-Hookean映射的边缘拉伸推导得出，并以比较性表面度量形式呈现。利用已知变形的合成Blender流程进行几何与时间验证。结果：在合成整体平移中，大多数边缘的应力代指标接近零（中位数≈0 MPa；第90百分位数为0.028 MPa），仅存在稀疏异常值。在合成牵拉（1-5 mm）中，重建结果与真实值在几何和时间上高度吻合，对称倒角距离为1.714-1.815 mm，在τ=1 mm时精度达0.964-0.972。在初步台式对比试验中（每种条件一次试验），颈段抽吸导管放置相较于颈内动脉末端放置显示出更高的最大-中位感兴趣区域位移及应力代指标值。结论：所提出的方案为取栓台式研究提供了标准化、具有时间分辨率的表面运动学数据，以及可比较的相对位移与应力代指标测量结果。该框架支持不同条件间的对比及方法验证，同时区别于绝对的血管壁应力估算。实施代码与示例数据详见https://ethanuser.github.io/vessel4D。

Abstract

Introduction: Mechanical thrombectomy can cause vessel deformation and procedure-related injury. Benchtop models are widely used for device testing, but time-resolved, full-field 3D vessel-motion measurements remain limited. Methods: We developed a nine-camera, low-cost multi-view workflow for benchtop thrombectomy in silicone middle cerebral artery phantoms (2160p, 20 fps). Multi-view videos were calibrated, segmented, and reconstructed with 4D Gaussian Splatting. Reconstructed point clouds were converted to fixed-connectivity edge graphs for region-of-interest (ROI) displacement tracking and a relative surface-based stress proxy. Stress-proxy values were derived from edge stretch using a Neo-Hookean mapping and reported as comparative surface metrics. A synthetic Blender pipeline with known deformation provided geometric and temporal validation. Results: In synthetic bulk translation, the stress proxy remained near zero for most edges (median $\approx$ 0 MPa; 90th percentile 0.028 MPa), with sparse outliers. In synthetic pulling (1-5 mm), reconstruction showed close geometric and temporal agreement with ground truth, with symmetric Chamfer distance of 1.714-1.815 mm and precision of 0.964-0.972 at $\tau = 1$ mm. In preliminary benchtop comparative trials (one trial per condition), cervical aspiration catheter placement showed higher max-median ROI displacement and stress-proxy values than internal carotid artery terminus placement. Conclusion: The proposed protocol provides standardized, time-resolved surface kinematics and comparative relative displacement and stress proxy measurements for thrombectomy benchtop studies. The framework supports condition-to-condition comparisons and methods validation, while remaining distinct from absolute wall-stress estimation. Implementation code and example data are available at https://ethanuser.github.io/vessel4D

Keywords: 4D vessel reconstruction, mechanical thrombectomy, benchtop analysis, multi-view imaging, Gaussian Splatting, stress proxy, silicone phantoms, displacement tracking
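
The abstract reports a symmetric Chamfer distance and a precision at a threshold $\tau$ between reconstructed and ground-truth geometry. The sketch below shows one common way to compute both for small point clouds. The convention used here (sum of the two mean nearest-neighbor distances, unsquared) is an assumption, since the abstract does not state which variant the authors use; all function names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def nearest_distance(p: Point, cloud: List[Point]) -> float:
    """Euclidean distance from p to its nearest neighbor in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)


def symmetric_chamfer(a: List[Point], b: List[Point]) -> float:
    """Mean nearest-neighbor distance a->b plus the same quantity b->a.

    Assumed convention: unsquared distances, directions summed (not averaged).
    """
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def precision_at(recon: List[Point], truth: List[Point], tau: float) -> float:
    """Fraction of reconstructed points lying within tau of the ground truth."""
    hits = sum(1 for p in recon if nearest_distance(p, truth) <= tau)
    return hits / len(recon)


# Toy clouds in millimetres (made-up values, not data from the paper).
truth_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
recon_cloud = [(0.0, 0.0, 0.5), (1.0, 0.0, 2.0)]
print(symmetric_chamfer(recon_cloud, truth_cloud))
print(precision_at(recon_cloud, truth_cloud, tau=1.0))
```

The brute-force search is quadratic in the number of points. For clouds the size of a real reconstruction, a spatial index such as a k-d tree would be the usual replacement.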

241. ❌ VDPP: Video Depth Post-Processing for Speed and Scalability

Authors: Daewon Yoon, Injun Baek, Sangyu Han, Yearim Kim, Nojun Kwak Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06665v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

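Scoring rationale: VDPP focuses on post-processing for video depth estimation and proposes a purely geometric refinement framework to improve speed and accuracy. All scoring keywords relate directly to large language models, deep learning principles, or applications of AI to science, whereas this paper studies video depth estimation in computer vision, a conventional deep learning application. It does not touch on large model techniques, model training methods, inference optimization, alignment, agent systems, or scientific AI applications, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the low speed and accuracy of existing post-processing methods for video depth estimation, the paper proposes VDPP, a framework that performs purely geometric refinement in low-resolution space. It achieves high speed (>43.5 FPS) with temporal coherence matching end-to-end systems, offering a practical solution for real-time edge deployment.

Abstract (Chinese translation)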

视频深度估计对于在自动驾驶到混合现实等应用中提供三维场景结构至关重要。当前端到端视频深度模型已确立了最先进的性能。尽管现有端到端（E2E）模型取得了最优性能，但它们作为紧耦合系统存在明显缺陷：每当更优秀的单图像深度估计器发布时，这些系统都会面临严重的适应滞后问题。为缓解此问题，以NVDS为代表的后处理方法提供了模块化即插即用方案，无需重新训练即可整合任何持续演进的图像深度模型。然而，由于速度、精度和对RGB输入的依赖限制，现有后处理方法仍难以匹配端到端系统的效率与实用性。本研究通过提出VDPP（视频深度后处理框架）重新激活后处理方法的作用，该框架提升了视频深度估计后处理的速度与精度。通过将范式从计算昂贵的场景重建转向针对性几何优化，VDPP仅在低分辨率空间进行纯几何优化。该设计在匹配端到端系统时序一致性的同时实现了卓越速度（在NVIDIA Jetson Orin Nano平台>43.5 FPS），其通过密集残差学习驱动几何表征而非完整重建。此外，VDPP的无RGB架构确保了真正的可扩展性，能够即时集成任何持续演进的图像深度模型。实验结果表明，VDPP在速度、精度和内存效率方面实现了更优平衡，使其成为实时边缘部署最具实用性的解决方案。项目页面详见：https://github.com/injun-baek/VDPP

Abstract

Video depth estimation is essential for providing 3D scene structure in applications ranging from autonomous driving to mixed reality. Although current end-to-end (E2E) video depth models have achieved state-of-the-art performance, they function as tightly coupled systems that suffer from a significant adaptation lag whenever superior single-image depth estimators are released. To mitigate this issue, post-processing methods such as NVDS offer a modular plug-and-play alternative to incorporate any evolving image depth model without retraining. However, existing post-processing methods still struggle to match the efficiency and practicality of E2E systems due to limited speed, accuracy, and RGB reliance. In this work, we revitalize the role of post-processing by proposing VDPP (Video Depth Post-Processing), a framework that improves the speed and accuracy of post-processing methods for video depth estimation. By shifting the paradigm from computationally expensive scene reconstruction to targeted geometric refinement, VDPP operates purely on geometric refinements in low-resolution space. This design achieves exceptional speed (>43.5 FPS on NVIDIA Jetson Orin Nano) while matching the temporal coherence of E2E systems, with dense residual learning driving geometric representations rather than full reconstructions. Furthermore, VDPP’s RGB-free architecture ensures true scalability, enabling immediate integration with any evolving image depth model. Our results demonstrate that VDPP provides a superior balance of speed, accuracy, and memory efficiency, making it the most practical solution for real-time edge deployment. Our project page is at https://github.com/injun-baek/VDPP

Keywords: video depth estimation, post-processing, geometric refinement, real-time, edge deployment, temporal coherence, RGB-free, VDPP

242. ❌ Towards Robust Content Watermarking Against Removal and Forgery Attacks

Authors: Yifan Zhu, Yihan Wang, Xiao-Shan Gao Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

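Scoring rationale: The paper studies content watermarking for text-to-image diffusion models, focusing on robustness against removal and forgery attacks. The scoring keywords all target specific directions such as large model principles, training methods, inference optimization, alignment techniques, and scientific applications, whereas this paper covers only watermarking for diffusion models. It involves no large language models, no innovations in deep learning principles, and no scientific applications, and has no direct connection to any keyword, so every keyword is scored 0.

!!! tip deepseek-chat TL;DR

To address the vulnerability of watermarks for text-to-image diffusion models to removal and forgery attacks, the paper proposes a new paradigm based on instance-specific watermarking and two-sided detection, and shows experimentally that it resists these attacks effectively.

Abstract (Chinese translation)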

生成内容引发了关于版权保护、图像来源与归属的严重关切。水印技术是解决这些问题的潜在方案。近年来，针对文本到图像扩散模型的内容水印技术因其有效的检测效用与鲁棒性而得到广泛研究。然而，现有水印方法易受潜在的对抗性攻击，例如去除攻击与伪造攻击。本文构建了一种名为“基于实例特异性与双向检测的水印”（Instance-Specific watermarking with Two-Sided detection, ISTS）的新范式，以抵御去除与伪造攻击。具体而言，我们提出一种根据用户提示词语义动态控制水印注入时机与模式生成策略。此外，我们设计了一种新型双向检测方法以增强水印检测的鲁棒性。实验证明，我们的水印方案在抵抗去除攻击与伪造攻击方面具有显著优势。

Abstract

Generated contents have raised serious concerns about copyright protection, image provenance, and credit attribution. A potential solution for these problems is watermarking. Recently, content watermarking for text-to-image diffusion models has been studied extensively for its effective detection utility and robustness. However, these watermarking techniques are vulnerable to potential adversarial attacks, such as removal attacks and forgery attacks. In this paper, we build a novel watermarking paradigm called Instance-Specific watermarking with Two-Sided detection (ISTS) to resist removal and forgery attacks. Specifically, we introduce a strategy that dynamically controls the injection time and watermarking patterns based on the semantics of users’ prompts. Furthermore, we propose a new two-sided detection approach to enhance robustness in watermark detection. Experiments have demonstrated the superiority of our watermarking against removal and forgery attacks.

Keywords: content watermarking, text-to-image diffusion models, removal attacks, forgery attacks, instance-specific watermarking, two-sided detection, robustness, adversarial attacks

243. ❌ GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation

Authors: Chung-Ming Lo, I-Yun Liu, Wei-Yang Lin Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

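Scoring rationale: The paper focuses on a deep learning architecture for 3D medical image segmentation (GPAFormer), an application of AI in biomedicine. All keywords relate directly to large model (LLM) techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper studies a lightweight hybrid convolution/Transformer segmentation network and involves no large model techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image segmentation falls under bioinformatics / AI for science; since that is not the core contribution (which is the network architecture), it receives 5 points (some relevance). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes GPAFormer, a lightweight 3D medical image segmentation network. Using a multi-scale attention-guided stacked aggregation module and a mutual-aware patch graph aggregator, it achieves high segmentation accuracy and fast inference on several public datasets with only 1.81 M parameters, balancing accuracy and efficiency for resource-constrained clinical settings.

Abstract (Chinese translation)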

深度学习已广泛应用于三维医学图像分割任务。然而，由于成像模态的多样性、数据的高维特性以及解剖结构的异质性，在多器官分割中同时实现分割精度与计算效率仍具挑战。本研究提出了GPAFormer，一种专为三维医学图像分割设计的轻量化网络架构，在保持高精度的同时强调效率。GPAFormer包含两个核心模块：多尺度注意力引导堆叠聚合模块（MASA）和互感知图块图聚合器（MPGA）。MASA采用三条具有不同感受野的并行路径，通过平面聚合进行融合，以增强网络处理不同尺寸结构的能力。MPGA采用图引导方法，基于图块间特征相似性与空间邻接性动态聚合具有相似特征分布的区域，从而提升器官内部及边界结构的区分能力。实验在公开的全身CT和MRI数据集上进行，包括BTCV、Synapse、ACDC和BraTS。与现有三维分割网络相比，仅使用1.81 M参数的GPAFormer在BTCV（75.70%）、Synapse（81.20%）、ACDC（89.32%）和BraTS（82.74%）上取得了综合最高的DSC分数。在消费级GPU上，单个BTCV验证案例的推理时间不足一秒。结果表明，GPAFormer在各种临床场景（特别是资源受限和时间敏感的临床环境）下的多器官、多模态三维分割任务中实现了精度与效率的平衡。

Abstract

Deep learning has been widely applied to 3D medical image segmentation tasks. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains a challenge. This study proposed GPAFormer, a lightweight network architecture specifically designed for 3D medical image segmentation, emphasizing efficiency while keeping high accuracy. GPAFormer incorporated two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA utilized three parallel paths with different receptive fields, combined through planar aggregation, to enhance the network’s capability in handling structures of varying sizes. MPGA employed a graph-guided approach to dynamically aggregate regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both internal and boundary structures of organs. Experiments were performed on public whole-body CT and MRI datasets including BTCV, Synapse, ACDC, and BraTS. Compared to existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the highest overall DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). On a consumer-level GPU, inference for one BTCV validation case took less than one second. The results demonstrated that GPAFormer balanced accuracy and efficiency in multi-organ, multi-modality 3D segmentation tasks across various clinical scenarios, especially in resource-constrained and time-sensitive clinical environments.

Keywords: 3D medical image segmentation, lightweight network, multi-organ segmentation, graph-guided aggregation, computational efficiency, Transformer architecture, multi-modality, clinical applications
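
The DSC figures quoted in the abstract are Dice similarity coefficients. As a reminder of what the metric measures, here is a minimal sketch for one binary mask flattened to a list of 0/1 labels. The function name, the empty-mask convention, and the toy masks are assumptions for illustration; the paper's per-organ averaging protocol is not described in the abstract and is not reproduced here.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity coefficient, 2|P ∩ T| / (|P| + |T|), for binary masks.

    pred and target are flattened masks of equal length with 0/1 entries.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


# Toy masks: 3 predicted voxels, 3 true voxels, 2 of them shared.
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
print(dice_coefficient(predicted, reference))
```

A multi-organ score such as those reported per dataset is typically obtained by computing this value per class and averaging, though the exact averaging used by the authors is an open detail here.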

244. ❌ Controllable Generative Video Compression

Authors: Ding Ding, Daowen Li, Ying Chen, Yixin Gao, Ruixiao Dong, Kai Li, Li Li Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06655v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

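Scoring rationale: The paper focuses on video compression and proposes a controllable generative video compression method that uses a generative model to improve perceptual quality while maintaining signal fidelity. It involves video coding, generative models, and structural priors, but nothing related to large language models, innovations in deep learning principles, or applications of AI to science. All scoring keywords concern large models, deep learning principles, or AI for science, whereas this is purely computer vision / video processing research with no connection to those keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a controllable generative video compression method that codes keyframes and dense control priors to guide non-keyframe generation, improving perceptual quality while preserving signal fidelity.

Abstract (Chinese translation)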

感知视频压缩采用生成式视频建模以提升感知真实性，但常以牺牲信号保真度为代价，这与视频压缩需忠实再现视觉信号的目标相悖。为缓解感知与保真度间的矛盾，本文提出可控生成式视频压缩范式，通过多重视觉条件引导以忠实生成细节。在该范式下，场景中的代表性关键帧被编码并用作非关键帧生成的结构先验。额外编码的逐帧稠密控制先验则用于更好地保留每个非关键帧的精细结构与语义信息。在这些先验引导下，非关键帧通过具备时序与内容一致性的可控视频生成模型进行重建。此外，为精确恢复视频色彩信息，我们开发了一种基于色彩距离引导的关键帧选择算法以自适应选取关键帧。实验结果表明，CGVC在信号保真度与感知质量方面均优于以往的感知视频压缩方法。

Abstract

Perceptual video compression adopts generative video modeling to improve perceptual realism but frequently sacrifices signal fidelity, diverging from the goal of video compression to faithfully reproduce visual signal. To alleviate the dilemma between perception and fidelity, in this paper we propose Controllable Generative Video Compression (CGVC) paradigm to faithfully generate details guided by multiple visual conditions. Under the paradigm, representative keyframes of the scene are coded and used to provide structural priors for non-keyframe generation. Dense per-frame control prior is additionally coded to better preserve finer structure and semantics of each non-keyframe. Guided by these priors, non-keyframes are reconstructed by controllable video generation model with temporal and content consistency. Furthermore, to accurately recover color information of the video, we develop a color-distance-guided keyframe selection algorithm to adaptively choose keyframes. Experimental results show CGVC outperforms previous perceptual video compression method in terms of both signal fidelity and perceptual quality.

Keywords: video compression, generative video modeling, perceptual quality, signal fidelity, controllable generation, keyframe selection, temporal consistency, content consistency
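The abstract mentions a color-distance-guided keyframe selection algorithm but does not spell out its rule. The following is only a minimal sketch of one plausible reading, assuming each frame is summarized by its mean RGB color and a new keyframe is chosen whenever that color drifts beyond a threshold from the last keyframe. The function names, the mean-color summary, and the threshold are all hypothetical and are not taken from the paper.

```python
import math


def mean_color(frame):
    """Average RGB of a frame given as a list of (r, g, b) pixels."""
    n = len(frame)
    return tuple(sum(pixel[c] for pixel in frame) / n for c in range(3))


def select_keyframes(frames, threshold):
    """Greedy selection (assumed rule, not the paper's algorithm).

    Frame 0 is always a keyframe. A later frame becomes a keyframe when the
    Euclidean distance between its mean color and the mean color of the most
    recent keyframe exceeds `threshold`.
    """
    if not frames:
        return []
    keyframes = [0]
    last = mean_color(frames[0])
    for index in range(1, len(frames)):
        current = mean_color(frames[index])
        if math.dist(current, last) > threshold:
            keyframes.append(index)
            last = current
    return keyframes
```

With five tiny frames whose colors jump at indices 2 and 4, `select_keyframes(frames, 10)` picks indices 0, 2, and 4; frames between jumps would be left for the generative model to reconstruct.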

245. ❌ Euclid Quick Data Release (Q1). AgileLens: A scalable CNN-based pipeline for strong gravitational lens identification

Authors: Euclid Collaboration, X. Xu, R. Chen, T. Li, A. R. Cooray, S. Schuldt, J. A. Acevedo Barroso, D. Stern, D. Scott, M. Meneghetti, G. Despali, J. Chopra, Y. Cao, M. Cheng, J. Buda, J. Zhang, J. Furumizo, R. Valencia, Z. Jiang, C. Tortora, N. E. P. Lines, T. E. Collett, S. Fotopoulou, A. Galan, A. Manjón-García, R. Gavazzi, L. Iwamoto, S. Kruk, M. Millon, P. Nugent, C. Saulder, D. Sluse, J. Wilde, M. Walmsley, F. Courbin, R. B. Metcalf, B. Altieri, A. Amara, S. Andreon, N. Auricchio, C. Baccigalupi, M. Baldi, A. Balestra, S. Bardelli, P. Battaglia, R. Bender, A. Biviano, E. Branchini, M. Brescia, S. Camera, V. Capobianco, C. Carbone, V. F. Cardone, J. Carretero, S. Casas, M. Castellano, G. Castignani, S. Cavuoti, A. Cimatti, C. Colodro-Conde, G. Congedo, C. J. Conselice, L. Conversi, Y. Copin, H. M. Courtois, M. Cropper, A. Da Silva, H. Degaudenzi, G. De Lucia, C. Dolding, H. Dole, F. Dubath, X. Dupac, S. Dusini, S. Escoffier, M. Farina, R. Farinelli, S. Farrens, S. Ferriol, F. Finelli, P. Fosalba, M. Frailis, E. Franceschi, M. Fumana, S. Galeotta, K. George, W. Gillard, B. Gillis, C. Giocoli, P. Gómez-Alvarez, J. Gracia-Carpio, A. Grazian, F. Grupp, S. V. H. Haugan, W. Holmes, F. Hormuth, A. Hornstrup, K. Jahnke, M. Jhabvala, B. Joachimi, S. Kermiche, A. Kiessling, B. Kubik, M. Kümmel, M. Kunz, H. Kurki-Suonio, A. M. C. Le Brun, S. Ligori, P. B. Lilje, V. Lindholm, I. Lloro, G. Mainetti, E. Maiorano, O. Mansutti, S. Marcin, O. Marggraf, M. Martinelli, N. Martinet, F. Marulli, R. J. Massey, E. Medinaceli, S. Mei, M. Melchior, E. Merlin, G. Meylan, A. Mora, M. Moresco, L. Moscardini, R. Nakajima, C. Neissner, R. C. Nichol, S.-M. Niemi, J. W. Nightingale, C. Padilla, S. Paltani, F. Pasian, K. Pedersen, W. J. Percival, V. Pettorino, G. Polenta, M. Poncet, L. A. Popa, F. Raison, A. Renzi, J. Rhodes, G. Riccio, E. Romelli, M. Roncarelli, R. Saglia, Z. Sakr, D. Sapone, M. Schirmer, P. Schneider, T. Schrabback, A. Secroun, G. Seidel, E. Sihvola, P. Simon, C. Sirignano, G. Sirri, L. Stanco, P. Tallada-Crespí, A. N. Taylor, I. Tereno, N. Tessore, S. Toft, R. Toledo-Moreo, F. Torradeflot, I. Tutusaus, L. Valenziano, J. Valiviita, T. Vassallo, G. Verdoes Kleijn, A. Veropalumbo, Y. Wang, J. Weller, A. Zacchei, G. Zamorani, F. M. Zerbi, E. Zucca, M. Ballardini, M. Bolzonella, C. Burigana, R. Cabanac, M. Calabrese, A. Cappi, T. Castro, J. A. Escartin Vigo, L. Gabarra, S. Hemmati, J. Macias-Perez, R. Maoli, J. Martín-Fleitas, N. Mauri, P. Monaco, A. A. Nucita, A. Pezzotta, M. Pöntinen, I. Risso, V. Scottez, M. Sereno, M. Tenti, M. Tucci, M. Viel, M. Wiesmann, Y. Akrami, I. T. Andika, G. Angora, S. Anselmi, M. Archidiacono, F. Atrio-Barandela, L. Bazzanini, P. Bergamini, D. Bertacca, M. Bethermin, F. Beutler, L. Blot, S. Borgani, M. L. Brown, S. Bruton, A. Calabro, B. Camacho Quevedo, F. Caro, C. S. Carvalho, F. Cogato, S. Conseil, O. Cucciati, S. Davini, G. Desprez, A. Díaz-Sánchez, S. Di Domizio, J. M. Diego, P.-A. Duc, V. Duret, M. Y. Elkhashab, A. Enia, Y. Fang, A. Finoguenov, A. Franco, K. Ganga, T. Gasparetto, E. Gaztanaga, F. Giacomini, F. Gianotti, G. Gozaliasl, M. Guidi, C. M. Gutierrez, A. Hall, C. Hernández-Monteagudo, H. Hildebrandt, J. Hjorth, J. J. E. Kajava, Y. Kang, V. Kansal, D. Karagiannis, K. Kiiveri, J. Kim, C. C. Kirkpatrick, F. Lepori, G. Leroy, G. F. Lesci, J. Lesgourgues, T. I. Liaudat, S. J. Liu, M. Magliocchetti, E. A. Magnier, F. Mannucci, C. J. A. P. Martins, L. Maurin, M. Miluzio, C. Moretti, G. Morgante, K. Naidoo, A. Navarro-Alsina, S. Nesseris, D. Paoletti, F. Passalacqua, K. Paterson, L. Patrizii, A. Pisani, D. Potter, G. W. Pratt, S. Quai, M. Radovich, K. Rojas, W. Roster, S. Sacquegna, M. Sahlén, D. B. Sanders, E. Sarpa, C. Scarlata, A. Schneider, M. Schultheis, D. Sciotti, E. Sellentin, L. C. Smith, K. Tanidis, C. Tao, F. Tarsitano, G. Testera, R. Teyssier, S. Tosi, A. Troja, A. Venhola, D. Vergani, G. Vernardos, G. Verza, S. Vinciguerra, N. A. Walton, A. H. Wright, H. W. Yeung Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06648v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
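The pass/fail verdict printed above each table can be reproduced from the table's Score column and the report's scoring settings (pass mark = 5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below only shows that arithmetic. How each row's score is derived from its weight and relevance is not stated in this part of the report, so the code simply sums the Score column; the function names are hypothetical.

```python
PASS_BASE = 5.0        # constant term of the report's pass-mark formula
PASS_PER_WEIGHT = 0.8  # multiplier applied to the total keyword weight


def pass_threshold(total_weight):
    """Pass mark as configured in the report: 5.0 + 0.8 * total weight."""
    return round(PASS_BASE + PASS_PER_WEIGHT * total_weight, 1)


def verdict(row_scores, total_weight):
    """Return (total score, pass mark, passed?) for one paper."""
    total = sum(row_scores)
    threshold = pass_threshold(total_weight)
    return total, threshold, total >= threshold
```

For the table above, all 27 rows score 0.0, so `verdict([0.0] * 27, 27.0)` gives a total of 0.0 against a pass mark of 26.6, which matches the ❌ shown for this paper.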

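Scoring rationale: This paper addresses strong gravitational lens identification in astronomy. It builds an end-to-end pipeline around a CNN (VGG16), covering data preprocessing, image augmentation, iterative fine-tuning, and candidate vetting. The scoring keywords concern large models, deep learning principles, or applications of AI in science, but the paper only applies a conventional CNN to a specific scientific task (astronomy). It does not involve large models, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. The only relevant keyword is "AI for Science": the paper applies deep learning to astronomical research, which is an application of AI in science, but that is not its core contribution, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper develops AgileLens, an end-to-end iterative CNN-based pipeline for efficiently identifying strong gravitational lens systems in Euclid Q1 imaging data. After data preprocessing, image augmentation, and iterative fine-tuning, human grading of the final model's top candidates yields 441 Grade A/B candidate lens systems, 130 of which are new.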

Abstract translation (Chinese)

我们提出一种用于高效识别强星系-星系引力透镜系统的端到端迭代流程，并将其应用于欧几里得望远镜Q1成像数据。该流程从可见光通道（VIS）星表出发，剔除点源，对偏折星系施加星等截断（I$_E$ ≤ 24），并运行像素级伪影/噪声滤波器以构建96×96像素的切出图像；通过采用以VIS通道为基准的亮度合成方案，构建了VIS+近红外光谱光度计（NISP）彩色合成图像，该方案在保留VIS形态信息的同时维持了NISP的色彩对比度。一个仅使用VIS数据的初始分类器提供明确的正样本及典型伪样本，我们据此构建了形态学平衡的负样本集，并对稀缺的正样本进行数据增强。在初步研究的六个卷积神经网络（CNN）中，改进的VGG16架构（采用全局平均池化及256/128全连接层，最后九层可训练）表现最佳；训练集从27个初始透镜样本（增强至1809个）加2000个负样本，扩展至包含30,686幅图像的彩色数据集。经过三轮迭代微调后，由最终模型排名前4000的候选样本经人工分级，共获得441个A/B级候选透镜系统，其中311个与现有Q1强透镜星表重叠，另有130个新增A/B级候选体（9个A级，121个B级）为首次报告。独立测试显示，在考虑中心偏移样本的情况下，模型在其前20,000个预测中成功复现了905个Q1候选透镜中的740个（复现率81.8%）。候选体覆盖I$_E$ ≃ 17–24 AB星等范围（中位数21.3 AB星等），其Y$_E$–H$_E$颜色比母样本更红，与大质量早型偏折星系的特征一致。每轮训练迭代需小型团队耗时一周完成，该方法可轻松扩展至未来欧几里得数据发布；后续工作将通过透镜注入技术校准选择函数，采用不确定性感知的主动学习提升召回率，并探索结合多尺度或注意力机制神经网络与快速后验验证系统（该验证系统将透镜模型整合至分类流程中）的方案。

Abstract

We present an end-to-end, iterative pipeline for efficient identification of strong galaxy–galaxy lensing systems, applied to the Euclid Q1 imaging data. Starting from VIS catalogues, we reject point sources, apply a magnitude cut (I$_E$ $\leq$ 24) on deflectors, and run a pixel-level artefact/noise filter to build 96 $\times$ 96 pix cutouts; VIS+NISP colour composites are constructed with a VIS-anchored luminance scheme that preserves VIS morphology and NISP colour contrast. A VIS-only seed classifier supplies clear positives and typical impostors, from which we curate a morphology-balanced negative set and augment scarce positives. Among the six CNNs studied initially, a modified VGG16 (GlobalAveragePooling + 256/128 dense layers with the last nine layers trainable) performs best; the training set grows from 27 seed lenses (augmented to 1809) plus 2000 negatives to a colour dataset of 30,686 images. After three rounds of iterative fine-tuning, human grading of the top 4000 candidates ranked by the final model yields 441 Grade A/B candidate lensing systems, including 311 overlapping with the existing Q1 strong-lens catalogue, and 130 additional A/B candidates (9 As and 121 Bs) not previously reported. Independently, the model recovers 740 out of 905 (81.8%) candidate Q1 lenses within its top 20,000 predictions, considering off-centred samples. Candidates span I$_E$ $\simeq$ 17–24 AB mag (median 21.3 AB mag) and are redder in Y$_E$–H$_E$ than the parent population, consistent with massive early-type deflectors. Each training iteration required a week for a small team, and the approach easily scales to future Euclid releases; future work will calibrate the selection function via lens injection, extend recall through uncertainty-aware active learning, explore multi-scale or attention-based neural networks with fast post-hoc vetters that incorporate lens models into the classification.

Keywords: strong gravitational lensing, CNN pipeline, Euclid Q1 data, iterative fine-tuning, candidate identification, VGG16, astronomy AI, image classification
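The counts quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the abstract; the variable and function names are hypothetical, and the "yield among the top 4000" is a derived quantity that the abstract does not itself report.

```python
# Figures quoted in the abstract.
SEED_LENSES, AUGMENTED_POSITIVES = 27, 1809
RECOVERED, KNOWN_Q1_CANDIDATES = 740, 905
OVERLAP_WITH_CATALOGUE, NEW_CANDIDATES = 311, 130
NEW_GRADE_A, NEW_GRADE_B = 9, 121
GRADED_TOP_N, GRADE_AB_TOTAL = 4000, 441


def percent(part, whole, digits=1):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


augmentation_factor = AUGMENTED_POSITIVES / SEED_LENSES      # images per seed lens
recovery_rate = percent(RECOVERED, KNOWN_Q1_CANDIDATES)      # quoted as 81.8%
yield_in_top_n = percent(GRADE_AB_TOTAL, GRADED_TOP_N)       # derived, not quoted
```

The numbers are internally consistent: 1809 / 27 is exactly 67 augmented images per seed lens, 311 + 130 = 441, 9 + 121 = 130, and 740 / 905 rounds to the quoted 81.8%.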

246. ❌ Variational Feature Compression for Model-Specific Representations

Authors: Zinan Guo, Zihan Wang, Chuan Yan, Liuhuo Wan, Ethan Ma, Guangdong Bai Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06644v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

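Scoring rationale: This paper studies a privacy problem in deep learning. It proposes a feature compression framework that limits cross-model data reuse, built on a variational latent bottleneck, KL divergence, and gradient-based saliency. The scoring keywords all relate directly to large-model principles, training methods, inference optimization, or application domains, whereas this paper focuses on a privacy defense for conventional deep learning models. It does not touch large models, LLMs, MoE, quantization, inference acceleration, AI for Science, or any other keyword area, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper targets the privacy risk that data submitted for deep learning inference is reused by unauthorized models. It proposes a variational feature compression framework in which a dynamic binary mask suppresses latent dimensions that are uninformative for the designated task. On CIFAR-100, the designated classifier keeps high accuracy while the accuracy of unintended models drops below 2%.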

Abstract translation (Chinese)

随着深度学习推理日益部署于共享及云端环境中，输入数据再用途问题逐渐凸显——即提交用于特定任务的数据可能被未授权模型重新用于其他任务。现有隐私防御机制主要集中于限制数据访问，但对已发布表征所能支持的下游用途控制有限。本文提出一种特征提取框架，在保持指定分类器精度的同时抑制跨模型迁移。该框架采用变分潜在瓶颈结构，通过任务驱动的交叉熵目标与KL正则化进行训练（不包含任何像素级重建损失），将输入编码至紧凑的潜在空间。基于各维度KL散度及针对冻结目标模型的梯度显著性计算的动态二值掩码，可抑制对目标任务信息贡献度低的潜在维度。由于显著性计算需要梯度访问，编码器在白盒环境下训练，而推理阶段仅需对冻结目标模型进行前向传播。在CIFAR-100数据集上，经处理的表征对指定分类器保持高效用，同时将所有非目标分类器的准确率降至2%以下，实现相对于非目标模型超过45倍的抑制比。在CIFAR-10、Tiny ImageNet和Pascal VOC上的初步实验表明该方法可扩展至不同任务场景，但需进一步评估其对自适应攻击的鲁棒性。

Abstract

As deep learning inference is increasingly deployed in shared and cloud-based settings, a growing concern is input repurposing, in which data submitted for one task is reused by unauthorized models for another. Existing privacy defenses largely focus on restricting data access, but provide limited control over what downstream uses a released representation can still support. We propose a feature extraction framework that suppresses cross-model transfer while preserving accuracy for a designated classifier. The framework employs a variational latent bottleneck, trained with a task-driven cross-entropy objective and KL regularization, but without any pixel-level reconstruction loss, to encode inputs into a compact latent space. A dynamic binary mask, computed from per-dimension KL divergence and gradient-based saliency with respect to the frozen target model, suppresses latent dimensions that are uninformative for the intended task. Because saliency computation requires gradient access, the encoder is trained in a white-box setting, whereas inference requires only a forward pass through the frozen target model. On CIFAR-100, the processed representations retain strong utility for the designated classifier while reducing the accuracy of all unintended classifiers to below 2%, yielding a suppression ratio exceeding 45 times relative to unintended models. Preliminary experiments on CIFAR-10, Tiny ImageNet, and Pascal VOC provide exploratory evidence that the approach extends across task settings, although further evaluation is needed to assess robustness against adaptive adversaries.

Keywords: feature compression, privacy defense, variational latent bottleneck, cross-model transfer suppression, dynamic binary mask, KL divergence, gradient saliency, white-box training
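The abstract describes a binary mask computed from per-dimension KL divergence and gradient-based saliency. As a rough illustration, the sketch below uses the standard closed-form KL divergence between a diagonal Gaussian and a standard normal, then keeps a latent dimension only if both its KL and its saliency magnitude clear a threshold. The way the two signals are combined (a logical AND of two thresholds) and all names are assumptions; the paper may combine them differently.

```python
import math


def per_dim_kl(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - lv - 1.0) for m, lv in zip(mu, log_var)]


def binary_mask(kl, saliency, kl_min, saliency_min):
    """Assumed rule: keep a dimension only if it carries information (KL)
    and matters to the frozen target model (saliency magnitude)."""
    return [
        1 if k >= kl_min and abs(s) >= saliency_min else 0
        for k, s in zip(kl, saliency)
    ]


def apply_mask(z, mask):
    """Zero out the suppressed latent dimensions."""
    return [value * keep for value, keep in zip(z, mask)]
```

For means `[0.0, 2.0, 1.5]` with unit variance, the per-dimension KL values are `[0.0, 2.0, 1.125]`. With saliencies `[0.9, 0.5, 0.01]` and thresholds of 0.5 and 0.1, only the second dimension survives: the first carries no information and the third is irrelevant to the target model.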

247. ❌ SubFLOT: Submodel Extraction for Efficient and Personalized Federated Learning via Optimal Transport

Authors: Zheng Jiang, Nan He, Yiming Chen, Lifeng Sun Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06631v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

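Scoring rationale: The paper focuses on personalized pruning and optimal transport in federated learning (FL), aiming to address system and statistical heterogeneity. The scoring keywords all relate directly to large models, deep learning principles, or AI-for-science applications, whereas this paper optimizes a general federated learning framework and does not involve large-model techniques, deep learning innovations, or a specific scientific domain, so it is unrelated to all of them.

!!! tip deepseek-chat TL;DR

The paper proposes SubFLOT, a framework that combines optimal transport-enhanced pruning with scaling-based adaptive regularization to solve the problem of server-side personalized pruning in federated learning, enabling efficient and personalized model deployment on resource-constrained edge devices.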

Abstract translation (Chinese)

联邦学习（Federated Learning, FL）能够在保护数据隐私的前提下实现协同模型训练，但其实际部署受到系统异构性与统计异构性的制约。联邦网络剪枝为缓解这些问题提供了一条路径，然而现有方法面临一个关键困境：服务器端剪枝缺乏个性化，而客户端剪枝对于资源受限的设备而言计算成本过高。此外，剪枝过程本身会引发异构子模型间显著的参数差异，从而破坏训练稳定性并阻碍全局收敛。为应对这些挑战，我们提出了SubFLOT，一种新颖的服务器端个性化联邦剪枝框架。SubFLOT引入了一个最优传输增强剪枝（Optimal Transport-enhanced Pruning, OTP）模块，该模块将历史客户端模型视为本地数据分布的代理，将剪枝任务构建为一个Wasserstein距离最小化问题，从而在不访问原始数据的情况下生成定制化的子模型。同时，为抵消参数差异，我们基于缩放的自适应正则化（Scaling-based Adaptive Regularization, SAR）模块自适应地惩罚子模型相对于全局模型的偏离，其惩罚强度根据客户端的剪枝率进行缩放。全面的实验表明，SubFLOT在各项指标上持续且显著地优于现有先进方法，这凸显了其在资源受限的边缘设备上部署高效、个性化模型的潜力。

Abstract

Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel’s deviation from the global model, with the penalty’s strength scaled by the client’s pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.

Keywords: Federated Learning, Personalized Pruning, Optimal Transport, Model Compression, Edge Computing, Heterogeneous Systems, Wasserstein Distance, Adaptive Regularization
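The abstract frames pruning as Wasserstein distance minimization against historical client models. As a toy illustration of that idea only, the sketch below computes the 1-D Wasserstein-1 distance between two equally sized sets of weights (the mean absolute difference after sorting) and picks the candidate submodel closest to a client proxy. Treating weights as 1-D samples, the candidate dictionary, and the selection helper are all assumptions; this is not the paper's OTP module.

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical distributions on the line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


def pick_submodel(candidates, client_proxy):
    """Return the name of the candidate whose weights are closest to the proxy.

    `candidates` maps a (hypothetical) submodel name to its flattened weights;
    `client_proxy` stands in for a historical client model's weights.
    """
    return min(candidates, key=lambda name: wasserstein_1d(candidates[name], client_proxy))
```

Because the distance depends only on the sorted values, two weight vectors that are permutations of each other are at distance zero, which is why the proxy can be compared without aligning individual parameters.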

248. ❌ WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression

Authors: Weikai Qu, Sijun Liang, Cheng Pan, Zikuan Yang, Guanchi Zhou, Xianjun Fu, Bo Liu, Changmiao Wang, Ahmed Elazab Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06623v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

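Scoring rationale: WeatherRemover addresses the computer vision task of removing weather effects from images. It proposes a model that combines a UNet-like structure, a gating mechanism, and a multi-scale pyramid vision Transformer to efficiently remove rain, snow, and fog. The scoring keywords all relate directly to large language models (LLMs), deep learning principles, or AI-for-science applications, whereas this paper studies a conventional computer vision task. It does not involve large models, language processing, agent systems, alignment training, or inference acceleration, and it is not applied to scientific fields such as bioinformatics or cheminformatics. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes WeatherRemover, a lightweight model that efficiently removes multiple weather effects (such as rain, snow, and fog) from images while balancing restoration quality, parameter efficiency, computational overhead, and memory usage.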

Abstract translation (Chinese)

在恶劣天气条件下拍摄的照片常因雨、雪、雾的干扰而出现模糊、遮挡和亮度不足等问题，这会严重影响后续计算机视觉任务的性能，因此去除天气效应成为图像增强的关键步骤。现有方法主要针对特定天气条件，仅有少数能够处理多种天气场景。然而，主流方法往往忽视性能考量，导致参数量大、推理时间长且内存成本高。本研究提出WeatherRemover模型，旨在增强对多种天气影响图像的恢复能力，同时兼顾性能平衡。该模型采用类似UNet的结构，结合门控机制与多尺度金字塔视觉Transformer（Vision Transformer）。它利用源自卷积神经网络（Convolutional Neural Networks, CNN）的通道注意力机制优化特征提取，同时通过线性空间缩减来降低注意力计算需求。门控机制被策略性地部署在前馈和下采样阶段，通过选择性处理冗余信息并减轻其对学习的影响，从而优化信息处理流程。这种方法能够自适应地选择关键数据，确保卓越的恢复效果并最大化效率。此外，我们的轻量化模型在恢复质量、参数效率、计算开销和内存使用之间实现了最佳平衡，这使其区别于其他多天气模型，从而有效满足实际应用需求。源代码发布于https://github.com/RICKand-MORTY/WeatherRemover。

Abstract

Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at https://github.com/RICKand-MORTY/WeatherRemover.

Keywords: adverse weather removal, image enhancement, multi-scale pyramid vision Transformer, gating mechanism, channel-wise attention, linear spatial reduction, lightweight model, computational efficiency
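The abstract says linear spatial reduction curtails the cost of attention. A back-of-the-envelope sketch of why that helps is given below, assuming keys and values are pooled to a fixed grid before attention, as in some pyramid vision Transformer variants. The pooled grid size and the function name are assumptions; the paper's exact reduction scheme is not described in the abstract.

```python
def attention_score_entries(height, width, pooled_side=None):
    """Number of query-key score entries for one attention head.

    pooled_side=None models full self-attention (every token attends to
    every token). pooled_side=p models keys/values pooled to a p x p grid,
    so the cost grows linearly with the number of query tokens.
    """
    n_queries = height * width
    n_keys = n_queries if pooled_side is None else pooled_side * pooled_side
    return n_queries * n_keys
```

On a 56 × 56 feature map, full attention needs 9,834,496 score entries per head, while pooling keys and values to an assumed 7 × 7 grid needs 153,664, a 64-fold reduction.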

249. ❌ Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction

Authors: Weikai Qu, Sijun Liang, Xianfeng Li, Cheng Pan, An Yan, Ahmed Elazab, Shanzhou Niu, Dong Zeng, Xiang Wan, Changmiao Wang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06622v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

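Scoring rationale: The paper addresses metal artifact reduction in medical imaging (CT) and proposes MARMamba, a lightweight model based on the Mamba architecture. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant (10 points), because the work applies AI to a biomedical/scientific domain (medical image analysis). The other keywords concern large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and agents), whereas this paper studies a specialized model for computer vision and medical imaging (a Mamba-based UNet) and does not involve language models, large-model principles, or general AI methods, so the other 26 keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper targets the severe artifacts that metal implants cause in CT imaging. It proposes MARMamba, a lightweight Mamba-based model that removes artifacts from metals of different sizes while preserving anatomical structures, with a good balance between computational efficiency and restoration quality.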

Abstract translation (Chinese)

在计算机断层扫描成像中，金属植入物常产生严重的伪影，损害图像质量并影响诊断准确性。现有方法主要面临三大挑战：器官与组织结构受损、对正弦图数据的依赖，以及资源使用与修复效率之间的失衡。针对这些问题，我们提出了MARMamba模型，该模型在有效消除不同尺寸金属所致伪影的同时，保持了图像原始解剖结构的完整性。此外，该模型仅关注受金属伪影影响的CT图像，无需额外输入数据。该模型采用精简的UNet架构，并以多尺度Mamba（MS-Mamba）作为核心模块。在MS-Mamba中，翻转Mamba块通过多方向分析图像以捕捉全面的上下文信息。随后，平均最大前馈网络将关键特征与平均特征相融合，以抑制伪影。这种组合使MARMamba能够高效消除伪影。实验结果表明，我们的模型在减少金属伪影方面表现优异，相较于其他模型具有明显优势，并在计算需求、内存占用与参数量之间实现了最佳平衡，凸显了其在实际应用中的实用价值。模型代码已公开于：https://github.com/RICKand-MORTY/MARMamba。

Abstract

In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, which effectively eliminates artifacts caused by metals of different sizes while maintaining the integrity of the original anatomical structures of the image. Furthermore, this model only focuses on CT images affected by metal artifacts, thus negating the requirement for additional input data. The model is a streamlined UNet architecture, which incorporates multi-scale Mamba (MS-Mamba) as its core module. Within MS-Mamba, a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations. Subsequently, the average maximum feed-forward network integrates critical features with average features to suppress the artifacts. This combination allows MARMamba to reduce artifacts efficiently. The experimental results demonstrate that our model excels in reducing metal artifacts, offering distinct advantages over other models. It also strikes an optimal balance between computational demands, memory usage, and the number of parameters, highlighting its practical utility in the real world. The code of the presented model is available at: https://github.com/RICKand-MORTY/MARMamba.

Keywords: CT metal artifact reduction, Mamba-based model, lightweight architecture, medical image restoration, computational efficiency, UNet, multi-scale Mamba, artifact suppression
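The abstract describes a flip Mamba block that analyzes the image from multiple orientations and a feed-forward network that combines maximum ("critical") and average features. The sketch below illustrates both ideas on plain nested lists, assuming four scan orders (original, horizontal flip, vertical flip, both) and an additive fusion of average and maximum. The number of orientations, the additive fusion, and all names are assumptions rather than the paper's design.

```python
def flip_scans(image):
    """Return four row-major token sequences of a 2-D image:
    original, horizontally flipped, vertically flipped, and both."""

    def flatten(rows):
        return [value for row in rows for value in row]

    h_flip = [row[::-1] for row in image]
    v_flip = image[::-1]
    hv_flip = [row[::-1] for row in v_flip]
    return [flatten(image), flatten(h_flip), flatten(v_flip), flatten(hv_flip)]


def avg_max_fuse(channels):
    """Per position, add the mean and the maximum across channels
    (an assumed stand-in for the average-maximum feed-forward idea)."""
    return [sum(values) / len(values) + max(values) for values in zip(*channels)]
```

For the 2 × 2 image `[[1, 2], [3, 4]]`, the four scans are `[1, 2, 3, 4]`, `[2, 1, 4, 3]`, `[3, 4, 1, 2]`, and `[4, 3, 2, 1]`, so a sequence model run over each sees every pixel with a different set of predecessors.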

250. ❌ Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels

Authors: Yaqi Zhao, Haoliang Sun, Yating Wang, Yongshun Gong, Yilong Yin Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06614v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

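Scoring rationale: The paper studies the robustness of prompt learning for vision-language models under partial labels and proposes a label selection method called HopS. The scoring keywords target innovations in large language models (LLMs) or deep learning principles, whereas this paper focuses on prompt learning for vision-language models (VLMs). That places it at the intersection of computer vision and natural language processing, but it does not involve core LLM architecture, training methods, inference optimization, alignment techniques, agent systems, or AI-for-science applications. It therefore has no direct connection to any keyword, and every relevance is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the limited performance of vision-language model prompt learning under partial labels. It proposes Holistic Optimal Label Selection (HopS), which combines local density-based filtering with global optimal transport and consistently improves performance under partial supervision on eight benchmark datasets.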

Abstract translation (Chinese)

提示学习作为一种参数高效的方法，在将大规模预训练视觉-语言模型适配至下游任务时获得了广泛关注。然而，当仅能获得部分标签时，其性能常受限于标签歧义与监督信息不足。为解决这一问题，我们提出整体最优标签选择方法，通过两种互补策略利用预训练特征编码器的泛化能力。首先，我们设计了一种基于局部密度的过滤器，从最近邻候选标签集中选择高频标签，并利用softmax分数识别最可能的标签，从而捕捉特征空间中的结构规律性。其次，我们引入基于最优传输的全局选择目标，将均匀采样分布映射至批次内的候选标签分布。通过最小化期望传输成本，该方法能够确定最可能的标签分配。这两种策略协同工作，从局部和全局视角提供鲁棒的标签选择。在八个基准数据集上的大量实验表明，HopS在部分监督条件下持续提升性能，并优于所有基线方法。这些结果凸显了整体标签选择的优势，为弱监督环境下的提示学习提供了实用解决方案。

Abstract

Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the top frequent labels from the nearest neighbors’ candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Those results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
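
The global selection step described above can be sketched as an entropy-regularized optimal transport problem solved with Sinkhorn iterations. This is a minimal illustration, not the authors' implementation: the solver choice, the uniform class marginal, the cost `-log(score)`, and the names `sinkhorn_select`, `scores`, and `candidates` are all assumptions introduced here.

```python
def sinkhorn_select(scores, candidates, n_iters=200, eps=0.1):
    """Pick one label per sample from its candidate set via a transport plan.

    scores[i][j]  : softmax score of sample i for class j (hypothetical input)
    candidates[i] : set of class indices allowed for sample i (the partial label)
    """
    n, k = len(scores), len(scores[0])
    # Gibbs kernel exp(-cost / eps) with cost = -log(score), i.e. score ** (1 / eps).
    # Classes outside the candidate set receive zero mass.
    kernel = [
        [scores[i][j] ** (1.0 / eps) if j in candidates[i] else 0.0 for j in range(k)]
        for i in range(n)
    ]
    row_marg = [1.0 / n] * n   # uniform sampling distribution over the batch
    col_marg = [1.0 / k] * k   # assumption: classes are balanced within the batch
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        for i in range(n):
            s = sum(kernel[i][j] * v[j] for j in range(k))
            u[i] = row_marg[i] / s if s > 0 else 0.0
        for j in range(k):
            s = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = col_marg[j] / s if s > 0 else 0.0
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # The selected label is the candidate that receives the most transported mass.
    return [max(candidates[i], key=lambda j: plan[i][j]) for i in range(n)]


scores = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.6, 0.4]]
candidates = [{0, 1}, {0}, {1}, {0, 1}]
labels = sinkhorn_select(scores, candidates)
```

Samples with a single candidate keep that label by construction; ambiguous samples are resolved jointly, because the column marginal discourages assigning every sample to the same class.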

Keywords: prompt learning, vision-language models, partial labels, weakly supervised learning, label selection, optimal transport, feature encoders, robust learning

251. ❌ VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography

Authors: Ilerioluwakiiye Abolade, Prince Mireku, Kelechi Chibundu, Peace Ododo, Emmanuel Idoko, Promise Omoigui, Solomon Odelola Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

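Scoring rationale: VAMAE focuses on medical image analysis (OCTA vessel segmentation), which falls under AI for Science (bioinformatics/medical imaging), so it is relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points). Its core contribution is a self-supervised pre-training method (masked autoencoding), which is highly relevant to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (10 points). The remaining keywords mainly concern large language model (LLM) techniques, reasoning, alignment, and optimization, which are unrelated to the paper's focus on computer vision and medical image analysis, so they all score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of learning representations of sparse vessel structures in OCTA images, the paper proposes VAMAE, a vessel-aware masked autoencoding pre-training framework that uses anatomically informed masking and multi-target reconstruction to consistently improve vessel segmentation, particularly in limited-label settings.

Abstract (Chinese translation)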

光学相干断层扫描血管成像（OCTA）能够无创地呈现视网膜微血管结构，但由于血管结构稀疏且拓扑约束强，学习其鲁棒表征仍具挑战性。现有的许多自监督学习方法（包括掩码自编码器）主要针对密集的自然图像设计，依赖于均匀掩码和像素级重建，这可能不足以有效捕捉血管的几何形态。
我们提出VAMAE，一种面向OCTA图像自监督预训练的血管感知掩码自编码框架。该方法结合了基于解剖学信息的掩码策略，利用血管显著性和骨架线索强调血管富集区域，促使模型关注血管连通性与分支模式。此外，预训练目标包含对多个互补目标的重建，使模型能够同时捕获外观、结构及拓扑信息。
我们在OCTA-500基准数据集上，针对不同监督程度的多种血管分割任务评估了所提出的预训练策略。结果表明，与标准的掩码自编码基线方法相比，血管感知掩码与多目标重建策略带来了持续的性能提升，尤其在标签有限的情况下效果显著，这揭示了几何感知的自监督学习在OCTA分析中的应用潜力。

Abstract

Optical coherence tomography angiography (OCTA) provides non-invasive visualization of retinal microvasculature, but learning robust representations remains challenging due to sparse vessel structures and strong topological constraints. Many existing self-supervised learning approaches, including masked autoencoders, are primarily designed for dense natural images and rely on uniform masking and pixel-level reconstruction, which may inadequately capture vascular geometry. We propose VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. The approach incorporates anatomically informed masking that emphasizes vessel-rich regions using vesselness and skeleton-based cues, encouraging the model to focus on vascular connectivity and branching patterns. In addition, the pretraining objective includes reconstructing multiple complementary targets, enabling the model to capture appearance, structural, and topological information. We evaluate the proposed pretraining strategy on the OCTA-500 benchmark for several vessel segmentation tasks under varying levels of supervision. The results indicate that vessel-aware masking and multi-target reconstruction provide consistent improvements over standard masked autoencoding baselines, particularly in limited-label settings, suggesting the potential of geometry-aware self-supervised learning for OCTA analysis.
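
The anatomically informed masking idea can be illustrated with a short sketch that masks patches with probability weighted by a per-patch vesselness score. The abstract does not specify the sampling rule, so the weighted sampling scheme (Efraimidis-Spirakis keys), the `floor` term, and the function name `vessel_aware_mask` are assumptions made for illustration only.

```python
import random


def vessel_aware_mask(vesselness, mask_ratio=0.75, floor=0.05, seed=0):
    """Return sorted indices of patches to mask, biased toward vessel-rich patches.

    vesselness : per-patch scores in [0, 1] (hypothetical, e.g. mean filter response)
    floor      : small weight so background patches can still be masked
    """
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(vesselness))
    # Weighted sampling without replacement: draw u ~ U(0, 1) per patch and rank
    # by u ** (1 / weight); larger weights tend to produce larger keys.
    keys = [
        (rng.random() ** (1.0 / (w + floor)), idx)
        for idx, w in enumerate(vesselness)
    ]
    keys.sort(reverse=True)
    return sorted(idx for _, idx in keys[:n_mask])


patch_scores = [0.9, 0.8, 0.0, 0.1, 0.7, 0.0, 0.05, 0.6]
mask = vessel_aware_mask(patch_scores, mask_ratio=0.5)
```

Uniform masking corresponds to giving every patch the same weight; raising `floor` moves this sketch back toward that baseline.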

Keywords: OCTA, masked autoencoding, self-supervised learning, vessel segmentation, medical image analysis, pre-training, vascular geometry, OCTA-500

252. ❌ LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

Authors: Shuai Li, Huibin Bai, Yanbo Gao, Chong Lv, Hui Yuan, Chuankun Li, Wei Hua, Tian Xie Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06576v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

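Scoring rationale: The paper addresses monocular depth estimation (MDE) in computer vision and proposes LiftFormer, a method based on lifting theory and frame theory that constructs an intermediate subspace bridging image color features and depth values. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning principles, or AI applications in science, whereas this paper studies a traditional computer vision task and does not involve large models, novel deep learning techniques, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes LiftFormer, a method based on lifting theory and frame theory for monocular depth estimation, which builds depth-oriented and edge-aware subspace representations and achieves state-of-the-art performance on multiple datasets.

Abstract (Chinese translation)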

单目深度估计（Monocular Depth Estimation, MDE）因其在三维视觉中的重要作用，近年来受到越来越多的关注。MDE旨在从单目图像或视频中估计深度图以表征场景的三维结构，这是一个高度不适定问题。为解决此问题，本文提出一种基于提升理论拓扑的LiftFormer，用于构建一个连接图像颜色特征与深度值的中间子空间，以及一个增强边缘区域深度预测的子空间。该方法通过将深度值预测问题转化为深度导向几何表示（Depth-oriented Geometric Representation, DGR）子空间特征表示来构建MDE框架，从而搭建从颜色值到几何深度值的学习桥梁。基于框架理论，我们利用与深度区间对应的线性相关向量构建了DGR子空间，以提供冗余且鲁棒的特征表示。图像空间特征被转换到DGR子空间中，这些特征直接与深度值相对应。此外，考虑到深度图中边缘区域通常呈现剧烈变化且易被错误预测，本文构建了一个边缘感知表示（Edge-aware Representation, ER）子空间，将深度特征进行转换并进一步用于增强边缘周围的局部特征。实验结果表明，我们的LiftFormer在多个广泛使用的数据集上达到了最先进的性能，消融研究也验证了所提出的两个提升模块在LiftFormer中的有效性。

Abstract

Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

Keywords: Monocular Depth Estimation, LiftFormer, Lifting Theory, Frame Theory, Depth-oriented Geometric Representation, Edge-aware Representation, Subspace Representation, 3D Vision

253. ❌ Measurement of Generative AI Workload Power Profiles for Whole-Facility Data Center Infrastructure Planning

Authors: Roberto Vercellino, Jared Willard, Gustavo Campos, Weslley da Silva Pereira, Olivia Hull, Matthew Selensky, Juliane Mueller Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07345v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

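Scoring rationale: The paper studies power measurement of generative AI workloads and data center infrastructure planning, which belongs to AI infrastructure and energy efficiency. It focuses on power measurement methodology, dataset creation, and energy modeling, and does not cover large-model principles, training methods, inference optimization, alignment techniques, or AI applications. All keywords concern such specific technical topics, whereas this paper is infrastructure-level research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents a methodology that links high-resolution AI workload power measurements to whole-facility energy demand, releases a public dataset of power profiles, and scales them with a bottom-up, event-driven data center energy model to inform infrastructure planning for grid connection, on-site generation, and distributed microgrids.

Abstract (Chinese translation)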

生成式人工智能（AI）的快速发展带来了前所未有的计算需求，导致数据中心的能源足迹显著增加。然而，现有的功耗数据大多具有专有性，且报告的分辨率各不相同，这为估算整个设施的能源使用情况和规划基础设施带来了挑战。本研究提出一种方法，通过将高分辨率工作负载功耗测量与整个设施的能源需求相连接，以弥合这一差距。利用NLR配备NVIDIA H100 GPU的高性能计算数据中心，我们以0.1秒的分辨率测量了AI训练、微调和推理任务的工作负载功耗。工作负载使用MLCommons基准测试进行模型训练和微调表征，并采用vLLM基准测试进行推理表征，从而实现可复现且标准化的工作负载分析。功耗特征数据集已公开提供。随后，我们采用自下而上、事件驱动的数据中心能源模型，将这些功耗特征扩展至整个设施层面。由此得到的全设施能源特征捕捉了由AI工作负载和用户行为驱动的真实时间波动，可用于为电网接入、现场能源发电和分布式微电网等基础设施规划提供参考。

Abstract

The rapid growth of generative artificial intelligence (AI) has introduced unprecedented computational demands, driving significant increases in the energy footprint of data centers. However, existing power consumption data is largely proprietary and reported at varying resolutions, creating challenges for estimating whole-facility energy use and planning infrastructure. In this work, we present a methodology that bridges this gap by linking high-resolution workload power measurements to whole-facility energy demand. Using NLR’s high-performance computing data center equipped with NVIDIA H100 GPUs, we measure power consumption of AI workloads at 0.1-second resolution for AI training, fine-tuning and inference jobs. Workloads are characterized using MLCommons benchmarks for model training and fine-tuning, and vLLM benchmarks for inference, enabling reproducible and standardized workload profiling. The dataset of power consumption profiles is made publicly available. These power profiles are then scaled to the whole-facility-level using a bottom-up, event-driven, data center energy model. The resulting whole-facility energy profiles capture realistic temporal fluctuations driven by AI workloads and user-behavior, and can be used to inform infrastructure planning for grid connection, on-site energy generation, and distributed microgrids.
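
The bottom-up scaling step can be pictured as superimposing per-job power traces on a shared timeline and applying a facility overhead factor. The sketch below is a deliberately simplified stand-in for the paper's event-driven model: the job tuple layout, the constant `pue` multiplier, the `idle_w` baseline, and the function name `facility_profile` are hypothetical and not taken from the paper.

```python
def facility_profile(jobs, n_steps, idle_w=0.0, pue=1.2):
    """Aggregate job power traces into a whole-facility power trace.

    jobs    : list of (start_step, profile) pairs, where profile is a list of
              watts per time step (e.g. one value per 0.1 s sample)
    n_steps : length of the facility timeline in time steps
    idle_w  : assumed constant IT baseline in watts
    pue     : assumed constant overhead multiplier for cooling and distribution
    """
    it_power = [idle_w] * n_steps
    for start_step, profile in jobs:
        # Each job start is an event; its trace is added from that step onward.
        for offset, watts in enumerate(profile):
            t = start_step + offset
            if 0 <= t < n_steps:
                it_power[t] += watts
    return [p * pue for p in it_power]


jobs = [(0, [100.0, 200.0]), (1, [50.0, 50.0])]
total = facility_profile(jobs, n_steps=4, pue=1.0)
```

With `pue=1.0` the result is simply the summed IT load, which makes the overlap between jobs easy to inspect before any overhead is applied.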

Keywords: generative AI, power consumption, data center, energy footprint, infrastructure planning, workload profiling, energy modeling, H100 GPUs

254. ❌ How to sketch a learning algorithm

Authors: Sam Gunn Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07328v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

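Scoring rationale: The paper studies how deleting training data affects model outputs, which belongs to machine learning theory and interpretability, and has no direct connection to most large-model keywords (LLM, MoE, SFT, RAG, quantization, and so on). The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI', because the paper involves predicting model behavior and interpretability, but this is not its core focus, hence 5 points. No other keyword is covered, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper presents a data deletion scheme that predicts, with vanishing error ε in the deep learning setting, what a model would output had a given subset of training data been excluded; its precomputation and prediction algorithms are only poly(1/ε) factors slower than regular training and inference.

Abstract (Chinese translation)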

训练数据的选择如何影响人工智能模型？这一问题对于可解释性、隐私和基础科学具有核心意义。其关键在于数据删除问题：在完成合理量的预计算后，如何快速预测若从学习算法中排除某个给定的训练数据子集，模型在特定情境下将如何表现。
我们提出了一种数据删除方案，能够在深度学习场景中以可忽略误差$\varepsilon$预测模型输出。我们的预计算和预测算法相较于常规训练和推理过程，耗时仅分别增加$\mathrm{poly}(1/\varepsilon)$倍。存储需求相当于$\mathrm{poly}(1/\varepsilon)$个模型的存储量。
我们的证明基于一项称为“稳定性”的假设。与先前研究采用的假设不同，稳定性似乎与学习强大的人工智能模型完全兼容。为支持这一观点，我们通过一组微型生成式预训练模型（microgpt）的最小化实验验证了稳定性条件。相关代码可在https://github.com/SamSpo1/microgpt-sketch获取。
在技术层面，本研究基于一种新方法：通过计算随机复方向上的高阶导数，对算术电路进行局部速写（sketching）。前向模式自动微分使这些导数的计算成本很低。

Abstract

How does the choice of training data influence an AI model? This question is of central importance to interpretability, privacy, and basic science. At its core is the data deletion problem: after a reasonable amount of precomputation, quickly predict how the model would behave in a given situation if a given subset of training data had been excluded from the learning algorithm. We present a data deletion scheme capable of predicting model outputs with vanishing error $\varepsilon$ in the deep learning setting. Our precomputation and prediction algorithms are only $\mathrm{poly}(1/\varepsilon)$ factors slower than regular training and inference, respectively. The storage requirements are those of $\mathrm{poly}(1/\varepsilon)$ models. Our proof is based on an assumption that we call “stability.” In contrast to the assumptions made by prior work, stability appears to be fully compatible with learning powerful AI models. In support of this, we show that stability is satisfied in a minimal set of experiments with microgpt. Our code is available at https://github.com/SamSpo1/microgpt-sketch. At a technical level, our work is based on a new method for locally sketching an arithmetic circuit by computing higher-order derivatives in random complex directions. Forward-mode automatic differentiation allows cheap computation of these derivatives.

Keywords: data deletion, training data influence, model behavior prediction, stability assumption, deep learning, interpretability, automatic differentiation, arithmetic circuit sketching

255. ❌ Gaussian Approximation for Asynchronous Q-learning

Authors: Artemy Rubtsov, Sergey Samsonov, Vladimir Ulyanov, Alexey Naumov Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07323v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

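Scoring rationale: The paper is a theoretical convergence analysis of the asynchronous Q-learning algorithm, which belongs to reinforcement learning and is entirely unrelated to the scoring keywords (all of which concern large models, deep learning principles, or AI-for-science applications). It does not involve any large-model technique, training method, inference optimization, AI agent, or AI-for-science application.

!!! tip deepseek-chat TL;DR

The paper derives convergence rates in the high-dimensional central limit theorem for Polyak-Ruppert averaged asynchronous Q-learning with polynomial step sizes, establishing a rate of order up to $n^{-1/6} \log^{4}(nSA)$ under a uniformly geometrically ergodic Markov chain assumption.

Abstract (Chinese translation)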

本文推导了采用多项式步长 $k^{-\omega}$，$\omega \in (1/2, 1]$ 的异步Q-learning算法生成的Polyak-Ruppert平均迭代在高维中心极限定理中的收敛速率。假设状态-动作-下一状态三元组序列 $(s_k, a_k, s_{k+1})_{k \geq 0}$ 构成一致几何遍历的马尔可夫链，我们在超矩形类上建立了高达 $n^{-1/6} \log^{4} (nS A)$ 阶的收敛速率，其中 $n$ 是算法使用的样本数，$S$ 和 $A$ 分别表示状态数和动作数。为获得此结果，我们证明了一个关于鞅差和的高维中心极限定理，该定理可能具有独立的理论价值。最后，我们给出了算法最后迭代的高阶矩的界。

Abstract

In this paper, we derive rates of convergence in the high-dimensional central limit theorem for Polyak-Ruppert averaged iterates generated by the asynchronous Q-learning algorithm with a polynomial stepsize $k^{-\omega}$, $\omega \in (1/2, 1]$. Assuming that the sequence of state-action-next-state triples $(s_k, a_k, s_{k+1})_{k \geq 0}$ forms a uniformly geometrically ergodic Markov chain, we establish a rate of order up to $n^{-1/6} \log^{4} (nS A)$ over the class of hyper-rectangles, where $n$ is the number of samples used by the algorithm and $S$ and $A$ denote the numbers of states and actions, respectively. To obtain this result, we prove a high-dimensional central limit theorem for sums of martingale differences, which may be of independent interest. Finally, we present bounds for high-order moments for the algorithm’s last iterate.
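
For readers unfamiliar with the object being analyzed, the following sketch runs asynchronous Q-learning along a single trajectory with stepsize $k^{-\omega}$ and keeps the Polyak-Ruppert average of the iterates. The tabular MDP format, the uniform behaviour policy, the discount factor, and the function name `async_q_learning` are assumptions chosen for illustration; the paper's exact setting may differ.

```python
import random


def async_q_learning(P, R, gamma=0.9, omega=0.75, n=10000, seed=0):
    """Asynchronous Q-learning with Polyak-Ruppert averaging on a tabular MDP.

    P[s][a] : list of next-state probabilities (hypothetical toy MDP)
    R[s][a] : deterministic reward
    omega   : stepsize exponent, alpha_k = k ** (-omega) with omega in (1/2, 1]
    """
    rng = random.Random(seed)
    n_states, n_actions = len(P), len(P[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    q_bar = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for k in range(1, n + 1):
        a = rng.randrange(n_actions)  # assumed uniform behaviour policy
        s_next = rng.choices(range(n_states), weights=P[s][a])[0]
        alpha = k ** (-omega)
        target = R[s][a] + gamma * max(q[s_next])
        # Asynchronous update: only the visited (s, a) entry changes.
        q[s][a] += alpha * (target - q[s][a])
        # Running mean of all iterates q_1, ..., q_k (Polyak-Ruppert average).
        for i in range(n_states):
            for j in range(n_actions):
                q_bar[i][j] += (q[i][j] - q_bar[i][j]) / k
        s = s_next
    return q, q_bar


P = [[[0.5, 0.5], [1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]]
R = [[0.0, 1.0], [0.5, 0.0]]
q_last, q_avg = async_q_learning(P, R, n=2000)
```

The central limit theorem in the abstract concerns the fluctuations of the averaged iterate (`q_avg` here) around the optimal Q-function, while the moment bounds concern the last iterate (`q_last`).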

Keywords: Asynchronous Q-learning, Central Limit Theorem, Convergence Rate, Markov Chain, Polyak-Ruppert Averaging, High-dimensional Analysis, Martingale Differences

256. ❌ Graph Neural ODE Digital Twins for Control-Oriented Reactor Thermal-Hydraulic Forecasting Under Partial Observability

Authors: Akzhol Almukhametov, Doyeong Lim, Rui Hu, Yang Liu Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07292v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

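Scoring rationale: The paper studies a physics-informed digital twin model based on graph neural networks and neural ordinary differential equations for thermal-hydraulic forecasting in nuclear reactors. Its content is entirely unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the work applies AI to science (specifically nuclear engineering); however, the paper does not explicitly mention bioinformatics or cheminformatics, and its core is domain-specific physical modeling rather than a general AI-for-science method, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a model that couples a physics-informed graph neural network with a neural ordinary differential equation for real-time thermal-hydraulic state forecasting of advanced nuclear reactors under partial observability. It reports an average MAE of 0.91 K at 60 s on uninstrumented nodes, fast inference on a single GPU, and adaptation to an experimental facility with only 30 training sequences.

Abstract (Chinese translation)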

先进反应堆的实时监控需要对全厂热工水力状态进行精确预测，包括物理传感器无法部署的位置。满足这一需求需要代理模型同时具备预测保真度、毫秒级推理能力以及对部分可观测性的鲁棒性。本研究提出了一种物理信息驱动的消息传递图神经网络与神经常微分方程耦合模型（GNN-ODE），以同步满足上述三项要求。我们将整个系统表示为有向传感器图，其边通过流动/传热感知的消息传递编码水力连通性，并通过受控神经常微分方程在连续时间内推进潜在动力学。拓扑引导的缺失节点初始化器在推演开始时重建未布设仪表的状态，随后完全以自回归方式进行预测。该GNN-ODE代理模型在系统动力学预测中取得了令人满意的结果。在保留的仿真瞬态测试中，代理模型对未布设仪表节点的预测在60秒时平均MAE为0.91K，300秒时为2.18K，缺失节点状态重建的$R^2$最高达0.995。在单GPU上推理速度约为模拟时间的105倍，支持64成员集成推演以进行不确定性量化。为评估仿真到实际的迁移能力，我们采用分层判别性微调方法，仅使用30组训练序列将预训练代理模型适配至实验设施数据。学习到的流动相关传热缩放关系恢复了与既有关联式一致的雷诺数指数，表明模型实现了超越轨迹拟合的本构关系学习。该模型能够跟踪急剧的功率变化瞬态，并在未布设仪表位置生成精确的轨迹。

Abstract

Real-time supervisory control of advanced reactors requires accurate forecasting of plant-wide thermal-hydraulic states, including locations where physical sensors are unavailable. Meeting this need calls for surrogate models that combine predictive fidelity, millisecond-scale inference, and robustness to partial observability. In this work, we present a physics-informed message-passing Graph Neural Network coupled with a Neural Ordinary Differential Equation (GNN-ODE) to address all three requirements simultaneously. We represent the whole system as a directed sensor graph whose edges encode hydraulic connectivity through flow/heat transfer-aware message passing, and we advance the latent dynamics in continuous time via a controlled Neural ODE. A topology-guided missing-node initializer reconstructs uninstrumented states at rollout start; prediction then proceeds fully autoregressively. The GNN-ODE surrogate achieves satisfactory results for the system dynamics prediction. On held-out simulation transients, the surrogate achieves an average MAE of 0.91 K at 60 s and 2.18 K at 300 s for uninstrumented nodes, with $R^2$ up to 0.995 for missing-node state reconstruction. Inference runs at approximately 105 times faster than simulated time on a single GPU, enabling 64-member ensemble rollouts for uncertainty quantification. To assess sim-to-real transfer, we adapt the pretrained surrogate to experimental facility data using layerwise discriminative fine-tuning with only 30 training sequences. The learned flow-dependent heat-transfer scaling recovers a Reynolds-number exponent consistent with established correlations, indicating constitutive learning beyond trajectory fitting. The model tracks a steep power change transient and produces accurate trajectories at uninstrumented locations.

Keywords: Graph Neural Network, Neural Ordinary Differential Equation, Digital Twin, Thermal-Hydraulic Forecasting, Partial Observability, Physics-informed Model, Reactor Control, Surrogate Model

257. ❌ SL-FAC: A Communication-Efficient Split Learning Framework with Frequency-Aware Compression

Authors: Zehang Lin, Miao Yang, Haihan Zhu, Zheng Lin, Jianhao Huang, Jing Yang, Guangjin Pan, Dianxin Luan, Zihan Fang, Shunzhi Zhu, Wei Ni, John Thompson Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07316v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

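Scoring rationale: SL-FAC focuses on communication efficiency in distributed machine learning, specifically for split learning frameworks. Its core innovation is a method that combines adaptive frequency decomposition with frequency-based quantization compression to reduce the communication overhead of transmitting activations and gradients. This is highly relevant to 'Quantization OR Model Compression OR Low-bit Weights', because the paper's core technique involves quantization compression (the FQC component). However, the paper does not explicitly involve large language models, innovations in deep learning principles, or applications of large models in science. It discusses general neural networks and distributed training frameworks rather than anything specific to large models or AI for science. Therefore, apart from the quantization/compression keyword, all other keywords are irrelevant and score 0.

!!! tip deepseek-chat TL;DR

To address the communication bottleneck in split learning caused by growing model complexity and device counts, the paper proposes the SL-FAC framework, which uses adaptive frequency decomposition and frequency-based quantization compression to significantly reduce communication volume while preserving the information most crucial for model convergence, improving training efficiency.

Abstract (Chinese translation)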

神经网络日益增长的复杂性阻碍了分布式机器学习在资源受限设备上的部署。分割学习通过将大型模型进行划分，并将主要训练工作量从边缘设备卸载到边缘服务器，提供了一种有前景的解决方案。然而，参与设备数量的增加和模型复杂性的提升，导致传输破碎数据（如激活值和梯度）产生显著的通信开销，这构成了分割学习的关键瓶颈。为应对这一挑战，我们提出了SL-FAC，一个通信高效的分割学习框架，包含两个关键组件：自适应频率分解和基于频率的量化压缩。AFD首先将破碎数据转换到频域，并将其分解为具有不同信息的频谱分量。随后，FQC根据每个分量的频谱能量分布，对其应用定制化的量化比特宽度。这种协同方法使SL-FAC能够在显著减少通信量的同时，策略性地保留对模型收敛至关重要的信息。大量实验证实了SL-FAC在提升训练效率方面的卓越性能。

Abstract

The growing complexity of neural networks hinders the deployment of distributed machine learning on resource-constrained devices. Split learning (SL) offers a promising solution by partitioning the large model and offloading the primary training workload from edge devices to an edge server. However, the increasing number of participating devices and model complexity leads to significant communication overhead from the transmission of smashed data (e.g., activations and gradients), which constitutes a critical bottleneck for SL. To tackle this challenge, we propose SL-FAC, a communication-efficient SL framework comprising two key components: adaptive frequency decomposition (AFD) and frequency-based quantization compression (FQC). AFD first transforms the smashed data into the frequency domain and decomposes it into spectral components with distinct information. FQC then applies customized quantization bit widths to each component based on its spectral energy distribution. This collaborative approach enables SL-FAC to achieve significant communication reduction while strategically preserving the information most crucial for model convergence. Extensive experiments confirm the superior performance of SL-FAC for improving the training efficiency.
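
The AFD and FQC pipeline can be sketched on a one-dimensional activation vector: transform to the frequency domain, split the coefficients into bands, and quantize each band with a bit width that grows with its share of spectral energy. The abstract does not give the transform or the bit allocation rule, so the DCT, the equal-width bands, the linear allocation between `min_bits` and `max_bits`, and all function names below are assumptions for illustration.

```python
import math


def dct(x):
    """Orthonormal DCT-II (naive O(n^2) version, fine for a small sketch)."""
    n = len(x)
    return [
        math.sqrt((1 if k == 0 else 2) / n)
        * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    return [
        sum(
            math.sqrt((1 if k == 0 else 2) / n)
            * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(n)
        )
        for t in range(n)
    ]


def quantize(values, bits):
    """Uniform quantization of a band to 2 ** bits levels over its own range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def compress(x, n_bands=4, min_bits=2, max_bits=8):
    """Quantize each frequency band with bits proportional to its energy share."""
    coeffs = dct(x)
    size = math.ceil(len(coeffs) / n_bands)
    bands = [coeffs[i:i + size] for i in range(0, len(coeffs), size)]
    total = sum(c * c for c in coeffs) or 1.0
    out, bits_used = [], []
    for band in bands:
        share = sum(c * c for c in band) / total
        bits = min_bits + round((max_bits - min_bits) * share)
        bits_used.append(bits)
        out.extend(quantize(band, bits))
    return idct(out), bits_used


signal = [math.sin(0.2 * t) for t in range(32)]
recon, bits = compress(signal)
```

For a smooth signal most energy sits in the lowest band, so that band receives the widest bit width while the high-frequency bands are coarsely quantized.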

Keywords: split learning, communication efficiency, frequency-aware compression, adaptive frequency decomposition, quantization, distributed machine learning, edge computing, model training

258. ❌ The Theory and Practice of Highly Scalable Gaussian Process Regression with Nearest Neighbours

Authors: Robert Allison, Tomasz Maciazek, Anthony Stephenson Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07267v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

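Rationale: The paper focuses on the theoretical development of, and scalable methods for, Gaussian process regression (NNGP/GPnn), which belongs to traditional machine learning and statistical learning. All scoring keywords concern large models, deep learning, AI alignment, reasoning, agents, scientific AI applications, and similar topics, none of which the paper touches on, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper establishes a complete theoretical framework for nearest-neighbour Gaussian process regression (NNGP/GPnn), proving statistical consistency, convergence rates, and hyper-parameter robustness in the large-sample regime, and thereby provides a scalable and theoretically rigorous alternative for large-scale Gaussian process modelling.

Abstract (Chinese translation)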

高斯过程（$GP$）回归是一种广泛使用的非参数建模工具，但其计算复杂度随训练数据量呈三次方增长，限制了其在大规模数据集上的应用。一种实用的解决方案是仅利用每个测试点的最近邻数据进行预测，例如针对地理空间问题的最近邻高斯过程（$NNGP$）回归，以及适用于更广泛机器学习应用的相关可扩展方法$GPnn$。尽管这些方法在实证中表现优异，但关于$NNGP/GPnn$的大样本理论仍不完善。本文为$NNGP$和$GPnn$回归建立了一个理论框架。在温和的正则性假设下，我们推导了三个关键预测准则的几乎必然逐点极限：均方误差（$MSE$）、校准系数（$CAL$）和负对数似然（$NLL$）。随后，我们研究了$L_2$风险，证明了其普遍一致性，并指出该风险达到了Stone极小极大速率$n^{-2α/(2p+d)}$，其中$α$和$p$刻画了回归问题的正则性。我们还证明了$MSE$在紧超参数集上的一致收敛性，并表明其对长度尺度、核尺度及噪声方差的导数渐近趋于零，且给出了明确的收敛速率。这解释了$GPnn$在超参数调优中表现出的鲁棒性。上述结果为$NNGP/GPnn$提供了严谨的统计学基础，使其成为完整$GP$模型的一种高度可扩展且具有理论依据的替代方法。

Abstract

Gaussian process ($GP$) regression is a widely used non-parametric modeling tool, but its cubic complexity in the training size limits its use on massive data sets. A practical remedy is to predict using only the nearest neighbours of each test point, as in Nearest Neighbour Gaussian Process ($NNGP$) regression for geospatial problems and the related scalable $GPnn$ method for more general machine-learning applications. Despite their strong empirical performance, the large-$n$ theory of $NNGP/GPnn$ remains incomplete. We develop a theoretical framework for $NNGP$ and $GPnn$ regression. Under mild regularity assumptions, we derive almost sure pointwise limits for three key predictive criteria: mean squared error ($MSE$), calibration coefficient ($CAL$), and negative log-likelihood ($NLL$). We then study the $L_2$-risk, prove universal consistency, and show that the risk attains Stone’s minimax rate $n^{-2α/(2p+d)}$, where $α$ and $p$ capture regularity of the regression problem. We also prove uniform convergence of $MSE$ over compact hyper-parameter sets and show that its derivatives with respect to lengthscale, kernel scale, and noise variance vanish asymptotically, with explicit rates. This explains the observed robustness of $GPnn$ to hyper-parameter tuning. These results provide a rigorous statistical foundation for $NNGP/GPnn$ as a highly scalable and principled alternative to full $GP$ models.

Keywords: Gaussian Process Regression, Nearest Neighbour Gaussian Process, Scalability, Statistical Theory, Consistency, Risk Minimax Rate, Hyper-parameter Robustness, Large-n Theory
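
The prediction step that the abstract describes, conditioning only on the nearest neighbours of each test point, can be sketched as follows. This is a minimal illustration rather than the paper's method: it assumes a zero-mean GP with a squared-exponential kernel, and the names `gpnn_predict`, `m`, `noise`, `lengthscale`, and `scale` are chosen here for the example. Plain Python is used to keep the block dependency-free.

```python
import math

def rbf(a, b, lengthscale=1.0, scale=1.0):
    """Squared-exponential kernel between two points given as tuples."""
    d2 = sum((u - v) ** 2 for u, v in zip(a, b))
    return scale * math.exp(-0.5 * d2 / lengthscale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gpnn_predict(X, y, x_star, m=5, noise=0.1, lengthscale=1.0, scale=1.0):
    """Predictive mean and variance at x_star using only its m nearest neighbours.
    `noise` is the observation noise variance."""
    idx = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_star))[:m]
    Xn, yn = [X[i] for i in idx], [y[i] for i in idx]
    K = [[rbf(p, q, lengthscale, scale) + (noise if i == j else 0.0)
          for j, q in enumerate(Xn)] for i, p in enumerate(Xn)]
    k_star = [rbf(p, x_star, lengthscale, scale) for p in Xn]
    alpha = solve(K, yn)        # (K + noise * I)^-1 y
    v = solve(K, k_star)        # (K + noise * I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = scale + noise - sum(k * w for k, w in zip(k_star, v))
    return mean, var

# Usage on a toy 1-D regression problem.
X = [(i / 10,) for i in range(50)]
y = [math.sin(p[0]) for p in X]
mean, var = gpnn_predict(X, y, (2.05,))
```

Each prediction solves only an m-by-m system, so the cost per test point does not grow with the cubic rate in the training size that the abstract attributes to full GP regression (the neighbour search here is a simple sort, which a real system would replace with an index).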

259. ❌ Tracking Adaptation Time: Metrics for Temporal Distribution Shift

Authors: Lorenzo Iovine, Giacomo Ziffer, Emanuele Della Valle Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07266v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

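Rationale: The paper studies evaluation metrics for models under temporal distribution shift, which is general methodological research on machine learning model evaluation. The abstract makes no mention of any specific technique or application tied to the keywords, such as large models, the principles of deep learning techniques, or AI for Science. All keywords concern large-model techniques, deep learning innovations, or specific scientific AI applications, whereas this paper discusses general-purpose model evaluation metrics and is entirely unrelated to those areas.

!!! tip deepseek-chat TL;DR

The paper proposes three new evaluation metrics that distinguish whether, under temporal distribution shift, a model has failed to adapt to changing data or the data itself has become more challenging, giving a more accurate picture of model robustness in dynamic environments.

Abstract (Chinese translation)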

评估时间分布漂移下的模型鲁棒性仍是一个开放挑战。现有指标主要量化性能的平均下降程度，却未能捕捉模型如何适应动态变化的数据。这导致对时间性性能退化的误读：当准确率下降时，我们难以判断究竟是模型未能适应数据变化，还是数据本身的内在学习难度已增加。本研究提出三项互补性指标，以区分模型适应能力与数据内在难度。这些指标共同构建了时间分布漂移下模型行为的动态可解释视图。实验结果表明，我们的指标能够揭示现有分析方法所掩盖的适应模式，从而为动态演化环境中的时间鲁棒性提供更丰富的理解。

Abstract

Evaluating robustness under temporal distribution shift remains an open challenge. Existing metrics quantify the average decline in performance, but fail to capture how models adapt to evolving data. As a result, temporal degradation is often misinterpreted: when accuracy declines, it is unclear whether the model is failing to adapt or whether the data itself has become inherently more challenging to learn. In this work, we propose three complementary metrics to distinguish adaptation from intrinsic difficulty in the data. Together, these metrics provide a dynamic and interpretable view of model behavior under temporal distribution shift. Results show that our metrics uncover adaptation patterns hidden by existing analysis, offering a richer understanding of temporal robustness in evolving environments.

Keywords: temporal distribution shift, adaptation metrics, model robustness, evaluation metrics, temporal degradation, intrinsic difficulty, dynamic environments

260. ❌ A comparative analysis of machine learning models in SHAP analysis

Authors: Justin Lin, Julia Fukuyama Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

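Rationale: The paper mainly studies the application and comparison of SHAP (SHapley Additive exPlanations) analysis across different machine learning models, and proposes a new generalization of the waterfall plot to multi-classification problems. Its core is research on interpretability methods for traditional machine learning models (such as random forests and gradient boosting), not large language models or deep learning techniques. The keywords revolve around large models and the principles and applications of deep learning techniques, which are unrelated to the paper's topic. 'Mechanistic Interpretability OR Explainable AI' therefore receives a relevance rating of 5, because SHAP falls within explainable AI, and all other keywords are rated 0.

!!! tip deepseek-chat TL;DR

The paper examines how SHAP analysis differs across machine learning models and data sets, and proposes a novel generalization of the waterfall plot to multi-classification problems.

Abstract (Chinese translation)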

在数据与技术日益发展的时代，大型黑箱模型因其处理海量数据并学习极其复杂数据模式的能力而逐渐成为主流。然而，这些方法的缺陷在于无法解释预测过程，使其在高风险场景中缺乏可信度且应用存在隐患。SHapley可加性解释（SHAP）分析作为一种可解释人工智能方法，因其能够依据原始特征解释模型预测而日益受到关注。对于数据集中的每个样本和特征，其关联的SHAP值可量化该特征对该样本预测结果的贡献度。通过对这些SHAP值的分析，可以深入理解模型的决策过程，进而推动数据驱动解决方案的构建。然而，SHAP值的解释依赖于具体模型，因此并不存在通用的分析流程。为推进相关研究，本文对不同机器学习模型和数据集上的SHAP分析进行了系统性探究。通过揭示SHAP分析背后的细节与细微差异，我们期望为这一尚待深入探索的领域的研究者提供支持。此外，本文还提出了一种针对多分类问题的瀑布图创新性推广方法。

Abstract

In this growing age of data and technology, large black-box models are becoming the norm due to their ability to handle vast amounts of data and learn incredibly complex data patterns. The deficiency of these methods, however, is their inability to explain the prediction process, making them untrustworthy and their use precarious in high-stakes situations. SHapley Additive exPlanations (SHAP) analysis is an explainable AI method growing in popularity for its ability to explain model predictions in terms of the original features. For each sample and feature in the data set, an associated SHAP value quantifies the contribution of that feature to the prediction of that sample. Analysis of these SHAP values provides valuable insight into the model’s decision-making process, which can be leveraged to create data-driven solutions. The interpretation of these SHAP values, however, is model-dependent, so there does not exist a universal analysis procedure. To aid in these efforts, we present a detailed investigation of SHAP analysis across various machine learning models and data sets. In uncovering the details and nuance behind SHAP analysis, we hope to empower analysts in this less-explored territory. We also present a novel generalization of the waterfall plot to the multi-classification problem.

Keywords: SHAP analysis, explainable AI, machine learning models, model interpretability, waterfall plot, multi-classification, feature importance, black-box models
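
The statement that a SHAP value quantifies one feature's contribution to one sample's prediction can be made concrete with an exact Shapley computation on a toy model. This sketch is an illustration only: it replaces absent features with a single fixed `baseline` vector, which is a simplification of how SHAP averages over background data, and it does not use or reproduce the API of the `shap` package. The function name `shapley_values` is chosen here.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at sample x.
    Features outside a coalition are set to their baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                phi[i] += w * (value(set(S) | {i}) - value(set(S)))
    return phi

# Usage: a toy model with one additive term and one interaction term.
model = lambda z: 2 * z[0] + z[1] * z[2]
phi = shapley_values(model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

The values sum to the difference between the prediction at the sample and the prediction at the baseline, and the interaction term is split evenly between the two features involved. The enumeration is exponential in the number of features, so it is only practical for small examples.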

261. ❌ Weaves, Wires, and Morphisms: Formalizing and Implementing the Algebra of Deep Learning

Authors: Vincent Abbott, Gioele Zardini Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07242v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

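Rationale: This paper proposes a category-theoretic framework for deep learning models, focused on formalizing the mathematical description of model architectures, including broadcasting and compositionality. However, all scoring keywords are specific to large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications, and so on). Although the paper concerns deep learning, it does not discuss LLMs, foundation models, or any of the specific techniques named in the keywords. It targets a general mathematical framework for deep learning models rather than any keyword topic, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper proposes a category-theoretic formalization of deep learning models that introduces axis-stride and array-broadcasted categories to precisely express and manipulate the mathematical functions underlying architectures, and provides Python and TypeScript implementations to demonstrate its generality.

Abstract (Chinese translation)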

尽管深度学习模型运行的是定义明确的数学函数，我们仍缺乏一个用于描述模型架构的正式数学框架。临时性的符号、图示和伪代码难以有效处理非线性广播现象，以及单个组件与复合模型之间的关系。本文提出了一种用于深度学习模型的范畴论框架，通过新颖的轴步长范畴和数组广播范畴将广播操作形式化。这使得架构背后的数学函数能够以组合方式被精确表达和操作。这些数学定义被转化为便于人工管理的图示和便于机器管理的数据结构。我们提供了Python（pyncd）和TypeScript（tsncd）的镜像实现，以展示本框架的普适性，同时展示了包括代数构造、图转换、PyTorch编译和图示渲染在内的功能。这为系统化、形式化的深度学习模型设计与分析奠定了基础。

Abstract

Despite deep learning models running well-defined mathematical functions, we lack a formal mathematical framework for describing model architectures. Ad-hoc notation, diagrams, and pseudocode poorly handle nonlinear broadcasting and the relationship between individual components and composed models. This paper introduces a categorical framework for deep learning models that formalizes broadcasting through the novel axis-stride and array-broadcasted categories. This allows the mathematical function underlying architectures to be precisely expressed and manipulated in a compositional manner. These mathematical definitions are translated into human manageable diagrams and machine manageable data structures. We provide a mirrored implementation in Python (pyncd) and TypeScript (tsncd) to show the universal aspect of our framework, along with features including algebraic construction, graph conversion, PyTorch compilation and diagram rendering. This lays the foundation for a systematic, formal approach to deep learning model design and analysis.

Keywords: categorical framework, deep learning models, broadcasting, mathematical formalization, model architectures, compositional models, axis-stride category, array-broadcasted category

262. ❌ How Does Machine Learning Manage Complexity?

Authors: Lance Fortnow Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07233v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

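Rationale: The paper analyzes machine learning models, in particular their ability to model complex systems, from the perspective of computational complexity theory, and is foundational research in machine learning theory. It focuses on theoretical concepts such as computational complexity, computable distributions, the class P/poly, and cryptographic pseudorandom generators, and does not involve any specific large-model technique (such as LLMs, MoE, SFT, or RLHF), application area (such as AI for Science), or engineering optimization (such as quantization or inference acceleration). None of the keywords relates directly to the paper's theoretical, abstract content, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies, from the perspective of computational complexity theory, how machine learning models manage complexity by modelling computable distributions, and shows that if a model minimizes error against the distribution generated by a cryptographic pseudorandom generator, its output distribution must be close to uniform.

Abstract (Chinese translation)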

我们通过计算复杂性的视角来理解机器学习模型的能力，特别是其建模复杂系统的潜力。机器学习模型通常在来自可抽样或更复杂分布的数据上进行训练，这些分布的范围远不止于可计算分布。通过聚焦于可计算分布，机器学习模型能够借助概率更有效地管理复杂性。我们抽象掉具体的学习机制，将机器学习建模为生成具有多项式有界最大熵的P/poly可计算分布。
我们通过以下例证说明学习可计算分布如何建模复杂性：若一个机器学习模型产生的分布$μ$能最小化与密码学伪随机生成器所生成分布之间的误差，则$μ$必然接近均匀分布。

Abstract

We provide a computational complexity lens to understand the power of machine learning models, particularly their ability to model complex systems. Machine learning models are often trained on data drawn from sampleable or more complex distributions, a far wider range of distributions than just computable ones. By focusing on computable distributions, machine learning models can better manage complexity via probability. We abstract away from specific learning mechanisms, modeling machine learning as producing P/poly-computable distributions with polynomially-bounded max-entropy. We illustrate how learning computable distributions models complexity by showing that if a machine learning model produces a distribution $μ$ that minimizes error against the distribution generated by a cryptographic pseudorandom generator, then $μ$ must be close to uniform.

Keywords: computational complexity, machine learning models, complex systems, computable distributions, P/poly, pseudorandom generator, max-entropy, error minimization

263. ❌ Diffusion Processes on Implicit Manifolds

Authors: Victor Kawasaki-Borruat, Clara Grotehans, Pierre Vandergheynst, Adam Gosztolai Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07213v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

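Rationale: The paper studies diffusion processes on implicit manifolds, at the intersection of differential geometry, stochastic processes, and generative modelling. It is entirely unrelated to the scoring keywords, all of which focus on the principles and applications of large-model and deep learning techniques, and nothing in it matches them.

!!! tip deepseek-chat TL;DR

The paper studies how to construct diffusion processes on implicit manifolds using only point cloud data, proposes a data-driven SDE and proves its convergence, and provides a theoretical basis for manifold-aware sampling and generative modelling.

Abstract (Chinese translation)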

高维数据常被建模为分布于低维流形附近。我们研究如何在隐式设定下于该数据流形上构建扩散过程。即仅使用点云样本，且无需借助坐标图、投影或其他几何基元。我们的主要贡献是提出一种数据驱动的随机微分方程（SDE），该方程在环境空间中定义，同时能捕捉底层流形的内在扩散特性。该构建方法依赖于从数据构建的邻近图中估计扩散过程的无穷小生成元及其卡雷-杜尚算子（CDC）。生成元与CDC共同编码了目标扩散过程的局部随机性与几何结构。我们证明，随着样本数量增加，所导出的过程在概率路径空间上依分布收敛于其光滑流形对应物。我们将此构建方法称为隐式流形值扩散（IMDs），并进一步提出使用欧拉-丸山积分法的数值模拟流程。这为数据流形上扩散动力学的实际实现提供了严格基础，并为流形感知的采样、探索与生成建模开辟了新方向。

Abstract

High-dimensional data are often modeled as lying near a low-dimensional manifold. We study how to construct diffusion processes on this data manifold in the implicit setting. That is, using only point cloud samples and without access to charts, projections, or other geometric primitives. Our main contribution is a data-driven SDE that captures intrinsic diffusion on the underlying manifold while being defined in ambient space. The construction relies on estimating the diffusion’s infinitesimal generator and its carré-du-champ (CDC) from a proximity graph built from the data. The generator and CDC together encode the local stochastic and geometric structure of the intended diffusion. We show that, as the number of samples grows, the induced process converges in law on the space of probability paths to its smooth manifold counterpart. We call this construction Implicit Manifold-valued Diffusions (IMDs), and furthermore present a numerical simulation procedure using Euler-Maruyama integration. This gives a rigorous basis for practical implementations of diffusion dynamics on data manifolds, and opens new directions for manifold-aware sampling, exploration, and generative modeling.

Keywords: diffusion processes, implicit manifolds, data-driven SDE, infinitesimal generator, carré-du-champ, manifold-aware sampling, generative modeling, Euler-Maruyama integration
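
The abstract mentions simulating the SDE with Euler-Maruyama integration. The generic scheme is sketched below for a scalar SDE with user-supplied drift and diffusion functions. It does not reproduce the paper's data-driven estimators for the generator or the carré-du-champ; the function name `euler_maruyama` and the Ornstein-Uhlenbeck example are assumptions made for illustration.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng=None):
    """Simulate dX = drift(X) dt + diffusion(X) dW with fixed step dt.
    Returns the whole path, including the starting point."""
    rng = rng or random.Random(0)
    x = x0
    path = [x0]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path

# Usage: an Ornstein-Uhlenbeck process pulled towards zero.
ou_path = euler_maruyama(lambda x: -x, lambda x: 0.5, x0=1.0, dt=0.01, n_steps=1000)

# With zero diffusion the scheme reduces to the explicit Euler method.
ode_path = euler_maruyama(lambda x: -x, lambda x: 0.0, x0=1.0, dt=0.1, n_steps=10)
```

In the setting of the paper, the drift and diffusion terms would come from quantities estimated on a proximity graph rather than from closed-form functions as in this toy example.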

264. ❌ Beyond the Mean: Modelling Annotation Distributions in Continuous Affect Prediction

Authors: Kosmas Pinitas, Ilias Maglogiannis Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07198v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

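Rationale: The paper studies continuous affect prediction in affective computing and proposes a Beta-distribution-based framework for modelling annotation consensus, an application of traditional machine learning and deep learning to affect analysis. All scoring keywords focus on large-model (LLM) techniques (such as MoE, RLHF, RAG, and quantization) or on applications of large models to science, whereas this paper involves no large-model technique and does not mention AI for Science subfields such as bioinformatics or cheminformatics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the neglect of annotation subjectivity and uncertainty in continuous affect prediction, the paper proposes a Beta-distribution-based modelling framework that predicts the mean and standard deviation of affect annotations and recovers higher-order distributional descriptors from them. On the SEWA and RECOLA datasets it closely matches the empirical annotation distributions while remaining competitive with conventional regression approaches.

Abstract (Chinese translation)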

情感标注本质上是主观且认知要求高的任务，其产生的信号反映的是标注者之间多样化的感知，而非单一的标准答案。在连续情感预测中，这种差异性通常被简化为均值或中位数等点估计，从而丢弃了关于标注者分歧与不确定性的宝贵信息。在本研究中，我们提出一种分布感知框架，利用Beta分布对标注共识进行建模。模型不再预测单一的情感值，而是估计标注分布的均值与标准差，并通过矩匹配将其转化为有效的Beta分布参数。该模型能够以闭式解的形式恢复高阶分布描述符，包括偏度、峰度和分位数。因此，模型不仅能捕捉情感感知的中心趋势，还能反映标注者反应的变异性、不对称性和不确定性。我们使用多模态特征在SEWA和RECOLA数据集上评估了所提出的方法。实验结果表明，基于Beta分布的建模所产生的预测分布与经验标注分布高度吻合，同时在与传统回归方法的性能比较中表现出竞争力。这些发现凸显了在情感计算中建模标注不确定性的重要性，并展示了分布感知学习在主观信号分析中的潜力。

Abstract

Emotion annotation is inherently subjective and cognitively demanding, producing signals that reflect diverse perceptions across annotators rather than a single ground truth. In continuous affect prediction, this variability is typically collapsed into point estimates such as the mean or median, discarding valuable information about annotator disagreement and uncertainty. In this work, we propose a distribution-aware framework that models annotation consensus using the Beta distribution. Instead of predicting a single affect value, models estimate the mean and standard deviation of the annotation distribution, which are transformed into valid Beta parameters through moment matching. This formulation enables the recovery of higher-order distributional descriptors, including skewness, kurtosis, and quantiles, in closed form. As a result, the model captures not only the central tendency of emotional perception but also variability, asymmetry, and uncertainty in annotator responses. We evaluate the proposed approach on the SEWA and RECOLA datasets using multimodal features. Experimental results show that Beta-based modelling produces predictive distributions that closely match the empirical annotator distributions while achieving competitive performance with conventional regression approaches. These findings highlight the importance of modelling annotation uncertainty in affective computing and demonstrate the potential of distribution-aware learning for subjective signal analysis.

Keywords: continuous affect prediction, annotation distribution, Beta distribution, uncertainty modelling, affective computing, multimodal features, distribution-aware learning, subjective signal analysis
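
The moment-matching step named in the abstract, turning a predicted mean and standard deviation into Beta parameters, follows from the standard Beta mean and variance formulas. The sketch below assumes annotations have already been rescaled to the open interval (0, 1); that rescaling and the function names are assumptions, since the abstract does not give these details.

```python
import math

def beta_from_moments(mean, std):
    """Method-of-moments Beta parameters (alpha, beta) for a given mean and std.
    Valid only when 0 < mean < 1 and std**2 < mean * (1 - mean)."""
    var = std ** 2
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    nu = mean * (1.0 - mean) / var - 1.0  # equals alpha + beta
    return mean * nu, (1.0 - mean) * nu

def beta_skewness(a, b):
    """Closed-form skewness of Beta(a, b)."""
    return 2.0 * (b - a) * math.sqrt(a + b + 1.0) / ((a + b + 2.0) * math.sqrt(a * b))

def beta_excess_kurtosis(a, b):
    """Closed-form excess kurtosis of Beta(a, b)."""
    num = 6.0 * ((a - b) ** 2 * (a + b + 1.0) - a * b * (a + b + 2.0))
    return num / (a * b * (a + b + 2.0) * (a + b + 3.0))

# Usage: a predicted mean of 0.3 with standard deviation 0.1.
alpha, beta = beta_from_moments(0.3, 0.1)
skew = beta_skewness(alpha, beta)
```

The validity check matters in practice: a model that outputs an unconstrained standard deviation can violate the variance bound, in which case no Beta distribution matches the two moments.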

265. ❌ Splats under Pressure: Exploring Performance-Energy Trade-offs in Real-Time 3D Gaussian Splatting under Constrained GPU Budgets

Authors: Muhammad Fahim Tajwar, Arthur Wuhrlin, Bhojan Anand Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07177v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

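Rationale: The paper studies performance-energy trade-offs for real-time 3D Gaussian Splatting (3DGS) rendering on edge devices, which falls under computer graphics, edge computing, and systems optimization. It is entirely unrelated to the scoring keywords, all of which revolve around the principles, training methods, inference optimization, alignment, and applications of large-model and deep learning techniques, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

By under-clocking the GPU and applying power caps to emulate GPUs of different capability tiers, the paper studies performance-energy trade-offs for real-time 3D Gaussian Splatting rendering on edge devices, analyzes the relationships among frame rate, power consumption, and energy efficiency, and offers early insights into these trade-offs for edge deployment.

Abstract (Chinese translation)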

本研究探讨了在不同高斯点云数量与GPU计算预算条件下，在边缘客户端实现实时三维高斯点云渲染（3DGS）的可行性。我们未采用评估多台物理设备的方法，而是基于单台高端GPU通过仿真手段模拟不同性能层级的GPU能力。通过系统性地降低GPU核心频率并施加功耗限制，我们模拟出可表征不同GPU能力层级的可控浮点性能范围。在此范围内的每个性能点上，我们测量了不同场景复杂度、渲染管线及优化方案下的帧率、运行时行为与功耗，从而能够分析功耗-性能关系，例如帧率-功耗曲线、单帧能耗及每瓦性能。该方法使我们能够近似模拟从嵌入式/移动级设备到高端消费级系统在内的多种GPU性能边界。
本研究旨在探索客户端三维高斯点云渲染的实际性能下限，并评估其在能源受限环境（包括独立头显设备与瘦客户端）中的部署潜力。通过此项分析，我们为边缘部署三维高斯点云渲染系统的可行性所涉及的性能-能耗权衡关系提供了早期见解。

Abstract

We investigate the feasibility of real-time 3D Gaussian Splatting (3DGS) rasterisation on edge clients with varying Gaussian splat counts and GPU computational budgets. Instead of evaluating multiple physical devices, we adopt an emulation-based approach that approximates different GPU capability tiers on a single high-end GPU. By systematically under-clocking the GPU core frequency and applying power caps, we emulate a controlled range of floating-point performance levels that approximate different GPU capability tiers. At each point in this range, we measure frame rate, runtime behaviour, and power consumption across scenes of varying complexity, pipelines, and optimisations, enabling analysis of power-performance relationships such as FPS-power curves, energy per frame, and performance per watt. This method allows us to approximate the performance envelope of a diverse class of GPUs, from embedded and mobile-class devices to high-end consumer-grade systems. Our objective is to explore the practical lower bounds of client-side 3DGS rasterisation and assess its potential for deployment in energy-constrained environments, including standalone headsets and thin clients. Through this analysis, we provide early insights into the performance-energy trade-offs that govern the viability of edge-deployed 3DGS systems.

Keywords: 3D Gaussian Splatting, real-time rendering, edge computing, GPU emulation, performance-energy trade-offs, power consumption, frame rate, energy-constrained environments
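
The derived quantities the abstract lists, energy per frame and performance per watt, are simple ratios of measured frame rate and power draw. The sketch below shows the arithmetic. The `Sample` class and its field names are hypothetical, and the numbers are placeholders for illustration, not measurements from the paper.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One operating point of an emulated GPU tier (hypothetical structure)."""
    core_mhz: int    # capped core clock
    power_w: float   # average power draw in watts
    fps: float       # average frames per second

    @property
    def energy_per_frame_j(self) -> float:
        # Watts are joules per second, so dividing by frames per second
        # gives joules per frame.
        return self.power_w / self.fps

    @property
    def fps_per_watt(self) -> float:
        return self.fps / self.power_w

# Placeholder values only, chosen to make the arithmetic easy to follow.
samples = [Sample(2000, 200.0, 160.0), Sample(1000, 50.0, 100.0)]
most_efficient = min(samples, key=lambda s: s.energy_per_frame_j)
```

Sweeping the clock and power cap and recording one such sample per setting would give the points of an FPS-power curve of the kind the abstract refers to.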

266. ❌ Improving Semantic Uncertainty Quantification in Language Model Question-Answering via Token-Level Temperature Scaling

Authors: Tom A. Lamb, Desi R. Ivanova, Philip H. S. Torr, Tim G. J. Rudner Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

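Scoring rationale: The paper focuses on calibrating semantic uncertainty quantification for language models, in particular using temperature scaling to improve confidence distributions in question answering. Because it directly addresses the calibration of language models (LLMs), it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (8 points). The remaining keywords (MoE, SLMs, scaling laws, training methods such as pre-training, fine-tuning, and alignment, inference optimization, agent systems, model compression, hallucination mitigation, AI-for-science applications, and so on) are not covered or mentioned in the paper, so they score 0.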

!!! tip deepseek-chat TL;DR

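The paper addresses poor calibration in semantic uncertainty quantification for language-model question answering. It proposes temperature scaling with a single optimised scalar temperature, which improves semantic calibration, discrimination, and downstream entropy.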

摘要翻译

校准是确保语义不确定性量化可靠性的核心，然而先前的研究主要集中于区分能力，而忽视了校准。由于校准与区分捕捉了不确定性的不同维度，仅关注区分能力会得到不完整的评估图景。我们通过系统性地评估多种置信度度量方法在这两个方面的表现，来填补这一空白。研究表明，当前方法（尤其是固定温度启发式策略）会产生系统性校准错误且区分能力较差的语义置信度分布。我们证明，优化单一标量温度参数——我们认为这提供了合适的归纳偏置——是一种出奇简单却有效的解决方案。我们详尽的评估证实，温度缩放法能持续改善语义校准、区分能力及下游任务中的熵表现，在问答任务中优于启发式基线方法和表达能力更强的词元级再校准方法。

摘要 (Abstract)

Calibration is central to reliable semantic uncertainty quantification, yet prior work has largely focused on discrimination, neglecting calibration. As calibration and discrimination capture distinct aspects of uncertainty, focusing on discrimination alone yields an incomplete picture. We address this gap by systematically evaluating both aspects across a broad set of confidence measures. We show that current approaches, particularly fixed-temperature heuristics, produce systematically miscalibrated and poorly discriminative semantic confidence distributions. We demonstrate that optimising a single scalar temperature, which, we argue, provides a suitable inductive bias, is a surprisingly simple yet effective solution. Our exhaustive evaluation confirms that temperature scaling consistently improves semantic calibration, discrimination, and downstream entropy, outperforming both heuristic baselines and more expressive token-level recalibration methods on question-answering tasks.

Keywords: semantic uncertainty quantification, calibration, language models, temperature scaling, question-answering, confidence measures, discrimination, token-level recalibration
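
The idea of optimising a single scalar temperature can be illustrated with generic temperature scaling: divide the logits by one scalar T and pick the T that minimises negative log-likelihood on held-out data. This is only a sketch under assumptions. The abstract does not state the paper's objective or how the temperature enters the semantic confidence measure, and the grid search, function names, and toy data below are hypothetical.

```python
import math


def log_softmax(logits, temperature):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= log_softmax(logits, temperature)[y]
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the single scalar temperature with the lowest held-out NLL (grid search)."""
    if grid is None:
        grid = [0.05 * k for k in range(1, 201)]  # 0.05, 0.10, ..., 10.0
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy held-out set: the model is very confident in class 0, but one of four
# examples actually belongs to class 1, so the raw logits are overconfident.
logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
labels = [0, 0, 0, 1]

t_star = fit_temperature(logits, labels)
print(t_star, mean_nll(logits, labels, 1.0), mean_nll(logits, labels, t_star))
```

On this toy set the fitted temperature is greater than 1, which softens the predicted probability of class 0 towards the observed 75% accuracy.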

267. ❌ Smart Commander: A Hierarchical Reinforcement Learning Framework for Fleet-Level PHM Decision Optimization

Authors: Yong Si, Mingfei Lu, Jing Li, Yang Hu, Guijiang Li, Yueheng Song, Zhaokui Wang Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07171v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

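Scoring rationale: The paper focuses on a hierarchical reinforcement learning (HRL) framework for military aviation PHM that optimizes maintenance and logistics decisions. It does not involve large models, the technical principles of deep learning, or AI-for-science applications, and has no direct connection to any of the keywords.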

!!! tip deepseek-chat TL;DR

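The paper proposes Smart Commander, a hierarchical reinforcement learning framework for decision-making in large-scale military aviation fleets, where the curse of dimensionality, sparse feedback, and stochastic mission profiles make the problem hard. It significantly outperforms conventional deep reinforcement learning and rule-based methods, with strong results in training time, scalability, and robustness.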

摘要翻译

军事航空预测与健康管理（PHM）中的决策面临重大挑战，这源于大规模机队运行中的“维数灾难”，以及稀疏反馈与随机任务剖面的共同影响。为解决这些问题，本文提出Smart Commander，一种新颖的分层强化学习（HRL）框架，旨在优化序列化维护与后勤决策。该框架将复杂控制问题分解为双层结构：战略层面的总指挥官管理机队级的可用性与成本目标，而战术层面的作战指挥官则执行具体的出动生成、维护调度与资源分配行动。所提方法在一个定制构建的高保真离散事件仿真环境中得到验证，该环境能捕捉飞机配置与保障后勤的动态特性。通过将分层奖励塑形与规划增强的神经网络相结合，该方法有效应对了奖励稀疏与延迟的难题。实证评估表明，Smart Commander显著优于传统的单体深度强化学习（DRL）及基于规则的基线方法。值得注意的是，它在显著减少训练时间的同时，在易发生故障的环境中展现出卓越的可扩展性与鲁棒性。这些结果凸显了HRL作为下一代智能机队管理可靠范式的潜力。

摘要 (Abstract)

Decision-making in military aviation Prognostics and Health Management (PHM) faces significant challenges due to the “curse of dimensionality” in large-scale fleet operations, combined with sparse feedback and stochastic mission profiles. To address these issues, this paper proposes Smart Commander, a novel Hierarchical Reinforcement Learning (HRL) framework designed to optimize sequential maintenance and logistics decisions. The framework decomposes the complex control problem into a two-tier hierarchy: a strategic General Commander manages fleet-level availability and cost objectives, while tactical Operation Commanders execute specific actions for sortie generation, maintenance scheduling, and resource allocation. The proposed approach is validated within a custom-built, high-fidelity discrete-event simulation environment that captures the dynamics of aircraft configuration and support logistics. By integrating layered reward shaping with planning-enhanced neural networks, the method effectively addresses the difficulty of sparse and delayed rewards. Empirical evaluations demonstrate that Smart Commander significantly outperforms conventional monolithic Deep Reinforcement Learning (DRL) and rule-based baselines. Notably, it achieves a substantial reduction in training time while demonstrating superior scalability and robustness in failure-prone environments. These results highlight the potential of HRL as a reliable paradigm for next-generation intelligent fleet management.

Keywords: Hierarchical Reinforcement Learning, Prognostics and Health Management, Fleet Management, Decision Optimization, Maintenance Scheduling, Resource Allocation, Discrete-event Simulation, Scalability

268. ❌ Amortized Filtering and Smoothing with Conditional Normalizing Flows

Authors: Tiangang Cui, Xiaodong Feng, Chenlong Pei, Xiaoliang Wan, Tao Zhou Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07169v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

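Scoring rationale: The paper studies Bayesian filtering and smoothing and proposes an amortized framework based on conditional normalizing flows (AFSF); it belongs to scientific computing and probabilistic inference. Its content is unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), because those keywords refer specifically to large language models (LLMs) and related techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper targets high-dimensional nonlinear dynamical systems in science and engineering, which falls under AI for Science in a broad sense. However, it does not involve bioinformatics or cheminformatics, and its core is a probabilistic method rather than a typical large-model application, so it receives 5 points (some relevance).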

!!! tip deepseek-chat TL;DR

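The paper addresses Bayesian filtering and smoothing for high-dimensional nonlinear dynamical systems. It proposes AFSF, a unified amortized framework based on conditional normalizing flows that accurately approximates filtering distributions and smoothing paths and supports extrapolation beyond the training horizon.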

摘要翻译

贝叶斯滤波与平滑处理高维非线性动力系统是科学与工程诸多领域基础且具挑战性的问题。本研究提出AFSF——一种基于条件归一化流的滤波与平滑统一化摊销框架。其核心思想是将每个观测历史编码为固定维度的摘要统计量，并利用这一共享表示同时学习用于滤波分布的前向流与用于后向转移核的反向流。具体而言，循环编码器将每个观测历史映射为固定维度的摘要统计量，其维度不依赖于时间序列的长度。以此共享摘要统计量为条件，前向流逼近滤波分布，反向流则逼近后向转移核。随后通过标准后向递归，将终端滤波分布与习得的反向流结合，即可恢复整个轨迹的平滑分布。通过学习潜在的时间演化结构，AFSF还支持超越训练时间范围的推演。此外，通过共享摘要统计量耦合两个流，AFSF在潜在状态轨迹间引入了隐式正则化，从而提升了轨迹层面的平滑效果。我们还开发了基于流的粒子滤波变体，该变体提供了另一种滤波流程，并能在显式模型因子可用时实现基于有效样本量（ESS）的诊断。数值实验表明，AFSF能够对滤波分布与平滑路径提供精确的逼近。

摘要 (Abstract)

Bayesian filtering and smoothing for high-dimensional nonlinear dynamical systems are fundamental yet challenging problems in many areas of science and engineering. In this work, we propose AFSF, a unified amortized framework for filtering and smoothing with conditional normalizing flows. The core idea is to encode each observation history into a fixed-dimensional summary statistic and use this shared representation to learn both a forward flow for the filtering distribution and a backward flow for the backward transition kernel. Specifically, a recurrent encoder maps each observation history to a fixed-dimensional summary statistic whose dimension does not depend on the length of the time series. Conditioned on this shared summary statistic, the forward flow approximates the filtering distribution, while the backward flow approximates the backward transition kernel. The smoothing distribution over an entire trajectory is then recovered by combining the terminal filtering distribution with the learned backward flow through the standard backward recursion. By learning the underlying temporal evolution structure, AFSF also supports extrapolation beyond the training horizon. Moreover, by coupling the two flows through shared summary statistics, AFSF induces an implicit regularization across latent state trajectories and improves trajectory-level smoothing. In addition, we develop a flow-based particle filtering variant that provides an alternative filtering procedure and enables ESS-based diagnostics when explicit model factors are available. Numerical experiments demonstrate that AFSF provides accurate approximations of both filtering distributions and smoothing paths.

Keywords: Bayesian filtering, smoothing, normalizing flows, high-dimensional dynamical systems, amortized inference, particle filtering, state estimation, time series analysis
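
The "standard backward recursion" mentioned in the abstract can be written out for a Markov state-space model. The indexing below (states \(x_0,\dots,x_T\), observations \(y_1,\dots,y_T\)) is an assumption made for illustration, since the abstract does not fix a notation. The ESS formula is the usual one for normalized particle weights \(\tilde w_i\).

```latex
% Smoothing distribution = terminal filtering distribution times backward kernels.
p(x_{0:T} \mid y_{1:T})
  = p(x_T \mid y_{1:T}) \prod_{t=0}^{T-1} p(x_t \mid x_{t+1},\, y_{1:t})

% Effective sample size for N particles with normalized weights \tilde w_i:
\mathrm{ESS} = \frac{1}{\sum_{i=1}^{N} \tilde w_i^{\,2}}
```

In the setting the abstract describes, the forward flow would stand in for \(p(x_T \mid y_{1:T})\) and the backward flow for each factor \(p(x_t \mid x_{t+1}, y_{1:t})\), so a smoothed trajectory can be drawn by sampling \(x_T\) first and then stepping backwards in time.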

269. ❌ SBBTS: A Unified Schrödinger-Bass Framework for Synthetic Financial Time Series

Authors: Alexandre Alouadi, Grégoire Loeper, Célian Marsala, Othmane Mazhar, Huyên Pham Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07159v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

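Scoring rationale: The paper focuses on financial time series generation and proposes a diffusion-model framework based on the Schrödinger-Bass Bridge that jointly calibrates drift and volatility. Although it is an application of machine learning to finance, its content has no direct connection to any of the scoring keywords, which all centre on large models, the technical principles of deep learning, and their applications. The paper does not touch on large models, language models, model training, alignment, inference optimization, agent systems, or AI for Science, so the relevance of every keyword is 0.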

!!! tip deepseek-chat TL;DR

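The paper proposes SBBTS, a unified framework for generating synthetic financial time series that reproduce both marginal distributions and temporal dynamics. Experiments show that it accurately recovers stochastic volatility and correlation parameters, and that using it for data augmentation on S&P 500 data improves downstream forecasting performance.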

摘要翻译

我们研究生成能同时复现边缘分布与时间动态的合成时间序列问题，这是金融机器学习中的核心挑战。现有方法通常难以联合建模漂移项与随机波动率：基于扩散的方法固定波动率，而鞅传输模型则忽略漂移项。我们提出用于时间序列的薛定谔-巴斯桥模型（Schrödinger-Bass Bridge for Time Series, SBBTS），这是一个将薛定谔-巴斯公式扩展至多步时间序列的统一框架。该方法构建了一个联合校准漂移项与波动率的扩散过程，并可分解为可处理的条件传输问题，从而实现高效学习。在赫斯顿模型（Heston model）上的数值实验表明，SBBTS能准确恢复先前薛定谔桥方法无法捕捉的随机波动率与相关参数。应用于标普500数据时，SBBTS生成的合成时间序列在用于数据增强时，持续提升下游预测性能，与仅使用真实数据训练相比，获得了更高的分类准确率和夏普比率（Sharpe ratio）。这些结果表明，SBBTS为金融应用中的真实时间序列生成与数据增强提供了一个实用且有效的框架。

摘要 (Abstract)

We study the problem of generating synthetic time series that reproduce both marginal distributions and temporal dynamics, a central challenge in financial machine learning. Existing approaches typically fail to jointly model drift and stochastic volatility, as diffusion-based methods fix the volatility while martingale transport models ignore drift. We introduce the Schrödinger-Bass Bridge for Time Series (SBBTS), a unified framework that extends the Schrödinger-Bass formulation to multi-step time series. The method constructs a diffusion process that jointly calibrates drift and volatility and admits a tractable decomposition into conditional transport problems, enabling efficient learning. Numerical experiments on the Heston model demonstrate that SBBTS accurately recovers stochastic volatility and correlation parameters that prior Schrödinger Bridge methods fail to capture. Applied to S&P 500 data, SBBTS-generated synthetic time series consistently improve downstream forecasting performance when used for data augmentation, yielding higher classification accuracy and Sharpe ratio compared to real-data-only training. These results show that SBBTS provides a practical and effective framework for realistic time series generation and data augmentation in financial applications.

Keywords: synthetic time series, financial machine learning, Schrödinger-Bass Bridge, drift and volatility calibration, data augmentation, Heston model, S&P 500, forecasting performance
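
For readers unfamiliar with the Heston benchmark mentioned in the abstract, the sketch below simulates one Heston path, where both a drift and a correlated stochastic volatility are present. It is not the SBBTS method. It uses a plain Euler scheme with full truncation of the variance, and every parameter value is an arbitrary illustration rather than a setting from the paper.

```python
import math
import random


def simulate_heston(s0, v0, mu, kappa, theta, xi, rho, dt, n_steps, rng):
    """Simulate (price, variance) pairs under the Heston model.

    dS = mu * S dt + sqrt(v) * S dW1
    dv = kappa * (theta - v) dt + xi * sqrt(v) dW2,  corr(dW1, dW2) = rho
    The price uses a log-Euler step; the variance uses Euler with full
    truncation (negative variance is treated as zero inside the update).
    """
    s, v = s0, v0
    path = [(s, v)]
    for _ in range(n_steps):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second normal draw with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        v_pos = max(v, 0.0)
        s *= math.exp((mu - 0.5 * v_pos) * dt + math.sqrt(v_pos * dt) * z1)
        v += kappa * (theta - v_pos) * dt + xi * math.sqrt(v_pos * dt) * z2
        path.append((s, v))
    return path


# One year of daily steps with illustrative parameters.
path = simulate_heston(
    s0=100.0, v0=0.04, mu=0.05, kappa=2.0, theta=0.04,
    xi=0.3, rho=-0.7, dt=1.0 / 252, n_steps=252, rng=random.Random(0),
)
print(len(path), path[-1])
```

A generator that fixes the volatility cannot reproduce the `xi` and `rho` behaviour of such paths, which is the gap the abstract says SBBTS addresses.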

270. ❌ Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees

Authors: Marek Gagolewski Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07143v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

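Scoring rationale: The paper "Lumbermark: Resistant Clustering by Chopping Up Mutual Reachability Minimum Spanning Trees" proposes a robust divisive clustering algorithm and focuses on data clustering (specifically, clustering based on the mutual reachability minimum spanning tree). It is entirely unrelated to deep learning, large models, and associated techniques such as pre-training, fine-tuning, inference optimization, alignment, and agent systems. All keywords concern large models or deep learning, whereas this paper is a study of clustering algorithms in classical machine learning, so every keyword scores 0.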

!!! tip deepseek-chat TL;DR

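The paper proposes Lumbermark, a robust divisive clustering algorithm that detects clusters of varying sizes, densities, and shapes by iteratively chopping off large limbs joined by protruding segments of the dataset's mutual reachability minimum spanning tree. It performs well on benchmark data.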

摘要翻译

本文介绍Lumbermark，一种能够检测不同尺寸、密度和形状簇的鲁棒分裂式聚类算法。Lumbermark通过迭代切割由数据集的互达距离最小生成树（mutual reachability minimum spanning tree）中突出片段连接的大型分支来实现聚类。互达距离的使用平滑了数据分布，并降低了低密度对象（如簇间噪声点或边缘离群值）的影响。该算法可视为HDBSCAN的替代方案，能够生成用户指定规模的划分结果。新方法的快速易用实现已发布于开源Python和R软件包’lumbermark’中。实验表明Lumbermark在基准数据上表现优异，我们期望该算法能为不同领域的数据科学家和实践者提供有效工具。

摘要 (Abstract)

We introduce Lumbermark, a robust divisive clustering algorithm capable of detecting clusters of varying sizes, densities, and shapes. Lumbermark iteratively chops off large limbs connected by protruding segments of a dataset’s mutual reachability minimum spanning tree. The use of mutual reachability distances smoothens the data distribution and decreases the influence of low-density objects, such as noise points between clusters or outliers at their peripheries. The algorithm can be viewed as an alternative to HDBSCAN that produces partitions with user-specified sizes. A fast, easy-to-use implementation of the new method is available in the open-source 'lumbermark' package for Python and R. We show that Lumbermark performs well on benchmark data and hope it will prove useful to data scientists and practitioners across different fields.

Keywords: clustering algorithm, mutual reachability, minimum spanning tree, robust clustering, divisive clustering, HDBSCAN alternative, varying cluster sizes, noise reduction
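
The structure Lumbermark operates on, a minimum spanning tree over mutual reachability distances, can be sketched in a few lines. The code below assumes the HDBSCAN-style definition d_M(i, j) = max(d(i, j), core_M(i), core_M(j)), where core_M is the distance to the M-th nearest neighbour. It does not use the `lumbermark` package and does not reproduce Lumbermark's cutting rule, which the abstract does not specify. The helper names and the toy data are hypothetical.

```python
import math


def mutual_reachability(points, m):
    """Pairwise mutual reachability distances (HDBSCAN-style definition)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # Core distance: distance to the m-th nearest neighbour. Index 0 of the
    # sorted row is the point itself (distance 0), so index m is the m-th other point.
    core = [sorted(row)[m] for row in d]
    return [
        [0.0 if i == j else max(d[i][j], core[i], core[j]) for j in range(n)]
        for i in range(n)
    ]


def prim_mst(dist):
    """Minimum spanning tree of a dense distance matrix (Prim's algorithm, O(n^2))."""
    n = len(dist)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Two well-separated groups of three points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
edges = prim_mst(mutual_reachability(points, m=1))
longest = max(edges, key=lambda e: e[2])
print(edges)
print(longest)
```

On this toy data the longest tree edge is the bridge between the two groups. A divisive method that removes edges from this tree splits the data there, and Lumbermark's contribution lies in how it chooses which edges to remove.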

271. ❌ A solver-in-the-loop framework for end-to-end differentiable coastal hydrodynamics

Authors: Elsa Cardoso-Bihlo, Alex Bihlo Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07129v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

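Scoring rationale: The paper "A solver-in-the-loop framework for end-to-end differentiable coastal hydrodynamics" focuses on scientific machine learning applied to coastal hydrodynamics, in particular solving forward simulation and inverse optimization problems with a differentiable solver (AegirJAX). It covers neural corrections, topology optimization, recurrent neural network training, and inverse problems, and is an application of AI in science (AI for Science), so it has some relevance to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). However, it does not involve large language models (LLMs), innovations in the technical principles of deep learning (such as MoE, scaling laws, or PEFT), reasoning methods (such as CoT or System 2 thinking), agent systems, model optimization techniques (such as quantization or speculative decoding), or the other listed large-model topics, so all other keywords score 0.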

!!! tip deepseek-chat TL;DR

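The paper proposes an end-to-end framework for forward simulation and inverse optimization in coastal hydrodynamics, built on the differentiable solver AegirJAX. It tackles the difficulty that conventional methods face on inverse problems, where deriving discrete adjoints is rigid and computationally costly, and demonstrates its versatility on wave-propagation correction, breakwater design, active wave cancellation, and bathymetry inversion.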

摘要翻译

波浪传播与爬高的数值模拟是海岸工程和海啸灾害评估的基石。然而，由于推导离散伴随模型的僵化性与高昂计算成本，将这些正演模型应用于反问题——如地形估计、震源反演和结构优化——仍然极为困难。本文介绍了AegirJAX，一个基于深度积分、非静水压浅水方程的全可微分水动力求解器。通过将求解器完全实现在反向模式自动微分框架内，AegirJAX将时间推进的物理循环视为连续的计算图。我们通过一系列科学机器学习任务展示了该框架的多功能性：（1）在高度频散的波浪传播中，针对模型误设发现特定区域（regime-specific）的神经校正项；（2）执行防波堤设计的连续拓扑优化；（3）在回路中训练循环神经网络以实现主动消波；（4）直接利用下游传感器数据反演隐藏地形与海底滑坡运动学。所提出的可微分范式从根本上模糊了正演模拟与反演优化之间的界限，为海岸水动力学提供了一个统一的端到端框架。

摘要 (Abstract)

Numerical simulation of wave propagation and run-up is a cornerstone of coastal engineering and tsunami hazard assessment. However, applying these forward models to inverse problems, such as bathymetry estimation, source inversion, and structural optimization, remains notoriously difficult due to the rigidity and high computational cost of deriving discrete adjoints. In this paper, we introduce AegirJAX, a fully differentiable hydrodynamic solver based on the depth-integrated, non-hydrostatic shallow-water equations. By implementing the solver entirely within a reverse-mode automatic differentiation framework, AegirJAX treats the time-marching physics loop as a continuous computational graph. We demonstrate the framework’s versatility across a suite of scientific machine learning tasks: (1) discovering regime-specific neural corrections for model misspecifications in highly dispersive wave propagation; (2) performing continuous topology optimization for breakwater design; (3) training recurrent neural networks in-the-loop for active wave cancellation; and (4) inverting hidden bathymetry and submarine landslide kinematics directly from downstream sensor data. The proposed differentiable paradigm fundamentally blurs the line between forward simulation and inverse optimization, offering a unified, end-to-end framework for coastal hydrodynamics.

Keywords: differentiable solver, coastal hydrodynamics, scientific machine learning, inverse problems, neural corrections, topology optimization, bathymetry inversion, AegirJAX

272. ❌ DDP-SA: Scalable Privacy-Preserving Federated Learning via Distributed Differential Privacy and Secure Aggregation

Authors: Wenjing Wei, Farid Nait-Abdesselam, Alla Jammine Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07125v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

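Scoring rationale: The paper focuses on privacy-preserving techniques for federated learning (differential privacy and secure aggregation) and does not touch on the areas covered by the keywords, such as large models, the technical principles of deep learning, or AI-for-science applications. Its content has no direct connection to any scoring keyword, so every keyword scores 0.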

!!! tip deepseek-chat TL;DR

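The paper proposes DDP-SA, a federated learning framework that combines distributed differential privacy with secure aggregation. While protecting client privacy, it achieves higher model accuracy than standalone local differential privacy and stronger privacy protection than MPC-only approaches.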

摘要翻译

本文提出DDP-SA，一种可扩展的隐私保护联邦学习框架，该框架联合利用客户端本地差分隐私（Local Differential Privacy, LDP）与全阈值加法秘密共享（Additive Secret Sharing, ASS）以实现安全聚合。与现有仅依赖差分隐私或安全多方计算（Secure Multi-Party Computation, MPC）的方法不同，DDP-SA融合了这两种技术，在保持计算可行性的同时，提供了更强的端到端隐私保障。该框架引入了一种两阶段保护机制：客户端首先使用校准的拉普拉斯噪声扰动其本地梯度，随后将加噪梯度分解为加法秘密份额并分发至多个中间服务器。这一设计确保：（1）任何单一被攻陷的服务器或通信信道均无法泄露关于个体客户端更新的信息；（2）参数服务器仅能重构聚合后的加噪梯度，而无法获知任何特定客户端的贡献。大量实验表明，DDP-SA在提供比纯MPC方法更强隐私保护的同时，实现了比独立LDP方法显著更高的模型精度。所提出的框架参与方数量呈线性扩展，并为联邦学习应用提供了一个计算与通信开销可控的实用化隐私保护解决方案。

摘要 (Abstract)

This article presents DDP-SA, a scalable privacy-preserving federated learning framework that jointly leverages client-side local differential privacy (LDP) and full-threshold additive secret sharing (ASS) for secure aggregation. Unlike existing methods that rely solely on differential privacy or on secure multi-party computation (MPC), DDP-SA integrates both techniques to deliver stronger end-to-end privacy guarantees while remaining computationally practical. The framework introduces a two-stage protection mechanism: clients first perturb their local gradients with calibrated Laplace noise, then decompose the noisy gradients into additive secret shares that are distributed across multiple intermediate servers. This design ensures that (i) no single compromised server or communication channel can reveal any information about individual client updates, and (ii) the parameter server reconstructs only the aggregated noisy gradient, never any client-specific contribution. Extensive experiments show that DDP-SA achieves substantially higher model accuracy than standalone LDP while providing stronger privacy protection than MPC-only approaches. The proposed framework scales linearly with the number of participants and offers a practical, privacy-preserving solution for federated learning applications with controllable computational and communication overhead.

Keywords: Federated Learning, Privacy-Preserving, Differential Privacy, Secure Aggregation, Distributed Differential Privacy, Local Differential Privacy, Additive Secret Sharing, Scalable Framework
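
The two-stage mechanism in the abstract (Laplace perturbation on the client, then additive secret sharing across intermediate servers) can be sketched for a single gradient coordinate. This is a toy illustration under assumptions, not the DDP-SA implementation. The modulus, the fixed-point scale, the function names, and the parameter values are all hypothetical, and `random.Random` is used only for reproducibility where a real system would need a cryptographically secure generator.

```python
import random

MODULUS = 2**61 - 1   # shares live in the integers modulo this prime (assumed choice)
SCALE = 10**6         # fixed-point scale for encoding real values (assumed choice)


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def encode(x):
    """Map a real number to a fixed-point residue modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(v):
    """Inverse of encode; residues above MODULUS // 2 represent negative values."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def make_shares(value, n_servers, rng):
    """Split value into n_servers additive shares that sum to it modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def run_round(gradients, epsilon, sensitivity, n_servers, rng):
    # Stage 1: each client perturbs its gradient with Laplace(sensitivity / epsilon).
    noisy = [g + laplace_noise(sensitivity / epsilon, rng) for g in gradients]
    # Stage 2: each client sends one share to every intermediate server,
    # and each server only ever sees uniformly random-looking residues.
    server_totals = [0] * n_servers
    for value in noisy:
        for k, share in enumerate(make_shares(encode(value), n_servers, rng)):
            server_totals[k] = (server_totals[k] + share) % MODULUS
    # The parameter server adds the per-server totals and recovers only the sum.
    aggregate = decode(sum(server_totals) % MODULUS)
    return noisy, aggregate


rng = random.Random(0)
noisy, aggregate = run_round(
    [0.12, -0.40, 0.05], epsilon=1.0, sensitivity=0.1, n_servers=3, rng=rng
)
print(aggregate, sum(noisy))
```

All shares of a value are needed to reconstruct it, which matches the "full-threshold" wording in the abstract. The small difference between `aggregate` and `sum(noisy)` comes from fixed-point rounding.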

273. ❌ Are Stochastic Multi-objective Bandits Harder than Single-objective Bandits?

Authors: Changkun Guan, Mengfan Xu Journal/source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07096v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

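Scoring rationale: The paper studies multi-objective bandits, a topic in classical reinforcement learning and online learning, and concentrates on theoretical analysis and algorithm design. It does not involve large models, deep learning, AI for Science, or any of the technical concepts in the scoring keywords. All keywords relate to large-model principles, training methods, inference optimization, or application areas, whereas this paper is purely theoretical computer science and operations research, so the relevance of every keyword is 0.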

!!! tip deepseek-chat TL;DR

该论文研究了随机多目标赌博机是否比单目标赌博机更难优化的问题，证明了在随机设置下帕累托遗憾由最大次优间隙控制，并提出了一个达到最优帕累托遗憾阶的算法。

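Abstract translation

Multi-objective bandits, in which the reward of each arm is a multi-dimensional vector rather than a scalar, have drawn growing attention for their broad applicability and mathematical elegance. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in the field is whether the added complexity makes performance optimization fundamentally harder. A surprising recent result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; in the stochastic setting, however, regret is defined differently and the picture remains unclear. Indeed, existing work suggests that Pareto regret in the stochastic case grows with the dimension. This controversial yet subtle phenomenon leads to our central question: are multi-objective bandits really harder to optimize than single-objective ones? We answer the question in full by proving that, in the stochastic setting, Pareto regret is in fact dominated by the maximum sub-optimality gap $g^\dagger$, and hence its minimum marginal regret is of order $\Omega(\frac{K\log T}{g^\dagger})$. We further propose a new algorithm whose Pareto regret is of order $O(\frac{K\log T}{g^\dagger})$ and is therefore optimal. The algorithm uses upper and lower confidence bound estimators to perform nested two-layer uncertainty quantification over both arms and objectives. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection, so that the two components jointly balance exploration and exploitation across the two layers. We also run comprehensive numerical experiments to validate the proposed algorithm; the results show the expected regret guarantee and significant gains over benchmark methods.

Abstract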

Multi-objective bandits have attracted increasing attention because of their broad applicability and mathematical elegance, where the reward of each arm is a multi-dimensional vector rather than a scalar. This naturally introduces Pareto order relations and Pareto regret. A long-standing question in this area is whether performance is fundamentally harder to optimize because of this added complexity. A recent surprising result shows that, in the adversarial setting, Pareto regret is no larger than classical regret; however, in the stochastic setting, where the regret notion is different, the picture remains unclear. In fact, existing work suggests that Pareto regret in the stochastic case increases with the dimensionality. This controversial yet subtle phenomenon motivates our central question: are multi-objective bandits actually harder than single-objective ones? We answer this question in full by showing that, in the stochastic setting, Pareto regret is in fact governed by the maximum sub-optimality gap $g^\dagger$, and hence by the minimum marginal regret of order $\Omega(\frac{K\log T}{g^\dagger})$. We further develop a new algorithm that achieves Pareto regret of order $O(\frac{K\log T}{g^\dagger})$, and is therefore optimal. The algorithm leverages a nested two-layer uncertainty quantification over both arms and objectives through upper and lower confidence bound estimators. It combines a top-two racing strategy for arm selection with an uncertainty-greedy rule for dimension selection. Together, these components balance exploration and exploitation across the two layers. We also conduct comprehensive numerical experiments to validate the proposed algorithm, showing the desired regret guarantee and significant gains over benchmark methods.
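The Pareto order that the abstract refers to can be made concrete with a small sketch. This is only an illustration of Pareto dominance over mean reward vectors, not the paper's algorithm; the function names and the four-arm instance are hypothetical.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """True if u Pareto-dominates v: no worse in every objective, strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(means: List[Sequence[float]]) -> List[int]:
    """Indices of arms whose mean reward vector is not dominated by any other arm."""
    return [
        i
        for i, u in enumerate(means)
        if not any(dominates(v, u) for j, v in enumerate(means) if j != i)
    ]


# Hypothetical 4-arm, 2-objective instance (made-up numbers).
means = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
front = pareto_front(means)
print(front)
```

In this toy instance the last arm is dominated by the second one, while the first three arms are mutually incomparable, which is why a single "best arm" is replaced by a Pareto front.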

Keywords: multi-objective bandits, Pareto regret, stochastic setting, optimal algorithm, upper confidence bound, exploration-exploitation, theoretical analysis, regret minimization

274. ❌ Mining Electronic Health Records to Investigate Effectiveness of Ensemble Deep Clustering

Authors: Manar D. Samad, Yina Hou, Shrabani Ghosh Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07085v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

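Scoring rationale: The paper focuses on cluster analysis of electronic health records (EHRs), specifically for heart failure patients, and proposes an ensemble deep clustering method. The work is an application of AI in healthcare informatics and is somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics' (rated 5) because it involves biomedical data analysis. However, the paper does not cover large language models (LLMs), innovations in deep learning principles (such as MoE, Scaling Laws, or PEFT), reasoning methods (such as CoT or System 2), agent systems, model optimization (such as Quantization or Speculative Decoding), or any other keyword. All other keywords are rated 0 because the core of the paper is the combination of traditional and deep clustering methods rather than large models or related new techniques.

!!! tip deepseek-chat TL;DR

Using real electronic health record data, this study proposes a new framework that ensembles traditional and deep clustering methods to distinguish subtypes of heart failure patients, and it achieves the best overall performance among 14 clustering methods.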

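Abstract translation

In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks for elucidating pathophysiology and supporting clinical decision-making. However, clustering in healthcare informatics still relies mainly on traditional methods (especially K-means), and has had limited success when these are applied as hybrid methods to embeddings learned by autoencoders. Using real EHR data from the All of Us Research Program, this study evaluates the clustering performance of traditional, hybrid, and deep learning methods on heart failure patient cohorts. Traditional clustering methods perform robustly, because deep learning methods are designed mainly for image clustering, which differs substantially from the tabular EHR setting. To make up for the shortcomings of deep clustering, we propose an ensemble-based deep clustering method that aggregates cluster assignments obtained from multiple embedding dimensions instead of relying on a single fixed embedding space. When combined with traditional clustering methods in a novel ensemble framework, the proposed ensemble embedding scheme for deep clustering achieves the best overall performance ranking across 14 different clustering methods and multiple patient cohorts. The paper stresses the importance of biological sex-specific clustering of EHR data and demonstrates the advantage of combining traditional and deep clustering methods over any single method.

Abstract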

In electronic health records (EHRs), clustering patients and distinguishing disease subtypes are key tasks to elucidate pathophysiology and aid clinical decision-making. However, clustering in healthcare informatics is still based on traditional methods, especially K-means, and has achieved limited success when applied to embedding representations learned by autoencoders as hybrid methods. This paper investigates the effectiveness of traditional, hybrid, and deep learning methods in heart failure patient cohorts using real EHR data from the All of Us Research Program. Traditional clustering methods perform robustly because deep learning approaches are specifically designed for image clustering, a task that differs substantially from the tabular EHR data setting. To address the shortcomings of deep clustering, we introduce an ensemble-based deep clustering approach that aggregates cluster assignments obtained from multiple embedding dimensions, rather than relying on a single fixed embedding space. When combined with traditional clustering in a novel ensemble framework, the proposed ensemble embedding for deep clustering delivers the best overall performance ranking across 14 diverse clustering methods and multiple patient cohorts. This paper underscores the importance of biological sex-specific clustering of EHR data and the advantages of combining traditional and deep clustering approaches over a single method.
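One common way to aggregate cluster assignments from several runs is a co-association (evidence accumulation) consensus. The sketch below assumes that approach purely for illustration; the abstract does not say which aggregation rule the authors use, and the function names, threshold, and label vectors are hypothetical.

```python
from typing import List


def co_association(labelings: List[List[int]]) -> List[List[int]]:
    """Count, for every pair of samples, how many labelings put them in the same cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return counts


def consensus_labels(labelings: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Link samples that co-cluster in more than `threshold` of the labelings,
    then return the connected components as consensus cluster labels."""
    counts = co_association(labelings)
    n = len(counts)
    needed = threshold * len(labelings)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and counts[i][j] > needed:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical assignments from three embedding sizes (made-up labels).
labelings = [
    [0, 0, 0, 1, 1, 1],  # e.g. clustering on an 8-dimensional embedding
    [1, 1, 0, 0, 0, 0],  # e.g. clustering on a 16-dimensional embedding
    [0, 0, 0, 1, 1, 1],  # e.g. clustering on a 32-dimensional embedding
]
consensus = consensus_labels(labelings)
print(consensus)
```

The consensus follows the pairing that most runs agree on, so a single embedding size that splits the data differently does not decide the final grouping.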

Keywords: electronic health records, deep clustering, ensemble methods, heart failure, patient clustering, EHR data, clustering algorithms, bioinformatics

275. ❌ Epistemic Robust Offline Reinforcement Learning

Authors: Abhilash Reddy Chenreddy, Erick Delage Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07072v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

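Scoring rationale: The paper focuses on epistemic uncertainty in offline reinforcement learning (offline RL), proposes a robust framework based on uncertainty sets as a replacement for conventional ensemble methods, and introduces a new benchmark. All keywords relate directly to large language models (LLMs), deep learning principles, or AI applications in science, whereas the core of this paper is robust optimization and uncertainty quantification in reinforcement learning (offline RL in particular); it does not cover LLMs, deep learning architectures, training techniques, inference optimization, AI agents, or scientific AI applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address the epistemic uncertainty caused by limited or biased data coverage in offline reinforcement learning, this paper proposes a new framework that replaces discrete ensembles with compact uncertainty sets, and it achieves better robustness and generalization than ensemble-based methods in tabular and continuous state domains.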

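Abstract translation

Offline reinforcement learning learns policies from a fixed dataset without further interaction with the environment. The core challenge in this setting is epistemic uncertainty, which arises from limited or biased data coverage, especially when the behavior policy systematically avoids certain actions. Epistemic uncertainty can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods such as SAC-N mitigate the problem by conservatively estimating Q-values with the ensemble minimum, but they require large ensembles and often conflate epistemic and aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet-based model that directly shapes the uncertainty sets to optimize cumulative reward under the robust Bellman objective, without relying on ensembles. We also propose a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and show that our method achieves stronger robustness and generalization than ensemble-based baselines in both tabular and continuous state domains.

Abstract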

Offline reinforcement learning learns policies from fixed datasets without further environment interaction. A key challenge in this setting is epistemic uncertainty, arising from limited or biased data coverage, particularly when the behavior policy systematically avoids certain actions. This can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods like SAC-N mitigate this by conservatively estimating Q-values using the ensemble minimum, but they require large ensembles and often conflate epistemic with aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet based model that directly shapes the uncertainty sets to optimize the cumulative reward under the robust Bellman objective without relying on ensembles. We also introduce a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and demonstrate that our method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains.
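The difference between an ensemble-minimum target and a worst case over a compact set can be sketched in a few lines. The interval set below (center plus or minus a radius) is a hypothetical stand-in chosen for simplicity; it is not the paper's Epinet-based construction, and all numbers are made up.

```python
from statistics import mean, pstdev
from typing import Sequence


def ensemble_min_target(reward: float, gamma: float, next_q_values: Sequence[float]) -> float:
    """SAC-N style pessimistic target: bootstrap from the smallest ensemble member."""
    return reward + gamma * min(next_q_values)


def interval_set_target(reward: float, gamma: float, center: float, radius: float) -> float:
    """Robust target over the interval [center - radius, center + radius]:
    the worst case of a scalar Q-value in that set is its lower end."""
    return reward + gamma * (center - radius)


# Hypothetical next-state Q-value estimates from four ensemble members.
next_q = [1.0, 1.4, 0.8, 1.2]
y_min = ensemble_min_target(0.5, 0.9, next_q)

# Assumption: summarize the same estimates by mean and one standard deviation.
center, radius = mean(next_q), pstdev(next_q)
y_set = interval_set_target(0.5, 0.9, center, radius)
print(y_min, y_set)
```

With these numbers the interval target is less pessimistic than the ensemble minimum, since a single low ensemble member no longer determines the whole target.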

Keywords: Offline Reinforcement Learning, Epistemic Uncertainty, Robust Optimization, Uncertainty Sets, Generalization, Behavior Policy, Benchmark Evaluation, Q-value Estimation

276. ❌ Controller Design for Structured State-space Models via Contraction Theory

Authors: Muhammad Zakwan, Vaibhav Gupta, Alireza Karimi, Efe C. Balta, Giancarlo Ferrari-Trecate Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

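Scoring rationale: The paper focuses on applying structured state-space models (SSMs) to controller design for nonlinear systems, which belongs to control theory and system identification. All scoring keywords relate to large language models, deep learning principles, or scientific applications of AI, whereas this paper does not involve any large model, deep learning, scientific AI application, or related technique (such as fine-tuning, alignment, or inference acceleration), so every keyword is rated 0.

!!! tip deepseek-chat TL;DR

This paper proposes an indirect data-driven output feedback controller design method based on structured state-space models, achieves scalable control design for nonlinear systems through contraction theory and linear matrix inequalities, and establishes a separation principle for SSMs.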

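Abstract translation

This paper proposes an indirect data-driven output feedback controller synthesis method for nonlinear systems that uses structured state-space models as surrogate models. Structured state-space models have become a compelling alternative for modelling time-series data and dynamical systems. Compared with the quadratic computational complexity of Transformer-based architectures, they can capture long-term dependencies while keeping computational complexity linear in the sequence length. The contributions of this work are threefold. We give the first analysis of the controllability and observability of structured state-space models, which enables scalable control design through linear matrix inequalities combined with contraction theory. In addition, the paper establishes a separation principle for structured state-space models, so that the observer and the state-feedback controller can be designed independently while exponential stability of the closed-loop system is guaranteed. The effectiveness of the proposed framework is shown through a numerical example that demonstrates nonlinear system identification and the synthesis of an output feedback controller.

Abstract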

This paper presents an indirect data-driven output feedback controller synthesis for nonlinear systems, leveraging Structured State-space Models (SSMs) as surrogate models. SSMs have emerged as a compelling alternative in modelling time-series data and dynamical systems. They can capture long-term dependencies while maintaining linear computational complexity with respect to the sequence length, in comparison to the quadratic complexity of Transformer-based architectures. The contributions of this work are threefold. We provide the first analysis of controllability and observability of SSMs, which leads to scalable control design via Linear Matrix Inequalities (LMIs) that leverage contraction theory. Moreover, a separation principle for SSMs is established, enabling the independent design of observers and state-feedback controllers while preserving the exponential stability of the closed-loop system. The effectiveness of the proposed framework is demonstrated through a numerical example, showcasing nonlinear system identification and the synthesis of an output feedback controller.
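The linear-in-sequence-length cost and the idea of contraction can both be seen in a toy diagonal linear recurrence. This is a deliberately simplified sketch: practical SSM layers add discretization and nonlinearities, and the paper's LMI conditions are not reproduced here. The parameter values are hypothetical.

```python
from typing import List, Sequence, Tuple


def ssm_scan(
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    inputs: Sequence[float],
    x0: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run x_{k+1} = a * x_k + b * u_k (elementwise) with y_k = sum(c * x_k).
    A single pass over the inputs, so the cost grows linearly with len(inputs)."""
    x = list(x0)
    outputs = []
    for u in inputs:
        outputs.append(sum(ci * xi for ci, xi in zip(c, x)))
        x = [ai * xi + bi * u for ai, xi, bi in zip(a, x, b)]
    return outputs, x


# Hypothetical 2-state model; |a_i| < 1 makes this linear recurrence contracting.
a, b, c = [0.9, -0.5], [1.0, 0.5], [1.0, 1.0]
inputs = [1.0] * 50

# Two different initial states driven by the same input sequence.
outputs_a, x_a = ssm_scan(a, b, c, inputs, [5.0, -3.0])
outputs_b, x_b = ssm_scan(a, b, c, inputs, [-2.0, 4.0])
gap = max(abs(p - q) for p, q in zip(x_a, x_b))
print(gap)
```

Because every diagonal entry has magnitude below one, the two trajectories forget their initial states at a geometric rate, which is the linear special case of the exponential convergence that contraction theory certifies.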

Keywords: Structured State-space Models, Controller Design, Contraction Theory, Nonlinear Systems, Output Feedback, Linear Matrix Inequalities, System Identification

277. ❌ AdaBoost Does Not Always Cycle: A Computer-Assisted Counterexample

Authors: Erik Y. Wang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07055v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

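Scoring rationale: The paper studies the convergence of the AdaBoost algorithm, which belongs to traditional machine learning theory and is unrelated to all scoring keywords (which revolve around large models, deep learning techniques, and their applications). Although the paper mentions that it was developed in collaboration with GPT-5.4 Pro and Claude Opus 4.6, its content does not concern the technical principles, applications, or innovations of these large models.

!!! tip deepseek-chat TL;DR

Through a computer-assisted counterexample, this paper proves that exhaustive AdaBoost does not necessarily converge to a finite cycle, resolving the open problem posed by Rudin et al. in 2012.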

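Abstract translation

We construct, with computer assistance, a counterexample that answers the open question posed by Rudin, Schapire, and Daubechies at COLT 2012: whether exhaustive AdaBoost always converges to a finite cycle. The construction is based on a block-product structure whose two factors have 5-step branch maps that share an exact period-2 orbit, but whose linearized return maps have dominant eigenvalues with an irrational logarithmic ratio. This irrationality forces the burst-winner sequence to have an irrational asymptotic frequency, which rules out eventual periodicity. All conclusions are verified with exact rational arithmetic. This work was carried out in collaboration with GPT-5.4 Pro and Claude Opus 4.6.

Abstract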

We give a computer-assisted counterexample to the open question, posed by Rudin, Schapire, and Daubechies in COLT 2012, of whether exhaustive AdaBoost always converges to a finite cycle. The construction is based on a block-product gadget whose two factors share an exact period-2 orbit for their 5-step branch maps, but whose linearized return maps have dominant eigenvalues with an irrational logarithmic ratio. This irrationality forces the burst-winner sequence to have an irrational asymptotic frequency, precluding eventual periodicity. All assertions are certified by exact rational arithmetic. This work was developed in collaboration with GPT-5.4 Pro and Claude Opus 4.6.
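For readers unfamiliar with the dynamical-systems view of AdaBoost, the weight update can be iterated with exact rational arithmetic in a few lines. This sketch shows only the standard update on a toy 3×3 mistake matrix; it is not the paper's block-product counterexample, and the tie-breaking rule (lowest index) is an assumption.

```python
from fractions import Fraction
from typing import List, Tuple


def adaboost_step(
    mistakes: List[List[bool]], weights: List[Fraction]
) -> Tuple[int, List[Fraction]]:
    """One round of exhaustive AdaBoost on a fixed mistake matrix.
    mistakes[i][j] is True if hypothesis j misclassifies example i.
    Picks the hypothesis with the smallest weighted error (ties: lowest index),
    then reweights so that this hypothesis has error exactly 1/2."""
    n_hyp = len(mistakes[0])
    errors = [
        sum((w for w, row in zip(weights, mistakes) if row[j]), Fraction(0))
        for j in range(n_hyp)
    ]
    j_star = min(range(n_hyp), key=lambda j: errors[j])
    eps = errors[j_star]
    if not 0 < eps < Fraction(1, 2):
        raise ValueError("update needs a weighted error strictly between 0 and 1/2")
    new_weights = [
        w / (2 * eps) if row[j_star] else w / (2 * (1 - eps))
        for w, row in zip(weights, mistakes)
    ]
    return j_star, new_weights


# Toy instance: hypothesis j misclassifies exactly example j.
mistakes = [
    [True, False, False],
    [False, True, False],
    [False, False, True],
]
weights = [Fraction(1, 3)] * 3
choices = []
for _ in range(3):
    j, weights = adaboost_step(mistakes, weights)
    choices.append(j)
print(choices, weights)
```

Using `Fraction` keeps every weight exact, so properties such as "the weights sum to one" can be checked with equality rather than a floating-point tolerance.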

Keywords: AdaBoost, counterexample, convergence, periodicity, computer-assisted proof, machine learning theory, exact rational arithmetic

278. ❌ Production-Ready Automated ECU Calibration using Residual Reinforcement Learning

Authors: Andreas Kampmeier, Kevin Badalian, Lucas Koch, Sung-Yong Lee, Jakob Andert Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

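Scoring rationale: The paper studies the automated calibration of automotive electronic control units (ECUs) using residual reinforcement learning, an application of traditional reinforcement learning to industrial control. It does not involve any large language model (LLM), innovation in deep learning principles, or application of large models to other domains. All keywords relate to large models, deep learning techniques, or AI for Science (bioinformatics, cheminformatics, and so on), whereas this paper applies traditional reinforcement learning to automotive engineering and is essentially unrelated to them. Only 'Explainable AI' receives 5 points (some relevance) because the paper emphasizes that its approach is explainable, and 'AI for Science' receives 5 points (some relevance) because the paper is an application of AI in engineering.

!!! tip deepseek-chat TL;DR

This paper proposes an explainable method based on residual reinforcement learning for automating the calibration of automotive electronic control units; experiments show that on a hardware-in-the-loop platform the method converges quickly to a calibration close to the reference, significantly reducing time and manual intervention.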

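Abstract translation

Electronic control units (ECUs) have played a key role in turning the cars of the past into the modern vehicles on today's roads. They actively regulate the actuation of individual components and thereby determine the characteristics of the whole system. In this process, the behavior of the control functions depends heavily on their calibration parameters, which have traditionally been designed by hand by engineers. This process faces rising customer expectations and ever shorter product development cycles. At the same time, legislative requirements are increasing and emission standards are becoming stricter. Given the number of vehicle variants on top of this, the conventional approach is gradually losing its practical and financial viability. Prior work has shown that reinforcement learning (RL) can automatically develop optimal control functions; however, because the resulting functions are represented by artificial neural networks, they lack explainability, which makes them hard to use in production vehicles. This paper proposes an explainable automated calibration method that uses residual reinforcement learning (residual RL) and follows established automotive development principles. We validate its applicability with a map-based air path controller in a series control unit on a hardware-in-the-loop (HiL) platform. Starting from a sub-optimal map, the proposed method converges quickly to a calibration that closely matches the reference calibration in the series ECU. The results show that the method is suitable for industrial use and yields better calibrations in significantly less time with almost no human intervention.

Abstract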

Electronic Control Units (ECUs) have played a pivotal role in transforming motorcars of yore into the modern vehicles we see on our roads today. They actively regulate the actuation of individual components and thus determine the characteristics of the whole system. In this, the behavior of the control functions heavily depends on their calibration parameters which engineers traditionally design by hand. This is taking place in an environment of rising customer expectations and steadily shorter product development cycles. At the same time, legislative requirements are increasing while emission standards are getting stricter. Considering the number of vehicle variants on top of all that, the conventional method is losing its practical and financial viability. Prior work has already demonstrated that optimal control functions can be automatically developed with reinforcement learning (RL); since the resulting functions are represented by artificial neural networks, they lack explainability, a circumstance which renders them challenging to employ in production vehicles. In this article, we present an explainable approach to automating the calibration process using residual RL which follows established automotive development principles. Its applicability is demonstrated by means of a map-based air path controller in a series control unit using a hardware-in-the-loop (HiL) platform. Starting with a sub-optimal map, the proposed methodology quickly converges to a calibration which closely resembles the reference in the series ECU. The results prove that the approach is suitable for the industry where it leads to better calibrations in significantly less time and requires virtually no human intervention.
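The general shape of a residual scheme on top of a map-based controller can be sketched as "base map output plus a bounded correction". The sketch below assumes the correction is itself stored as a map over the same breakpoints and clipped to a fixed bound; both are assumptions made for illustration, the learning step is omitted entirely, and the breakpoints and values are made up.

```python
from bisect import bisect_right
from typing import Sequence


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear lookup in a 1-D map, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    k = bisect_right(xs, x) - 1
    t = (x - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + t * (ys[k + 1] - ys[k])


def residual_action(
    xs: Sequence[float],
    base_map: Sequence[float],
    residual_map: Sequence[float],
    x: float,
    max_residual: float,
) -> float:
    """Controller output = base map lookup + clipped residual correction."""
    correction = interpolate(xs, residual_map, x)
    correction = max(-max_residual, min(max_residual, correction))
    return interpolate(xs, base_map, x) + correction


# Hypothetical breakpoints (e.g. engine speed) and map values.
xs = [1000.0, 2000.0, 3000.0]
base_map = [10.0, 20.0, 30.0]      # sub-optimal starting calibration
residual_map = [0.5, -1.0, 5.0]    # stand-in for learned corrections

mid = residual_action(xs, base_map, residual_map, 1500.0, max_residual=2.0)
top = residual_action(xs, base_map, residual_map, 3000.0, max_residual=2.0)
print(mid, top)
```

Keeping the correction in map form means the final calibration can still be inspected cell by cell, which is one plausible reading of why a residual approach is easier to explain than a neural network controller.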

Keywords: Residual Reinforcement Learning, ECU Calibration, Automotive Control, Explainable AI, Hardware-in-the-Loop, Air Path Controller, Production Vehicles, Automated Calibration

279. ❌ MoE Routing Testbed: Studying Expert Specialization and Routing Behavior at Small Scale

Authors: Tobias Falke, Nicolas Anastassacos, Samson Tan, Chankrisna Richy Meas, Chandana Satya Prakash, Nitesh Sekhar, M Saiful Bari, Krishna Kompella, Gamaleldin F. Elsayed Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	15.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

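Scoring rationale: The paper's core topic is the routing behavior of sparse Mixture-of-Experts (MoE) architectures, which is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (15 points), and it is explicitly applied to large language models (LLMs), which is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). Other keywords such as SLMs, Scaling Laws, Pre-training, Fine-tuning, Alignment, RAG, Reasoning, Agents, Compression, and AI for Science are not covered in the paper and are rated 0.

!!! tip deepseek-chat TL;DR

To address the difficulty of assessing expert specialization in sparse Mixture-of-Experts (MoE) models, which arises from routing complexity during training, this paper proposes the MoE Routing Testbed. Through quantifiable measurement it finds that balancing scope is the key factor for achieving expert specialization while maintaining high utilization, and that this conclusion generalizes to larger models.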

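Abstract translation

Sparse Mixture-of-Experts architectures are increasingly popular in frontier large language models, but they introduce training challenges because of routing complexity. Making full use of the parameters of an MoE model requires all experts to be well trained and to specialize in non-redundant ways. Assessing this is complicated by the lack of established metrics; more importantly, many routing techniques perform similarly at smaller scales, which often does not reflect their true behavior at large scale. To address this challenge, we propose the MoE Routing Testbed, which uses realistic data to give a clearer view of routing dynamics at small scale. The testbed pairs a data mix with clearly distinguishable domains with a reference router that prescribes ideal routing based on these domains, providing a well-defined upper bound for comparison. This makes quantitative measurement of expert specialization possible. To show the value of the testbed, we compare several MoE routing approaches and show that balancing scope is the key factor that allows experts to specialize while maintaining high utilization. We confirm that this observation generalizes to models 35 times larger.

Abstract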

Sparse Mixture-of-Experts (MoE) architectures are increasingly popular for frontier large language models (LLM) but they introduce training challenges due to routing complexity. Fully leveraging parameters of an MoE model requires all experts to be well-trained and to specialize in non-redundant ways. Assessing this, however, is complicated due to lack of established metrics and, importantly, many routing techniques exhibit similar performance at smaller sizes, which is often not reflective of their behavior at large scale. To address this challenge, we propose the MoE Routing Testbed, a setup that gives clearer visibility into routing dynamics at small scale while using realistic data. The testbed pairs a data mix with clearly distinguishable domains with a reference router that prescribes ideal routing based on these domains, providing a well-defined upper bound for comparison. This enables quantifiable measurement of expert specialization. To demonstrate the value of the testbed, we compare various MoE routing approaches and show that balancing scope is the crucial factor that allows specialization while maintaining high expert utilization. We confirm that this observation generalizes to models 35x larger.
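To make "expert utilization" and "comparison against a reference router" tangible, here is a minimal sketch. The two metrics below are illustrative choices, not the measurements defined in the paper, and the router logits, domain names, and domain-to-expert mapping are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def top_k_experts(logits: Sequence[float], k: int) -> List[int]:
    """Indices of the k experts with the largest router logits, best first."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]


def utilization(assignments: List[List[int]], num_experts: int) -> List[float]:
    """Share of all routed token slots that each expert receives."""
    counts = Counter(e for experts in assignments for e in experts)
    total = sum(counts.values())
    return [counts[e] / total for e in range(num_experts)]


def reference_agreement(
    assignments: List[List[int]], domains: List[str], reference: Dict[str, int]
) -> float:
    """Fraction of tokens whose top-1 expert equals the expert that a
    domain-based reference router would pick."""
    hits = sum(1 for experts, d in zip(assignments, domains) if experts[0] == reference[d])
    return hits / len(domains)


# Hypothetical router logits for four tokens over three experts.
token_logits = [[2.0, 0.1, 0.3], [1.5, 0.2, 0.1], [0.1, 1.9, 0.4], [0.2, 0.3, 2.2]]
domains = ["code", "code", "math", "math"]
reference = {"code": 0, "math": 1}  # assumed ideal domain-to-expert mapping

assignments = [top_k_experts(logits, k=1) for logits in token_logits]
usage = utilization(assignments, num_experts=3)
agreement = reference_agreement(assignments, domains, reference)
print(assignments, usage, agreement)
```

A router can score well on one metric and poorly on the other: perfectly even usage says nothing about whether tokens from the same domain land on the same expert, which is why a reference routing is useful as a yardstick.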

Keywords: Mixture-of-Experts, MoE, routing, expert specialization, large language models, testbed, scaling, utilization

280. ❌ Learning to Query History: Nonstationary Classification via Learned Retrieval

Authors: Jimmy Gammell, Bishal Thapaliya, Yoon Jung, Riyasat Ohib, Bilel Fehri, Deepayan Chakrabarti Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07027v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

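Scoring rationale: The paper studies nonstationary classification, proposes making classifiers more robust to distribution shift by retrieving historical labeled examples, and introduces a learned discrete retrieval mechanism. Its content mainly concerns time series prediction, retrieval mechanisms, and classifier training, but does not cover large models, deep learning principles, or AI applications in science. All keywords relate to large models, deep learning techniques, or scientific AI applications, whereas this paper focuses on a traditional machine learning classification problem, so every keyword is rated 0.

!!! tip deepseek-chat TL;DR

This paper proposes making classifiers more robust to distribution shift in nonstationary data by retrieving historical labeled examples; experiments show that the method outperforms standard classifiers on synthetic benchmarks and the Amazon Reviews dataset.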

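Abstract translation

Nonstationarity is pervasive in practical classification settings, so deployed models can perform poorly even when they generalize well to held-out test sets at training time. To address this, we recast nonstationary classification as a time series prediction task: rather than predicting from the current input alone, the classifier is conditioned on a sequence of historical labeled examples that extends beyond the training cutoff. To scale to large sequences, we introduce a learnable discrete retrieval mechanism that samples relevant historical examples through input-dependent queries and is trained end-to-end with the classifier using a score-based gradient estimator. This allows the full corpus of historical data to stay on an arbitrary filesystem during training and deployment. Experiments on synthetic benchmarks and the Amazon Reviews '23 (electronics category) dataset show that, compared with standard classifiers, the method is more robust to distribution shift, and that VRAM consumption changes predictably as the length of the historical data sequence grows.

Abstract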

Nonstationarity is ubiquitous in practical classification settings, leading deployed models to perform poorly even when they generalize well to holdout sets available at training time. We address this by reframing nonstationary classification as time series prediction: rather than predicting from the current input alone, we condition the classifier on a sequence of historical labeled examples that extends beyond the training cutoff. To scale to large sequences, we introduce a learned discrete retrieval mechanism that samples relevant historical examples via input-dependent queries, trained end-to-end with the classifier using a score-based gradient estimator. This enables the full corpus of historical data to remain on an arbitrary filesystem during training and deployment. Experiments on synthetic benchmarks and Amazon Reviews '23 (electronics category) show improved robustness to distribution shift compared to standard classifiers, with VRAM scaling predictably as the length of the historical data sequence increases.
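A score-based (REINFORCE-style) gradient estimator is what lets a discrete sampling step be trained end-to-end. The sketch below shows the single-sample estimator for a softmax over retrieval scores; it is a generic textbook form, not the paper's implementation, and the scores, reward, and baseline are hypothetical.

```python
import math
import random
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over retrieval scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def score_function_grad(
    scores: Sequence[float], sampled: int, reward: float, baseline: float = 0.0
) -> List[float]:
    """Single-sample estimate of d E[reward] / d scores.
    For a softmax policy, d log p(sampled) / d scores = onehot(sampled) - probs,
    so the estimate is (reward - baseline) * (onehot - probs)."""
    probs = softmax(scores)
    return [
        (reward - baseline) * ((1.0 if i == sampled else 0.0) - p)
        for i, p in enumerate(probs)
    ]


# Hypothetical query-to-history scores for three stored examples.
rng = random.Random(0)
scores = [0.0, 1.0, -1.0]
probs = softmax(scores)

# Sample one historical example to retrieve, then form the gradient estimate.
sampled = rng.choices(range(len(scores)), weights=probs, k=1)[0]
grad = score_function_grad(scores, sampled, reward=1.0)
print(sampled, grad)
```

With a positive reward the estimate pushes up the score of the retrieved example and pushes down the others, and no gradient has to flow through the sampling operation itself.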

Keywords: nonstationary classification, time series prediction, learned retrieval, historical examples, distribution shift, robustness, score-based gradient estimator, Amazon Reviews

281. ❌ Physics-Informed Functional Link Constrained Framework with Domain Mapping for Solving Bending Analysis of an Exponentially Loaded Perforated Beam

Authors: Iswari Sahu, Ramanath Garai, S. Chakraverty Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

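Scoring rationale: The paper addresses bending analysis of a perforated beam in engineering mechanics and solves the governing differential equation with the DFL-TFC method (Domain mapped physics-informed Functional link Theory of Functional Connection). Its focus is computational mechanics, numerical methods, and engineering application. Although the method builds on a functional link neural network and is compared with a PINN, the paper does not involve large language models or any topic covered by the keywords, which center on large-model techniques and their applications. The relevance for every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes a new method based on DFL-TFC (Domain mapped physics-informed Functional link Theory of Functional Connection) for analyzing the bending behavior of a tapered perforated beam under exponential loading. The results show faster convergence, lower computational cost, and higher solution accuracy than a PINN approach.

Abstract (Chinese translation)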

本文提出了一种新颖且全面的方法，用于分析指数载荷下锥形开孔梁的弯曲行为。控制微分方程包含了填充率（$α$）、孔洞行数（$N$）、锥度参数（$φ$ 和 $ψ$）以及指数载荷参数（$γ$）等重要因素，从而能够真实且灵活地描述开孔梁的构型。本研究的主要目标是评估域映射物理信息功能连接理论功能链接（DFL-TFC）方法在分析指数载荷下方形孔洞开孔梁弯曲响应方面的性能。为便于比较，同时开发了相应的基于物理信息神经网络（PINN）的求解公式。结果清楚地表明，与PINN方法相比，所提出的DFL-TFC框架能提供更优的结果，包括更快的收敛速度、更低的计算成本和更高的求解精度。这些发现凸显了DFL-TFC方法在求解微分方程控制的复杂工程问题方面的有效性和潜力。在该框架内，隐藏层被一个功能扩展块所取代，该模块通过正交多项式基函数丰富输入表示，并将微分方程（DE）的定义域映射到相应正交多项式的定义域。通过功能连接理论（TFC）利用边界条件构建的约束表达式（CE），确保了约束条件被精确满足。在CE中，自由函数由功能链接神经网络（FLNN）表示，该网络通过学习来求解由此产生的无约束优化问题。所得结果进一步通过伽辽金法和PINN解进行了验证。

Abstract

This article presents a novel and comprehensive approach for analyzing bending behavior of the tapered perforated beam under an exponential load. The governing differential equation includes important factors like filling ratio ($α$), number of rows of holes ($N$), tapering parameters ($φ$ and $ψ$), and exponential loading parameter ($γ$), providing a realistic and flexible representation of perforated beam configuration. Main goal of this work is to see how well the Domain mapped physics-informed Functional link Theory of Functional Connection (DFL-TFC) method analyses bending response of perforated beam with square holes under exponential loading. For comparison purposes, a corresponding PINN-based formulation is developed. Outcomes clearly show that the proposed DFL-TFC framework gives better results, including faster convergence, reduced computational cost, and improved solution accuracy when compared to the PINN approach. These findings highlight effectiveness and potential of DFL-TFC method for solving complex engineering problems governed by differential equations. Within this framework, hidden layer is replaced by a functional expansion block that enriches input representation via orthogonal polynomial basis functions, and the domain of DE mapped to corresponding domain of orthogonal polynomials. A Constrained Expression (CE), constructed through the Theory of Functional Connections (TFC) using boundary conditions, ensures that constraints are exactly satisfied. In CE, free function is represented using a Functional Link Neural Network (FLNN), which learns to solve resulting unconstrained optimization problem. The obtained results are further validated through the Galerkin and PINN solutions.

Keywords: perforated beam, exponential loading, DFL-TFC, physics-informed, functional link neural network, bending analysis, differential equations, numerical solution
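
To make the "constrained expression" idea from the abstract concrete, here is a minimal sketch of the best-known Theory of Functional Connections construction for two point constraints, $y(0) = a$ and $y(1) = b$. The beam's actual boundary conditions are not given in the abstract, so these Dirichlet constraints, the function names, and the stand-in free function are all assumptions for illustration only.

```python
import math


def constrained_expression(g, a, b):
    """Build y(x) = g(x) + (1 - x) * (a - g(0)) + x * (b - g(1)).

    For any free function g, the result satisfies y(0) = a and y(1) = b,
    so an optimizer can change g freely without violating the constraints.
    """
    g0, g1 = g(0.0), g(1.0)

    def y(x):
        return g(x) + (1.0 - x) * (a - g0) + x * (b - g1)

    return y


def free_function(x):
    # Hypothetical stand-in for the output of a functional link neural network.
    return math.sin(3.0 * x) + x ** 2


y = constrained_expression(free_function, a=2.0, b=-1.0)
print(y(0.0), y(1.0))  # boundary values, matching a and b up to rounding
```

Because the constraints hold by construction, training reduces to an unconstrained problem over the free function, which is the property the abstract highlights.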

282. ❌ Predictive Representations for Skill Transfer in Reinforcement Learning

Authors: Ruben Vereecken, Luke Dickens, Alessandra Russo Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

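Scoring rationale: The paper focuses on skill transfer and state abstraction in reinforcement learning, proposing Outcome-Predictive State Representations (OPSRs) and a skill framework built on them. It is AI and machine learning research, but the keywords all target large models, the principles of deep learning techniques, or specific application areas such as bioinformatics. The paper studies a foundational algorithmic problem in reinforcement learning and does not touch large models, deep learning architectures, training methods, inference optimization, alignment, agent systems, or scientific applications, so the relevance for every keyword is 0.

!!! tip deepseek-chat TL;DR

Addressing the challenge of knowledge transfer in reinforcement learning, the paper proposes Outcome-Predictive State Representations (OPSRs) and a skill framework built on them. State abstraction lets skills be reused across tasks, which considerably speeds up learning on new tasks.

Abstract (Chinese translation)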

强化学习规模化应用的一个核心挑战在于泛化已习得的行为。若智能体无法传承已获取的知识，则注定要从零开始学习每项任务。本文通过状态抽象提出一种新的迁移学习形式化框架。基于对环境任务无关的紧凑观测（结果），我们引入了结果预测状态表征（Outcome-Predictive State Representations, OPSRs）——一种以智能体为中心、任务无关的抽象表示，由对结果的预测构成。我们从形式化分析和实证研究两方面证明，该表征具备实现最优但有限迁移的潜力，进而通过引入基于OPSR的技能（即基于选项的抽象动作）克服了这一权衡限制。这些技能因状态抽象而能在不同任务间复用。在一系列实证研究中，我们通过演示数据学习基于OPSR的技能，并证明它们能在完全陌生、未经预处理的任务中显著加速学习进程。我们相信，本工作提出的框架是推动强化学习迁移研究的重要进展，尤其为通过结合状态抽象与动作抽象实现知识迁移开辟了前景。

Abstract

A key challenge in scaling up Reinforcement Learning is generalizing learned behaviour. Without the ability to carry forward acquired knowledge an agent is doomed to learn each task from scratch. In this paper we develop a new formalism for transfer by virtue of state abstraction. Based on task-independent, compact observations (outcomes) of the environment, we introduce Outcome-Predictive State Representations (OPSRs), agent-centered and task-independent abstractions that are made up of predictions of outcomes. We show formally and empirically that they have the potential for optimal but limited transfer, then overcome this trade-off by introducing OPSR-based skills, i.e. abstract actions (based on options) that can be reused between tasks as a result of state abstraction. In a series of empirical studies, we learn OPSR-based skills from demonstrations and show how they speed up learning considerably in entirely new and unseen tasks without any pre-processing. We believe that the framework introduced in this work is a promising step towards transfer in RL in general, and towards transfer through combining state and action abstraction specifically.

Keywords: Reinforcement Learning, Skill Transfer, State Abstraction, Outcome-Predictive State Representations, OPSR, Options, Demonstration Learning, Knowledge Generalization

283. ❌ QNAS: A Neural Architecture Search Framework for Accurate and Efficient Quantum Neural Networks

Authors: Kooshan Maleki, Alberto Marchisio, Muhammad Shafique Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07013v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

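Scoring rationale: The paper focuses on architecture search for quantum neural networks (QNNs), at the intersection of quantum computing and AI. The keywords all target large language models (LLMs) and related techniques (training, inference, alignment, agents, and so on), whereas this paper studies quantum neural networks and has no direct connection to the LLM technology stack. The only loosely related keyword is "AI for Science", since quantum computing can be regarded as a frontier of scientific computing. The paper does not explicitly involve bioinformatics or cheminformatics, hence a relevance of 5 (some connection). No other keyword applies, so each of the rest has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes QNAS, a framework that uses neural architecture search to automatically design quantum neural networks balancing accuracy, efficiency, and hardware deployability. It finds accurate, resource-efficient circuits on several benchmarks.

Abstract (Chinese translation)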

设计兼具高精度与可在含噪声中等规模量子（NISQ）硬件上部署的量子神经网络（QNNs）是一项挑战。人工设计的拟设（ansatze）必须在表达能力、可训练性与资源消耗之间取得平衡，而有限的量子比特数往往迫使采用电路切割技术。现有的量子架构搜索方法主要优化精度，仅启发性地控制量子资源开销，且大多忽略了电路切割带来的指数级开销。我们提出QNAS，一种专为混合量子经典神经网络（HQNNs）设计的神经架构搜索框架，它统一了硬件感知评估、多目标优化与切割开销感知。QNAS训练一个共享参数的超级电路（SuperCircuit），并利用NSGA-II算法联合优化三个目标：（i）验证误差，（ii）衡量实际评估时间的运行时成本代理指标，以及（iii）在目标量子比特预算下估算的子电路数量。QNAS通过少量训练轮次评估候选HQNNs，并发现清晰的帕累托前沿，揭示了精度、效率与切割开销之间的权衡关系。在MNIST、Fashion-MNIST和Iris基准测试中，我们观察到嵌入类型和CNOT模式选择对精度和效率均有显著影响：在图像数据集上，角度-y嵌入（angle-y embedding）与稀疏纠缠模式优于其他配置；而在表格数据（Iris）上，振幅嵌入（amplitude embedding）表现卓越。在MNIST上，最佳架构以紧凑的8量子比特、2层电路实现了97.16%的测试精度；在更具挑战性的Fashion-MNIST上，以5量子比特、2层电路实现了87.38%的精度；在Iris上，以4量子比特、2层电路达到了100%的验证精度。QNAS在搜索过程中自动揭示这些设计洞见，引导实践者找到能够在当前硬件上平衡精度、资源效率与实际可部署性的架构。

Abstract

Designing quantum neural networks (QNNs) that are both accurate and deployable on NISQ hardware is challenging. Handcrafted ansatze must balance expressivity, trainability, and resource use, while limited qubits often necessitate circuit cutting. Existing quantum architecture search methods primarily optimize accuracy while only heuristically controlling quantum and mostly ignore the exponential overhead of circuit cutting. We introduce QNAS, a neural architecture search framework that unifies hardware aware evaluation, multi objective optimization, and cutting overhead awareness for hybrid quantum classical neural networks (HQNNs). QNAS trains a shared parameter SuperCircuit and uses NSGA-II to optimize three objectives jointly: (i) validation error, (ii) a runtime cost proxy measuring wall clock evaluation time, and (iii) the estimated number of subcircuits under a target qubit budget. QNAS evaluates candidate HQNNs under a few epochs of training and discovers clear Pareto fronts that reveal tradeoffs between accuracy, efficiency, and cutting overhead. Across MNIST, Fashion-MNIST, and Iris benchmarks, we observe that embedding type and CNOT mode selection significantly impact both accuracy and efficiency, with angle-y embedding and sparse entangling patterns outperforming other configurations on image datasets, and amplitude embedding excelling on tabular data (Iris). On MNIST, the best architecture achieves 97.16% test accuracy with a compact 8 qubit, 2 layer circuit; on the more challenging Fashion-MNIST, 87.38% with a 5 qubit, 2 layer circuit; and on Iris, 100% validation accuracy with a 4 qubit, 2 layer circuit. QNAS surfaces these design insights automatically during search, guiding practitioners toward architectures that balance accuracy, resource efficiency, and practical deployability on current hardware.

Keywords: Quantum Neural Networks, Neural Architecture Search, Hardware Aware, Multi-objective Optimization, Circuit Cutting, NISQ Hardware, Hybrid Quantum Classical Neural Networks, Pareto Front
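
The abstract's three-objective search ends in a Pareto front. The sketch below shows only the non-dominated filtering that defines such a front, not NSGA-II itself (which also ranks later fronts and uses crowding distance). The candidate tuples are made-up values in the order (validation error, runtime cost proxy, estimated subcircuit count), all minimized. They are not results from the paper.

```python
def dominates(p, q):
    """True if p is no worse than q in every objective and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates: (validation error, runtime proxy, subcircuit count).
c1 = (0.03, 12.0, 4)
c2 = (0.05, 8.0, 2)
c3 = (0.06, 13.0, 4)  # worse than c1 in error and runtime, equal in subcircuits
c4 = (0.04, 12.0, 1)
candidates = [c1, c2, c3, c4]

front = pareto_front(candidates)
print(front)
```

Each surviving candidate trades one objective against another, which is the kind of accuracy versus efficiency versus cutting-overhead tradeoff the abstract describes.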

284. ❌ ELC: Evidential Lifelong Classifier for Uncertainty Aware Radar Pulse Classification

Authors: Mohamed Rabie, Chinthana Panagamuwa, Konstantinos G. Kyriakopoulos Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06958v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

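Scoring rationale: The paper targets one application area, radar pulse classification, using deep neural networks, uncertainty quantification, and lifelong learning. It does not involve large language models (LLMs), new principles of deep learning techniques, or applications of large models in other domains. The keywords all relate to large models, the principles of deep learning techniques, or AI for Science, while this paper applies conventional deep neural networks to radar signal processing, so it is unrelated to the scoring keywords.

!!! tip deepseek-chat TL;DR

The paper proposes an Evidential Lifelong Classifier (ELC) that combines uncertainty quantification with lifelong learning to address two challenges in radar pulse classification: learning new pulses efficiently and expressing predictive confidence. On synthetic radar pulse datasets, selective prediction based on evidential uncertainty improves recall by up to 46% at -20 dB SNR, outperforming the Bayesian lifelong classifier at identifying unreliable predictions in low-SNR conditions.

Abstract (Chinese translation)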

可靠的雷达脉冲分类对于电磁战中的态势感知与决策支持至关重要。深度神经网络在雷达脉冲与射频辐射源识别方面已展现出强大性能；然而，其自身难以高效学习新脉冲，且缺乏表达预测置信度的机制。本文通过将不确定性量化与终身学习相结合，以应对这两项挑战。所提出的方法是一种证据终身分类器，其利用证据理论对认知不确定性进行建模。该分类器与基于香农熵量化不确定性的贝叶斯终身分类器进行了对比评估。两者均集成了“学习-剪枝-共享”机制以实现对新脉冲的持续学习，并采用基于不确定性的选择性预测来拒绝不可靠的预测。两种方法在2个合成雷达数据集和3个射频指纹数据集上进行了评估。在合成雷达脉冲数据集上，基于证据不确定性的选择性预测在-20 dB信噪比条件下将召回率最高提升了46%，突显了其在低信噪比条件下相较于贝叶斯终身分类器能更有效地识别不可靠预测。这些结果表明，证据不确定性在置信度与正确性之间建立了强关联，通过使分类器能够表达“未知”状态，提升了证据终身分类器的可信度。

Abstract

Reliable radar pulse classification is essential in Electromagnetic Warfare for situational awareness and decision support. Deep Neural Networks have shown strong performance in radar pulse and RF emitter recognition; however, on their own they struggle to efficiently learn new pulses and lack mechanisms for expressing predictive confidence. This paper integrates Uncertainty Quantification with Lifelong Learning to address both challenges. The proposed approach is an Evidential Lifelong Classifier (ELC), which models epistemic uncertainty using evidence theory. ELC is evaluated against a Bayesian Lifelong Classifier (BLC), which quantifies uncertainty through Shannon entropy. Both integrate Learn-Prune-Share to enable continual learning of new pulses and uncertainty-based selective prediction to reject unreliable predictions. ELC and BLC are evaluated on 2 synthetic radar and 3 RF fingerprinting datasets. Selective prediction based on evidential uncertainty improves recall by up to 46% at -20 dB SNR on synthetic radar pulse datasets, highlighting its effectiveness at identifying unreliable predictions in low-SNR conditions compared to BLC. These findings demonstrate that evidential uncertainty offers a strong correlation between confidence and correctness, improving the trustworthiness of ELC by allowing it to express ignorance.

Keywords: Radar Pulse Classification, Uncertainty Quantification, Lifelong Learning, Evidential Classifier, Deep Neural Networks, Selective Prediction, RF Fingerprinting, Electromagnetic Warfare
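
As an illustration of uncertainty-based selective prediction, the sketch below uses one common Dirichlet-based formulation from the evidential deep learning literature: per-class evidence $e_k \ge 0$, $\alpha_k = e_k + 1$, $S = \sum_k \alpha_k$, and uncertainty $u = K / S$. The abstract does not state which evidential formulation ELC uses, so this choice, the function names, and the rejection threshold are assumptions.

```python
def evidential_uncertainty(evidence):
    """Return (expected class probabilities, uncertainty mass u = K / S).

    evidence holds one non-negative value per class; zero evidence everywhere
    gives u = 1 (total ignorance), and u shrinks as evidence accumulates.
    """
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    probs = [a / strength for a in alpha]
    return probs, len(evidence) / strength


def selective_predict(evidence, max_uncertainty=0.5):
    """Return (class index, u), or (None, u) when the prediction is rejected."""
    probs, u = evidential_uncertainty(evidence)
    if u > max_uncertainty:  # hypothetical threshold; tune on validation data
        return None, u
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, u


print(selective_predict([0.0, 0.0, 0.0]))  # no evidence: rejected
print(selective_predict([9.0, 0.0, 0.0]))  # strong evidence for class 0: accepted
```

Rejected inputs are where the classifier expresses ignorance instead of guessing, which is the behavior the abstract credits for the recall gain at low SNR.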

285. ❌ NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Authors: Zhida Jiang, Zhaolong Xing, Huichao Chai, Tianxing Sun, Qiang Peng, Baopeng Yuan, Jiaxing Wang, Hua Du, Zhixin Wu, Xuemiao Li, Yikui Cao, Xinyu Liu, Yongxiang Feng, Zhen Chen, Ke Zhang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06956v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

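Scoring rationale: The paper "NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining" focuses on optimizing distributed training for large-scale recommendation systems, in particular the embedding lookup and communication latency bottlenecks. It involves deep learning and large-scale model training, but its core content (recommendation systems, embedding training, distributed training frameworks, pipeline optimization) has no direct connection to any of the supplied keywords, which center on specific topics such as large language models, alignment, inference, agents, and AI for science. Every keyword has a relevance of 0 because the paper does not discuss LLMs, MoE, alignment, inference, agents, quantization, or similar topics, and does not cover AI applications in scientific domains.

!!! tip deepseek-chat TL;DR

The paper proposes NestPipe, a large-scale decentralized embedding training framework that uses nested pipelining to resolve the data-movement bottleneck when training recommendation models on clusters with thousands of accelerators. Experiments with 1,536 workers show up to 3.06x speedup and 94.07% scaling efficiency.

Abstract (Chinese translation)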

现代推荐模型的参数量已增长至万亿级别。随着集群规模扩展至千级节点，分布式训练的瓶颈已从计算与内存转移至数据移动，特别是与嵌入层相关的查找和通信延迟。现有解决方案或仅优化单一瓶颈，或通过牺牲训练一致性来提升吞吐量。本文提出NestPipe，一种大规模去中心化嵌入训练框架，在保持同步训练语义的同时解决上述双重瓶颈。NestPipe通过嵌套流水线技术挖掘两种层次化的稀疏并行机会：在批次间层面，双缓冲流水线（Dual-Buffer Pipelining, DBP）通过双缓冲同步构建无陈旧值的五级流水线，在避免嵌入状态陈旧的前提下缓解查找瓶颈；在批次内层面，我们发现了嵌入冻结现象，据此提出冻结窗口流水线（Frozen-Window Pipelining, FWP），通过协调流调度与以键为中心的样本聚类，将All2All通信与稠密计算重叠执行。在1,536个工作节点的生产级GPU与NPU集群上的实验表明，NestPipe最高可实现3.06倍的加速比与94.07%的扩展效率。

Abstract

Modern recommendation models have increased to trillions of parameters. As cluster scales expand to O(1k), distributed training bottlenecks shift from computation and memory to data movement, especially lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.

Keywords: large-scale recommendation training, distributed training, embedding training, nested pipelining, data movement bottleneck, synchronous training semantics, scaling efficiency, GPU/NPU clusters

286. ❌ Continuous-Time Dynamics of the Difference-of-Convex Algorithm

Authors: Yi-Shuai Niu Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06926v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

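Scoring rationale: The paper studies the continuous-time dynamics of a mathematical optimization algorithm, the Difference-of-Convex Algorithm (DCA), within convex optimization and numerical analysis. Neither the title nor the abstract mentions large models, deep learning, AI applications, or any related technique. The keywords all focus on large-model techniques, training methods, inference optimization, alignment, and applications, and are unrelated to this purely mathematical optimization paper.

!!! tip deepseek-chat TL;DR

The paper studies the continuous-time dynamical structure of the difference-of-convex algorithm (DCA) for smooth DC decompositions, reveals its links to nonlinear autonomous systems and Bregman geometry, and analyzes the convergence of a damped DCA scheme and of its limiting flow.

Abstract (Chinese translation)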

本文研究针对具有强凸分量的光滑DC分解的凸差算法（DCA）的连续时间结构。在对偶坐标下，经典DCA完全等价于一个非线性自治系统的全步显式欧拉离散化。这一视角启发了一种阻尼DCA方案，它同时也是Bregman正则化的DCA变体，其步长趋于零的极限产生了一个由分解中凸部分生成的Hessian-Riemannian梯度流。对于阻尼方案，我们证明了单调下降性、渐近临界性、有界性下的Kurdyka-Lojasiewicz收敛性，以及在度量DC-PL不等式下的全局线性收敛速率。对于极限流，我们建立了精确的能量恒等式、有界轨迹的渐近临界性、度量相对误差界下的显式全局收敛速率、Kurdyka-Lojasiewicz假设下的有限长度与单点收敛性，以及在非退化局部极小值附近的局部指数收敛性。分析还揭示了一种全局-局部权衡：半松弛方案在我们的框架中提供了最佳的可证明全局保证，而全步方案在非退化极小值附近具有最快的局部收敛速度。最后，我们证明了同一目标函数的不同DC分解会通过凸分量生成的度量诱导出不同的连续动力学，这为分解质量提供了一个几何判据，并将DCA与Bregman几何联系起来。

Abstract

We study the continuous-time structure of the difference-of-convex algorithm (DCA) for smooth DC decompositions with a strongly convex component. In dual coordinates, classical DCA is exactly the full-step explicit Euler discretization of a nonlinear autonomous system. This viewpoint motivates a damped DCA scheme, which is also a Bregman-regularized DCA variant, and whose vanishing-step limit yields a Hessian-Riemannian gradient flow generated by the convex part of the decomposition. For the damped scheme we prove monotone descent, asymptotic criticality, Kurdyka-Lojasiewicz convergence under boundedness, and a global linear rate under a metric DC-PL inequality. For the limiting flow we establish an exact energy identity, asymptotic criticality of bounded trajectories, explicit global rates under metric relative error bounds, finite-length and single-point convergence under a Kurdyka-Lojasiewicz hypothesis, and local exponential convergence near nondegenerate local minima. The analysis also reveals a global-local tradeoff: the half-relaxed scheme gives the best provable global guarantee in our framework, while the full-step scheme is locally fastest near a nondegenerate minimum. Finally, we show that different DC decompositions of the same objective induce different continuous dynamics through the metric generated by the convex component, providing a geometric criterion for decomposition quality and linking DCA with Bregman geometry.

Keywords: Difference-of-Convex Algorithm, DCA, continuous-time dynamics, Bregman geometry, convex optimization, gradient flow, convergence analysis, DC decomposition
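
One way to read the abstract's "explicit Euler" statement, offered here as an interpretation and not as the paper's notation: for $f = g - h$, a DCA step solves $\nabla g(x_{k+1}) = \nabla h(x_k)$. In the dual coordinate $y = \nabla g(x)$ this is $y_{k+1} = y_k + \lambda\,(\nabla h(\nabla g^*(y_k)) - y_k)$ with $\lambda = 1$, an explicit Euler step for $\dot y = \nabla h(\nabla g^*(y)) - y$, and $\lambda < 1$ gives a damped variant. The toy problem below ($g(x) = x^2$, $h(x) = C \log\cosh x$, $C = 4$) is a hypothetical example chosen so that every map has a closed form.

```python
import math

C = 4.0  # hypothetical weight of the subtracted convex part h


def f(x):
    """Objective f = g - h with g(x) = x**2 and h(x) = C * log(cosh(x))."""
    return x * x - C * math.log(math.cosh(x))


def grad_h(x):
    return C * math.tanh(x)


def damped_dca(x0, lam, steps=200):
    """Iterate y <- y + lam * (grad_h(x) - y) in the dual coordinate y = 2x."""
    y = 2.0 * x0  # y = grad_g(x) for g(x) = x**2
    xs = [x0]
    for _ in range(steps):
        x = y / 2.0  # x = grad_g*(y), the inverse of grad_g
        y = y + lam * (grad_h(x) - y)
        xs.append(y / 2.0)
    return xs


classical = damped_dca(0.5, lam=1.0)  # lam = 1 gives x_{k+1} = 2 * tanh(x_k)
damped = damped_dca(0.5, lam=0.5)
print(classical[-1], damped[-1])
```

On this example both step sizes reach the same critical point, where $x = 2\tanh x$, and the objective never increases along either trajectory. The full step contracts faster near the minimizer, in line with the local behavior the abstract reports for the full-step scheme.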

287. ❌ Evaluating PQC KEMs, Combiners, and Cascade Encryption via Adaptive IND-CPA Testing Using Deep Learning

Authors: Simon Calderon, Niklas Johansson, Onur Günlü Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06942v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

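Scoring rationale: The paper studies deep neural networks (DNNs) as adaptive, practical empirical estimators for evaluating the indistinguishability of cryptographic algorithms in more general IND-CPA settings, covering post-quantum cryptography (PQC) KEMs, combiners, and cascade encryption. Its core is the application of deep learning to cryptographic security validation, an application of AI to a scientific field (cryptography), so it has some connection to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5). The paper does not involve large language models (LLMs), model architectures (such as MoE or SLMs), training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agents, or model compression. Those keywords concern innovations in large models and in the principles of deep learning techniques, while this paper applies conventional deep neural networks to a specific cryptographic task, so each of them has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes using deep neural networks as adaptive empirical estimators: it models the IND-CPA game as a binary classification task to evaluate ciphertext indistinguishability for post-quantum KEMs, combiners, and cascade encryption. No significant distinguishing advantage is found for any tested algorithm or combination, which supports the potential of deep learning for empirical analysis of cryptographic security.

Abstract (Chinese translation)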

确保密文不可区分性是密码学安全的基础，但在实际实现和混合场景中通过经验方法验证这一属性存在现实挑战。向后量子密码学（PQC）的过渡催生了结合经典密码原语与抗量子原语的混合构造，这使得经验验证方法的价值日益凸显。通过将IND-CPA游戏建模为二分类任务，并在带标签的密文数据上使用二元交叉熵损失进行训练，我们研究了用于密文不可区分性判定的深度神经网络（DNN）区分器。我们将此方法应用于PQC密钥封装机制（KEM），具体测试了用于构建ML-KEM、BIKE和HQC等实例的公钥加密（PKE）方案。此外，本文提出了一种新颖的DNN建模扩展方法，用于对混合KEM进行经验性可区分测试，并在PQC KEM与纯RSA、RSA-OAEP以及明文的不同组合上进行了实现与测试。最后，通过将DNN IND-CPA分类框架应用于级联对称加密（测试了AES-CTR、AES-CBC、AES-ECB、ChaCha20和DES-ECB的组合），展示了该方法论的通用性。在对PQC算法、KEM组合器及级联加密的实验中，未发现任何算法或算法组合具有显著优势（双侧二项检验，显著性水平$α=0.01$），这与理论保证一致——即包含至少一个IND-CPA安全组件的混合方案能保持不可区分性，且在所考虑的DNN敌手模型下未发现可利用的模式。这些结果表明，在更一般的IND-CPA场景中，利用深度学习作为一种自适应的、实用的、多功能的经验估计器来评估不可区分性具有潜力，能够为实现方案和组合提供数据驱动的验证，从而补充理论安全性分析。

Abstract

Ensuring ciphertext indistinguishability is fundamental to cryptographic security, but empirically validating this property in real implementations and hybrid settings presents practical challenges. The transition to post-quantum cryptography (PQC), with its hybrid constructions combining classical and quantum-resistant primitives, makes empirical validation approaches increasingly valuable. By modeling IND-CPA games as binary classification tasks and training on labeled ciphertext data with BCE loss, we study deep neural network (DNN) distinguishers for ciphertext indistinguishability. We apply this methodology to PQC KEMs. We specifically test the public-key encryption (PKE) schemes used to construct examples such as ML-KEM, BIKE, and HQC. Moreover, a novel extension of this DNN modeling for empirical distinguishability testing of hybrid KEMs is presented. We implement and test this on combinations of PQC KEMs with plain RSA, RSA-OAEP, and plaintext. Finally, methodological generality is illustrated by applying the DNN IND-CPA classification framework to cascade symmetric encryption, where we test combinations of AES-CTR, AES-CBC, AES-ECB, ChaCha20, and DES-ECB. In our experiments on PQC algorithms, KEM combiners, and cascade encryption, no algorithm or combination of algorithms demonstrates a significant advantage (two-sided binomial test, significance level $α= 0.01$), consistent with theoretical guarantees that hybrids including at least one IND-CPA-secure component preserve indistinguishability, and with the absence of exploitable patterns under the considered DNN adversary model. These illustrate the potential of using deep learning as an adaptive, practical, and versatile empirical estimator for indistinguishability in more general IND-CPA settings, allowing data-driven validation of implementations and compositions and complementing the analytical security analysis.

Keywords: Post-Quantum Cryptography, IND-CPA Testing, Deep Neural Networks, Ciphertext Indistinguishability, KEM Combiners, Cascade Encryption, Empirical Validation, Binary Classification
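
The significance check mentioned in the abstract (two-sided binomial test at $α = 0.01$) can be sketched as follows. The setup is an assumption for illustration: a distinguisher makes `n` guesses, `correct` of them are right, and the null hypothesis is random guessing with success probability 0.5. The function names and sample counts are hypothetical, and the paper's exact test procedure may differ.

```python
from math import comb


def two_sided_binomial_p(n, k):
    """Exact two-sided p-value for k successes in n trials under H0: p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the upper tail starting at max(k, n - k), capped at 1.
    """
    tail = max(k, n - k)
    upper = sum(comb(n, i) for i in range(tail, n + 1))
    return min(1.0, 2 * upper / 2 ** n)


def shows_advantage(n, correct, alpha=0.01):
    """True if the distinguisher's accuracy differs significantly from chance."""
    return two_sided_binomial_p(n, correct) < alpha


# Hypothetical outcomes over 1,000 test ciphertexts.
print(shows_advantage(1000, 520))  # 52.0% accuracy
print(shows_advantage(1000, 550))  # 55.0% accuracy
```

A distinguisher that fails this test has not demonstrated an attack under the chosen model. That is evidence consistent with indistinguishability, not a proof of it, which matches the abstract's framing of the method as a complement to analytical security arguments.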

288. ❌ Equivariant Multi-agent Reinforcement Learning for Multimodal Vehicle-to-Infrastructure Systems

Authors: Charbel Bou Chaaya, Mehdi Bennis Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06914v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

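Scoring rationale: This paper studies multi-agent reinforcement learning (MARL) in vehicle-to-infrastructure (V2I) systems and proposes an equivariant policy network that exploits rotation symmetries, together with a self-supervised multimodal sensing method. Its core is the application of MARL to communication network optimization, which is entirely unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, and so on). The only relevant keyword is 'Multi-agent Systems OR Agent Coordination': the paper explicitly studies coordination in multi-agent systems, which is its core content, so it receives 10 points. No other keyword is addressed.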

!!! tip deepseek-chat TL;DR

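This paper studies distributed multi-agent reinforcement learning based on multimodal sensing in vehicle-to-infrastructure systems and proposes an equivariant policy network and a self-supervised sensing framework. In simulation, the sensing approach achieves more than two-fold accuracy gains over baselines, and the equivariant MARL training achieves more than 50% performance gains over standard approaches.

Abstract (Chinese translation)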

本文研究一种车路协同系统，其中分布式基站作为路侧单元从移动车辆收集多模态无线与视觉数据。我们考虑一个去中心化的速率最大化问题：每个路侧单元依赖其本地观测优化自身资源，同时所有路侧单元必须协作以保证良好的网络性能。通过引入车辆位置相关的旋转对称性，我们将该问题重构为分布式多智能体强化学习问题。为利用这些对称性，我们提出一种新颖的自监督学习框架，其中每个基站智能体通过对齐其多模态观测的隐特征来提取本地区域内车辆的位置。基于各路侧单元获取的感知数据，我们采用具有消息传递层的图神经网络训练一个等变策略网络，使得每个智能体可本地计算其策略，同时所有智能体通过一种信令方案协调彼此策略，该方案克服了部分可观测性并保证了全局策略的等变性。我们在仿真环境中进行了数值实验，其中利用射线追踪与计算机图形技术采集无线和视觉数据。结果表明：我们的自监督多模态感知方法具有良好泛化能力，相比基线方法实现两倍以上的精度提升；所提出的等变多智能体强化学习训练方法高效，相比标准方法获得超过50%的性能增益。

Abstract

In this paper, we study a vehicle-to-infrastructure (V2I) system where distributed base stations (BSs) acting as road-side units (RSUs) collect multimodal (wireless and visual) data from moving vehicles. We consider a decentralized rate maximization problem, where each RSU relies on its local observations to optimize its resources, while all RSUs must collaborate to guarantee favorable network performance. We recast this problem as a distributed multi-agent reinforcement learning (MARL) problem, by incorporating rotation symmetries in terms of vehicles’ locations. To exploit these symmetries, we propose a novel self-supervised learning framework where each BS agent aligns the latent features of its multimodal observation to extract the positions of the vehicles in its local region. Equipped with this sensing data at each RSU, we train an equivariant policy network using a graph neural network (GNN) with message passing layers, such that each agent computes its policy locally, while all agents coordinate their policies via a signaling scheme that overcomes partial observability and guarantees the equivariance of the global policy. We present numerical results carried out in a simulation environment, where ray-tracing and computer graphics are used to collect wireless and visual data. Results show the generalizability of our self-supervised and multimodal sensing approach, achieving more than two-fold accuracy gains over baselines, and the efficiency of our equivariant MARL training, attaining more than 50% performance gains over standard approaches.

Keywords: Multi-agent Reinforcement Learning, Vehicle-to-Infrastructure, Multimodal Sensing, Equivariant Policy, Graph Neural Network, Self-supervised Learning, Decentralized Optimization, Wireless Communication
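The equivariance property that the abstract relies on can be stated as: rotating the vehicle positions and then applying the policy gives the same result as applying the policy and then rotating its output. The sketch below checks that property numerically for two toy policies. Both policies and all names are hypothetical illustrations, not the GNN policy from the paper.

```python
import math


def rotate(point, angle):
    """Rotate a 2D point counter-clockwise by `angle` radians about the origin."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def centroid_policy(positions):
    """Hypothetical policy: steer toward the centroid of the observed vehicles."""
    n = len(positions)
    return (sum(p[0] for p in positions) / n, sum(p[1] for p in positions) / n)


def axis_policy(positions):
    """Hypothetical policy tied to a fixed axis, so it is not equivariant."""
    return (max(p[0] for p in positions), 0.0)


def is_rotation_equivariant(policy, positions, angle, tol=1e-9):
    """Check policy(R x) == R policy(x) for one rotation R."""
    lhs = policy([rotate(p, angle) for p in positions])
    rhs = rotate(policy(positions), angle)
    return math.dist(lhs, rhs) < tol
```

A policy that passes this check for all rotations does not need to relearn the same behavior for every orientation of the scene, which is the motivation usually given for building the symmetry into the network.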

289. ❌ Data Leakage in Automotive Perception: Practitioners’ Insights

Authors: Md Abu Ahammed Babu, Sushant Kumar Pandey, Darko Durisic, Andras Balint, Miroslaw Staron Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

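Scoring rationale: This paper studies data leakage in machine learning models for automotive perception systems, using interviews with industrial practitioners to explore their understanding, experiences, and mitigation strategies. It focuses on ML reliability engineering and data practices and does not touch on large models, the technical principles of deep learning, or scientific applications. All keywords concern large-model techniques, deep learning innovations, or AI applications in science, whereas this paper discusses data management for conventional machine learning in a specific industrial domain (automotive perception), so it is entirely unrelated to the scoring keywords.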

!!! tip deepseek-chat TL;DR

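Through interviews with automotive perception engineers, this study shows that data leakage in industrial practice is a cross-role socio-technical coordination problem, and that institutionalizing data leakage awareness requires shared definitions, traceable data practices, and continuous cross-role communication.

Abstract (Chinese translation)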

数据泄露是指训练数据集与评估数据集之间无意的信息传递，这对汽车感知等安全关键系统中机器学习（ML）模型的可靠性构成微妙而关键的风险。尽管数据泄露在研究中已被广泛认知，但业界从业者如何在实践中真正认识并管理它却鲜为人知。本研究通过对从事汽车感知功能开发的系统设计、开发和验证工程师进行十次半结构化访谈，调查了从业者关于数据泄露的知识、经验及缓解策略。通过反思性主题分析，我们发现关于数据泄露的知识普遍存在且沿角色边界碎片化：机器学习工程师将其概念化为数据划分或验证问题，而设计和验证角色则从代表性和场景覆盖度的角度进行理解。数据泄露的检测通常源于通用考量或观察到的性能异常，而非依赖特定工具。然而，预防数据泄露在实践中更为常见，这主要依赖于经验和知识共享。这些发现表明，泄露控制是一个分布于不同角色和工作流程中的社会技术协调问题。我们讨论了其对机器学习可靠性工程的影响，强调需要建立共享定义、可追溯的数据实践以及持续的跨角色沟通，从而在汽车机器学习开发中将数据泄露意识制度化。

Abstract

Data leakage is the inadvertent transfer of information between training and evaluation datasets that poses a subtle, yet critical, risk to the reliability of machine learning (ML) models in safety-critical systems such as automotive perception. While leakage is widely recognized in research, little is known about how industrial practitioners actually perceive and manage it in practice. This study investigates practitioners’ knowledge, experiences, and mitigation strategies around data leakage through ten semi-structured interviews with system design, development, and verification engineers working on automotive perception functions development. Using reflexive thematic analysis, we identify that knowledge of data leakage is widespread and fragmented along role boundaries: ML engineers conceptualize it as a data-splitting or validation issue, whereas design and verification roles interpret it in terms of representativeness and scenario coverage. Detection commonly arises through generic considerations and observed performance anomalies rather than implying specific tools. However, data leakage prevention is more commonly practiced, which depends mostly on experience and knowledge sharing. These findings suggest that leakage control is a socio-technical coordination problem distributed across roles and workflows. We discuss implications for ML reliability engineering, highlighting the need for shared definitions, traceable data practices, and continuous cross-role communication to institutionalize data leakage awareness within automotive ML development.

Keywords: data leakage, automotive perception, machine learning reliability, industrial practitioners, socio-technical coordination, data practices, cross-role communication, safety-critical systems
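The abstract notes that ML engineers tend to frame leakage as a data-splitting issue. One common way that framing plays out is a group-aware split, sketched below under the assumption that each sample carries an identifier for the recording it came from. The `sequence_id` field and both function names are hypothetical and do not come from the study.

```python
import random


def group_split(samples, group_key, eval_fraction=0.2, seed=0):
    """Split samples so that every group lands entirely in train or in eval."""
    groups = sorted({s[group_key] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_eval = max(1, round(len(groups) * eval_fraction))
    eval_groups = set(groups[:n_eval])
    train = [s for s in samples if s[group_key] not in eval_groups]
    evaluation = [s for s in samples if s[group_key] in eval_groups]
    return train, evaluation


def leaked_groups(train, evaluation, group_key):
    """Return the groups that appear on both sides of a split."""
    return {s[group_key] for s in train} & {s[group_key] for s in evaluation}
```

A frame-level split of consecutive camera frames would typically place near-duplicate frames from one drive on both sides; `leaked_groups` makes that overlap visible, while `group_split` avoids it by construction.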

290. ❌ VertAX: a differentiable vertex model for learning epithelial tissue mechanics

Authors: Alessandro Pasqui, Jim Martin Catacora Ocana, Anshuman Sinha, Matthieu Perez, Fabrice Delbary, Giorgio Gosti, Mattia Miotto, Domenico Caudo, Maxence Ernoult, Hervé Turlier Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06896v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

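Scoring rationale: VertAX is a JAX-based differentiable vertex model framework for simulating and optimizing the mechanical behavior of epithelial tissues, and belongs to computational biophysics and biomechanics. Although it uses machine learning techniques (automatic differentiation, GPU acceleration, end-to-end optimization), its core is biophysical simulation and optimization rather than innovation in large language models (LLMs) or deep learning principles. Every keyword except the last directly concerns specific topics such as LLMs, deep learning techniques, reasoning methods, alignment, and optimization, which are unrelated to the paper's biophysical simulation focus. The last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", scores 5 because the paper applies AI in science (specifically biophysics); however, its core contribution is not an innovation in large models or deep learning but the application of existing machine learning tools to a biological problem.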

!!! tip deepseek-chat TL;DR

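VertAX is a JAX-based differentiable vertex model framework for simulating the mechanics of epithelial tissues. It uses automatic differentiation and optimization for parameter inference and inverse design, addressing the difficulty of tuning the parameters of traditional vertex models.

Abstract (Chinese translation)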

上皮组织通过细胞间的局部力学相互作用动态重塑形态，这一过程可通过顶点模型有效刻画。然而，模型中大量可调参数使得推断与优化面临挑战，这促使我们需要能够灵活建模并学习组织力学的计算框架。本文提出VertAX——一个基于JAX的可微分框架，用于实现汇合上皮的顶点建模。VertAX提供自动微分、GPU加速以及端到端双层优化功能，支持正向模拟、参数推断和逆向力学设计。用户可使用纯Python定义任意能量函数与代价函数，从而实现与机器学习流程的无缝集成。我们在三个代表性任务中展示了VertAX的性能：（一）组织形态发生的正向建模，（二）力学参数推断，以及（三）组织尺度行为的逆向设计。我们对比了三种微分策略——自动微分、隐函数微分和平衡传播，结果表明后者仅需重复的正向无伴随模拟即可近似梯度计算，为将逆向生物物理问题拓展至不可微分模拟器提供了一条只需少量额外工程投入的简洁路径。

Abstract

Epithelial tissues dynamically reshape through local mechanical interactions among cells, a process well captured by vertex models. Yet their many tunable parameters make inference and optimization challenging, motivating computational frameworks that flexibly model and learn tissue mechanics. We introduce VertAX, a differentiable JAX-based framework for vertex-modeling of confluent epithelia. VertAX provides automatic differentiation, GPU acceleration, and end-to-end bilevel optimization for forward simulation, parameter inference, and inverse mechanical design. Users can define arbitrary energy and cost functions in pure Python, enabling seamless integration with machine-learning pipelines. We demonstrate VertAX on three representative tasks: (i) forward modeling of tissue morphogenesis, (ii) mechanical parameter inference, and (iii) inverse design of tissue-scale behaviors. We benchmark three differentiation strategies (automatic differentiation, implicit differentiation, and equilibrium propagation), showing that the last can approximate gradients using repeated forward, adjoint-free simulations alone, offering a simple route for extending inverse biophysical problems to non-differentiable simulators with limited additional engineering effort.

Keywords: differentiable vertex model, epithelial tissue mechanics, JAX framework, bilevel optimization, parameter inference, inverse mechanical design, GPU acceleration, automatic differentiation
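To make "energy functions in pure Python" concrete, the sketch below evaluates a widely used textbook form of the vertex-model energy for a single cell: a quadratic penalty on the deviation of area and perimeter from target values. This form, the parameter names, and the helper functions are assumptions for illustration only. The abstract does not state which energy VertAX uses, and the sketch does not use the VertAX API or JAX.

```python
import math


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def polygon_perimeter(vertices):
    """Sum of the edge lengths of a closed polygon."""
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))


def cell_energy(vertices, area0, perimeter0, k_area=1.0, k_perimeter=1.0):
    """Assumed energy: k_A (A - A0)^2 + k_P (P - P0)^2 for one cell."""
    return (k_area * (polygon_area(vertices) - area0) ** 2
            + k_perimeter * (polygon_perimeter(vertices) - perimeter0) ** 2)
```

A regular hexagon with unit side length has zero energy when the targets equal its own area and perimeter, and positive energy once it is stretched. In a differentiable framework, the gradient of such a function with respect to the vertex coordinates would drive the relaxation.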

291. ❌ MENO: MeanFlow-Enhanced Neural Operators for Dynamical Systems

Authors: Tianyue Yang, Xiao Xue Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06881v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

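Scoring rationale: The paper studies neural operators applied to dynamical systems, which belongs to scientific machine learning. The keywords all relate directly to large language models (LLMs) or deep learning principles, whereas this paper focuses on the neural operator architecture and does not involve LLMs, MoE, SLMs, alignment, reasoning, agents, compression, or similar topics. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since the paper is an application of AI in science (specifically computational physics and fluid dynamics), but it is not a core match, hence a score of 5.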

!!! tip deepseek-chat TL;DR

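The paper proposes MENO, a MeanFlow-enhanced neural operator framework. It addresses the accuracy loss that existing neural operators suffer on high-resolution dynamical systems because high-frequency components are truncated, and it substantially improves prediction accuracy while preserving computational efficiency.

Abstract (Chinese translation)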

神经算子因其网格无关特性与计算效率，已成为动力系统的强大代理模型。然而，基于傅里叶的神经算子框架本质上会截断谱空间中的高频分量，导致在低分辨率数据上训练时，小尺度结构丢失，且在高分辨率下的预测质量下降。尽管基于扩散的增强方法能够恢复多尺度特征，但它们引入了显著的推理开销，削弱了神经算子的效率优势。本文提出MeanFlow-Enhanced Neural Operators（MENO），这是一种新颖的框架，能够以最小的推理成本实现精确的全尺度预测。通过利用改进的MeanFlow方法，MENO以卓越的物理保真度和统计精度，同时恢复了小尺度细节和大尺度动力学特征。我们在三个具有挑战性的动力系统上评估MENO，包括相场动力学、二维Kolmogorov流和活性物质动力学，分辨率高达256×256。在所有基准测试中，与基线神经算子相比，MENO将功率谱密度精度提高了最多2倍，同时比最先进的扩散去噪隐式模型（Diffusion Denoising Implicit Model, DDIM）增强方法实现了12倍的推理加速，有效弥合了精度与效率之间的差距。MENO的灵活性和高效性使其成为科学机器学习应用中高效的代理模型，尤其适用于统计完整性与计算效率均至关重要的场景。

Abstract

Neural operators have emerged as powerful surrogates for dynamical systems due to their grid-invariant properties and computational efficiency. However, the Fourier-based neural operator framework inherently truncates high-frequency components in spectral space, resulting in the loss of small-scale structures and degraded prediction quality at high resolutions when trained on low-resolution data. While diffusion-based enhancement methods can recover multi-scale features, they introduce substantial inference overhead that undermines the efficiency advantage of neural operators. In this work, we introduce MeanFlow-Enhanced Neural Operators (MENO), a novel framework that achieves accurate all-scale predictions with minimal inference cost. By leveraging the improved MeanFlow method, MENO restores both small-scale details and large-scale dynamics with superior physical fidelity and statistical accuracy. We evaluate MENO on three challenging dynamical systems, including phase-field dynamics, 2D Kolmogorov flow, and active matter dynamics, at resolutions up to 256×256. Across all benchmarks, MENO improves the power spectrum density accuracy by up to a factor of 2 compared to baseline neural operators while achieving 12× faster inference than the state-of-the-art Diffusion Denoising Implicit Model (DDIM)-enhanced counterparts, effectively bridging the gap between accuracy and efficiency. The flexibility and efficiency of MENO position it as an efficient surrogate model for scientific machine learning applications where both statistical integrity and computational efficiency are paramount.

Keywords: Neural Operators, Dynamical Systems, MeanFlow Enhancement, Scientific Machine Learning, High-resolution Prediction, Computational Efficiency, Phase-field Dynamics, Kolmogorov Flow
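The loss of small-scale structure that the abstract attributes to spectral truncation can be seen on a one-dimensional toy signal. The sketch below keeps only the lowest Fourier modes of a signal and reconstructs it, as a rough stand-in for what a Fourier layer with a mode cutoff retains. It uses a plain discrete Fourier transform; the function names and the cutoff value are illustrative assumptions, and no neural operator is involved.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def truncate_modes(signal, max_mode):
    """Zero every Fourier mode whose frequency index exceeds max_mode."""
    n = len(signal)
    spectrum = dft(signal)
    # Index k and n - k are the positive and negative copies of one frequency.
    kept = [c if min(k, n - k) <= max_mode else 0 for k, c in enumerate(spectrum)]
    return idft(kept)
```

For a signal made of a mode-2 sine plus a weaker mode-20 sine, truncating at mode 8 returns only the mode-2 component: the fine-scale oscillation is removed entirely. That missing part is what an enhancement stage would have to restore.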

292. ❌ A Data-Informed Variational Clustering Framework for Noisy High-Dimensional Data

Authors: Wan Ping Chen Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06864v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

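Scoring rationale: The paper "A Data-Informed Variational Clustering Framework for Noisy High-Dimensional Data" focuses on clustering noisy high-dimensional data and proposes a variational clustering framework called DIVI. The work belongs to clustering algorithms in conventional machine learning and involves feature selection, adaptive model structure, and variational inference. All scoring keywords relate directly to large models, deep learning principles, or their scientific applications, whereas this paper involves no large models, deep learning, AI for Science, or related techniques (such as fine-tuning, alignment, inference acceleration, or agents). The relevance of every keyword is therefore 0.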

!!! tip deepseek-chat TL;DR

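To address the difficulty of clustering noisy high-dimensional data, the paper proposes DIVI, a variational clustering framework that combines global feature gating with split-based adaptive structure growth. DIVI performs competitively while remaining computationally feasible and yields interpretable feature-gating behavior.

Abstract (Chinese translation)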

在高维场景下进行聚类分析且面临严重特征噪声时，该任务依然具有挑战性，尤其当仅有少量维度具有信息量且最终聚类数量未预先指定的情况下。在此类场景中，分区恢复、特征相关性学习与结构自适应紧密耦合，而标准的基于似然的方法可能变得不稳定或对噪声维度过度敏感。我们提出DIVI，一种数据驱动的变分聚类框架，该框架将全局特征门控与基于分裂的自适应结构生长相结合。DIVI利用信息性先验初始化来稳定优化过程，以可微分方式学习特征相关性，并仅在局部诊断表明欠拟合时才扩展模型复杂度。除聚类性能外，我们还检验了运行时的可扩展性与参数敏感性，以阐明该框架的计算特性与实际表现。实证研究表明，在严重特征噪声下，DIVI表现出竞争力，计算上保持可行，并产生可解释的特征门控行为，同时在挑战性场景中展现出保守的生长模式与可识别的失效机制。总体而言，DIVI应被视为一种针对高维噪声数据的实用变分聚类框架，而非一个完全贝叶斯的生成式解决方案。

Abstract

Clustering in high-dimensional settings with severe feature noise remains challenging, especially when only a small subset of dimensions is informative and the final number of clusters is not specified in advance. In such regimes, partition recovery, feature relevance learning, and structural adaptation are tightly coupled, and standard likelihood-based methods can become unstable or overly sensitive to noisy dimensions. We propose DIVI, a data-informed variational clustering framework that combines global feature gating with split-based adaptive structure growth. DIVI uses informative prior initialization to stabilize optimization, learns feature relevance in a differentiable manner, and expands model complexity only when local diagnostics indicate underfit. Beyond clustering performance, we also examine runtime scalability and parameter sensitivity in order to clarify the computational and practical behavior of the framework. Empirically, we find that DIVI performs competitively under severe feature noise, remains computationally feasible, and yields interpretable feature-gating behavior, while also exhibiting conservative growth and identifiable failure regimes in challenging settings. Overall, DIVI is best viewed as a practical variational clustering framework for noisy high-dimensional data rather than as a fully Bayesian generative solution.

Keywords: variational clustering, high-dimensional data, feature noise, feature gating, adaptive structure growth, partition recovery, interpretable models, computational scalability

293. ❌ Contraction-Aligned Analysis of Soft Bellman Residual Minimization with Weighted Lp-Norm for Markov Decision Problem

Authors: Hyukjun Yang, Han-Dong Lim, Donghwan Lee Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06837v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

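Scoring rationale: This paper studies Bellman residual minimization in Markov decision processes (MDPs), which belongs to reinforcement learning theory and has no direct connection to any keyword. The keywords mainly concern large language models, deep learning techniques, and AI applications, whereas this paper focuses on the mathematical analysis and optimization theory of classical reinforcement learning and involves no large models, deep learning, or AI-for-science applications.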

!!! tip deepseek-chat TL;DR

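The paper addresses the mismatch between Bellman residual minimization and the contraction geometry of the Bellman operator in Markov decision processes. It introduces soft Bellman residual minimization with a weighted Lp-norm, aligns the optimization objective with the contraction geometry, and derives the corresponding performance error bounds.

Abstract (Chinese translation)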

在函数逼近条件下求解马尔可夫决策过程仍是一个基础性难题，即使在线性函数逼近设定下亦是如此。一个核心困难源于几何失配：虽然贝尔曼最优算子在L∞范数下具有压缩性，但常用的目标函数（如投影值迭代和贝尔曼残差最小化）均依赖于基于L2范数的表述。为实现基于梯度的优化，我们考虑贝尔曼残差最小化的柔性表述，并将其推广至广义加权Lp范数框架。我们证明随着p值增大，该表述能使优化目标与贝尔曼算子的压缩几何特性相匹配，并推导出相应的性能误差界。本分析建立了残差最小化与贝尔曼压缩之间的理论联系，在保持梯度优化兼容性的同时，实现了对误差传播的更优控制。

Abstract

The problem of solving Markov decision processes under function approximation remains a fundamental challenge, even under linear function approximation settings. A key difficulty arises from a geometric mismatch: while the Bellman optimality operator is contractive in the L∞-norm, commonly used objectives such as projected value iteration and Bellman residual minimization rely on L2-based formulations. To enable gradient-based optimization, we consider a soft formulation of Bellman residual minimization and extend it to a generalized weighted Lp-norm. We show that this formulation aligns the optimization objective with the contraction geometry of the Bellman operator as p increases, and derive corresponding performance error bounds. Our analysis provides a principled connection between residual minimization and Bellman contraction, leading to improved control of error propagation while remaining compatible with gradient-based optimization.

Keywords: Markov Decision Process, Bellman Residual Minimization, Weighted Lp-norm, Bellman Operator Contraction, Gradient-based Optimization, Performance Error Bounds, Linear Function Approximation
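The claim that the objective aligns with the L∞ geometry "as p increases" rests on a standard fact: for positive weights, the weighted Lp-norm of a vector approaches its largest absolute entry as p grows. The sketch below illustrates this numerically on a made-up residual vector. The weights, the values, and the function name are assumptions, and the code does not reproduce the paper's soft Bellman residual.

```python
def weighted_lp_norm(values, weights, p):
    """(sum_i w_i |x_i|^p)^(1/p) for positive weights.

    The largest magnitude is factored out first so that large p does not
    overflow or underflow.
    """
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 0.0
    scaled = sum(w * (abs(v) / peak) ** p for v, w in zip(values, weights))
    return peak * scaled ** (1.0 / p)
```

With uniform weights over a residual such as `[0.1, -0.4, 2.0, 0.3]`, the norm rises from roughly 1.03 at p = 2 toward the maximum entry 2.0 as p increases, while staying a smooth function of the residual for any finite p.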

294. ❌ CBM-Dual: A 65-nm Fully Connected Chaotic Boltzmann Machine Processor for Dual Function Simulated Annealing and Reservoir Computing

Authors: Kanta Yoshioka, Soshi Hirayae, Yuichiro Tanaka, Yuichi Katori, Takashi Morie, Hakaru Tamukoh Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

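Scoring rationale: This paper studies a dedicated hardware processor (CBM-Dual) that implements a chaotic Boltzmann machine and supports simulated annealing and reservoir computing; it belongs to edge AI and special-purpose computing architectures. All scoring keywords revolve around large language models (LLMs) and related techniques (training, alignment, inference optimization, applications), whereas this paper does not involve language models, deep learning, or large-model techniques at all and focuses on hardware implementation and a specific computational model (chaotic dynamics, Boltzmann machines). It is therefore entirely unrelated to every keyword.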

!!! tip deepseek-chat TL;DR

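This paper presents CBM-Dual, a dual-function chaotic Boltzmann machine processor that supports simulated annealing and reservoir computing. Its optimized scheduler and multiply-splitting scheme substantially reduce computation and chip area, enabling efficient edge AI decision-making and adaptive task execution.

Abstract (Chinese translation)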

本文提出了CBM-Dual，这是首个经过硅验证、同时支持模拟退火（SA）与储备池计算（RC）的数字混沌动力学处理器。该处理器采用目前最大规模的1024神经元全连接混沌玻尔兹曼机，旨在为自主边缘人工智能实现实时决策与轻量化自适应。为应对数字混沌动力学处理器高计算量与高面积成本的问题，我们提出：1）一种专用于混沌玻尔兹曼机的调度器，利用其固有的低神经元翻转率特性，将乘累加运算减少99%；2）一种高效的乘法分割方案，使面积减少59%。基于65纳米工艺制造的芯片（面积12mm²）CBM-Dual能够同时执行异构任务，并实现了领先的能效表现：在模拟退火和储备池计算领域分别取得了25至54倍和4.5倍的性能提升。

Abstract

This paper presents CBM-Dual, the first silicon-proven digital chaotic dynamics processor (CDP) supporting both simulated annealing (SA) and reservoir computing (RC). CBM-Dual enables real-time decision-making and lightweight adaptation for autonomous Edge AI, employing the largest-scale fully connected 1024-neuron chaotic Boltzmann machine (CBM). To address the high computational and area costs of digital CDPs, we propose: 1) a CBM-specific scheduler that exploits an inherently low neuron flip rate to reduce multiply-accumulate operations by 99%, and 2) an efficient multiply splitting scheme that reduces the area by 59%. Fabricated in 65 nm (12 mm²), CBM-Dual achieves simultaneous heterogeneous task execution and state-of-the-art energy efficiency, delivering 25-54× and 4.5× improvements in the SA and RC fields, respectively.

Keywords: Chaotic Boltzmann Machine, Simulated Annealing, Reservoir Computing, Edge AI, Digital Processor, Energy Efficiency, Hardware Optimization, Autonomous Systems
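One plausible reading of how a low flip rate saves multiply-accumulate (MAC) operations is sketched below: in a fully connected network, each neuron's input field is a weighted sum of all states, so when only a few neurons flip, the fields can be patched with one column of weights per flip instead of being recomputed from scratch. The ±1 state encoding and the function names are assumptions for illustration. The abstract does not describe the scheduler's internals, and this is not the chip's actual algorithm.

```python
def full_fields(weights, states):
    """Recompute every input field h_i = sum_j W[i][j] * s_j (N * N MACs)."""
    return [sum(w * s for w, s in zip(row, states)) for row in weights]


def apply_flips(weights, states, fields, flipped):
    """Flip the given neurons in place and patch the fields incrementally.

    Returns the number of MACs spent, which is N per flipped neuron.
    """
    macs = 0
    for j in flipped:
        delta = -2 * states[j]  # a +/-1 state changes by -2 * s_j when flipped
        states[j] = -states[j]
        for i in range(len(fields)):
            fields[i] += weights[i][j] * delta
            macs += 1
    return macs
```

With N neurons and a single flip per step, the incremental update costs N MACs against N × N for a full recompute, so the saving grows with network size whenever flips are rare.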

295. ❌ The Rhetoric of Machine Learning

Authors: Robert C. Williamson Journal/Source: arXiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06754v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

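Scoring rationale: The paper "The Rhetoric of Machine Learning" critically analyzes machine learning from the perspective of rhetoric, arguing that it is inherently persuasive rather than objective, and examines its use in business models. Its subject is the social impact and philosophical critique of machine learning, not large models, deep learning principles, or concrete application innovations. All keywords concern specific technical content such as large-model techniques, optimization methods, and application domains, which is entirely unrelated to the paper's critical humanities perspective, so every keyword has a relevance of 0.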

!!! tip deepseek-chat TL;DR

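The paper critically analyzes machine learning from the perspective of rhetoric, arguing that it is inherently persuasive rather than objective, and examines its use in the "manipulation as a service" business model.

Abstract (Chinese translation)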

本文从修辞学（即说服的艺术）视角审视机器学习技术。我认为，机器学习本质上具有修辞属性，而非一种从数据中构建“世界模型”的中立且“客观”的方法。我将探讨其若干修辞特征，并分析一种广泛运用机器学习的普遍商业模式——“操纵即服务”。

Abstract

I examine the technology of machine learning from the perspective of rhetoric, which is simply the art of persuasion. Rather than being a neutral and “objective” way to build “world models” from data, machine learning is (I argue) inherently rhetorical. I explore some of its rhetorical features, and examine one pervasive business model where machine learning is widely used, “manipulation as a service.”

Keywords: machine learning, rhetoric, persuasion, world models, manipulation as a service, business model, critical analysis

296. ❌ Busemann energy-based attention for emotion analysis in Poincaré discs

Authors: Zinaid Kapić, Vladimir Jaćimović Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

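Scoring rationale: The paper proposes EmBolic, a deep learning architecture based on hyperbolic geometry, for fine-grained emotion analysis of textual messages. Its core is the application of hyperbolic geometry to emotion analysis, including a hyperbolic attention mechanism and Busemann energy computation. None of the keywords on large-model techniques, training methods, inference optimization, or agent systems apply, so every keyword except "AI for Science OR Bioinformatics OR Cheminformatics" scores 0. "AI for Science" receives 5 points because emotion analysis can be viewed as an application of AI to psychology and the social sciences. The relevance is limited, however, because the paper does not explicitly address a scientific domain (such as bioinformatics) and focuses on geometric methods rather than large models.

!!! tip deepseek-chat TL;DR

The study proposes EmBolic, a novel deep learning architecture based on hyperbolic geometry for fine-grained emotion analysis of textual messages. It scores how well a message aligns with each emotion using a hyperbolic attention mechanism and Busemann energies, and experiments show strong generalization and reasonably good prediction accuracy.

Abstract (translated)

This paper presents EmBolic, a novel fully hyperbolic deep learning architecture for fine-grained emotion analysis from textual messages. The underlying idea is that hyperbolic geometry efficiently captures the hierarchies between words and emotions. In this context, these hierarchical relationships arise from semantic ambiguities. EmBolic aims to infer the curvature of the continuous space of emotions rather than treating emotions as a categorical set without any metric structure. At the heart of the architecture is an attention mechanism in the hyperbolic disc: the model is trained to generate queries (points in the hyperbolic disc) from textual messages, while keys (points on the boundary) emerge automatically from the generated queries. Predictions are based on the Busemann energy between queries and keys, which evaluates how well a given textual message aligns with the directions representing emotion classes. Experiments show strong generalization and reasonably good prediction accuracy even when the representation space has a small dimension. Overall, the study supports the claim that affective computing is one of the application domains where hyperbolic representations are particularly advantageous.

Abstract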

We present EmBolic - a novel fully hyperbolic deep learning architecture for fine-grained emotion analysis from textual messages. The underlying idea is that hyperbolic geometry efficiently captures hierarchies between both words and emotions. In our context, these hierarchical relationships arise from semantic ambiguities. EmBolic aims to infer the curvature on the continuous space of emotions, rather than treating them as a categorical set without any metric structure. In the heart of our architecture is the attention mechanism in the hyperbolic disc. The model is trained to generate queries (points in the hyperbolic disc) from textual messages, while keys (points at the boundary) emerge automatically from the generated queries. Predictions are based on the Busemann energy between queries and keys, evaluating how well a certain textual message aligns with the class directions representing emotions. Our experiments demonstrate strong generalization properties and reasonably good prediction accuracy even for small dimensions of the representation space. Overall, this study supports our claim that affective computing is one of the application domains where hyperbolic representations are particularly advantageous.

Keywords: hyperbolic geometry, emotion analysis, attention mechanism, Busemann energy, deep learning architecture, textual messages, affective computing, hierarchical representations
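
The following is a minimal sketch of how a Busemann-based class score could be computed in the Poincaré disc. It uses the standard closed form of the Busemann function for an ideal point $\xi$ on the unit circle, $B_\xi(x) = \log\left(\lVert \xi - x \rVert^2 / (1 - \lVert x \rVert^2)\right)$. The softmax over negated Busemann values, the function names `busemann` and `class_probabilities`, and the example query and keys are illustrative assumptions; the abstract does not give EmBolic's exact energy definition.

```python
import numpy as np

def busemann(x, xi):
    """Busemann function B_xi(x) in the Poincare disc.

    x  : point strictly inside the unit disc
    xi : ideal point on the unit circle (the "key" direction)
    """
    x = np.asarray(x, dtype=float)
    xi = np.asarray(xi, dtype=float)
    return np.log(np.sum((xi - x) ** 2) / (1.0 - np.sum(x ** 2)))

def class_probabilities(query, keys):
    """Softmax over negated Busemann values (an assumed scoring rule).

    A smaller Busemann value means the query lies further along the
    direction of that key, so it receives a larger probability.
    """
    scores = np.array([-busemann(query, k) for k in keys])
    scores -= scores.max()  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

# Three hypothetical emotion directions on the boundary of the disc.
angles = np.array([0.0, 0.5 * np.pi, np.pi])
keys = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# A hypothetical query that leans towards the first direction.
query = np.array([0.6, 0.1])
probs = class_probabilities(query, keys)
```

With this rule a query at the origin is equidistant from every key, and moving the query towards a boundary point increases that class's probability.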

297. ❌ Beyond Pessimism: Offline Learning in KL-regularized Games

Authors: Yuheng Zhang, Claire Chen, Nan Jiang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06738v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

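Scoring rationale: The paper studies offline learning in KL-regularized two-player zero-sum games, which belongs to reinforcement learning and game theory. It has no direct connection to any of the keywords, all of which focus on the technical principles, applications, or related techniques of large models and deep learning.

!!! tip deepseek-chat TL;DR

The paper studies offline learning in KL-regularized two-player zero-sum games and proposes a new algorithm and analytical framework that require no pessimistic estimation, giving the first $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for this setting.

Abstract (translated)

We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized under a KL constraint to a fixed reference policy. Prior work relies on pessimistic value estimation to handle distribution shift and obtains only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. For KL-regularized games, we develop a new pessimism-free algorithm and analytical framework built on the smoothness of KL-regularized best responses and the stability of the Nash equilibrium induced by skew symmetry. This gives the first $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for offline learning in KL-regularized zero-sum games, obtained entirely without pessimism. We further propose an efficient self-play policy optimization algorithm and prove that, when the number of iterations is linear in the sample size, it attains the same fast $\widetilde{\mathcal{O}}(1/n)$ statistical rate as the minimax estimator.

Abstract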

We study offline learning in KL-regularized two-player zero-sum games, where policies are optimized under a KL constraint to a fixed reference policy. Prior work relies on pessimistic value estimation to handle distribution shift, yielding only $\widetilde{\mathcal{O}}(1/\sqrt n)$ statistical rates. We develop a new pessimism-free algorithm and analytical framework for KL-regularized games, built on the smoothness of KL-regularized best responses and a stability property of the Nash equilibrium induced by skew symmetry. This yields the first $\widetilde{\mathcal{O}}(1/n)$ sample complexity bound for offline learning in KL-regularized zero-sum games, achieved entirely without pessimism. We further propose an efficient self-play policy optimization algorithm and prove that, with a number of iterations linear in the sample size, it achieves the same fast $\widetilde{\mathcal{O}}(1/n)$ statistical rate as the minimax estimator.

Keywords: offline learning, KL-regularized games, zero-sum games, pessimism-free algorithm, sample complexity, Nash equilibrium, policy optimization, statistical rate
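
To make the "KL-regularized best response" in the abstract concrete, here is a small sketch for a matrix game. Maximizing $p^\top u - \beta\,\mathrm{KL}(p \,\Vert\, p_{\text{ref}})$ over the simplex has the well-known closed form $p \propto p_{\text{ref}} \odot \exp(u/\beta)$, which is smooth in the payoffs $u$. The rock-paper-scissors payoff matrix, the reference policy, the values of `beta`, and the function name are all illustrative assumptions; this is not the paper's offline algorithm.

```python
import numpy as np

def kl_regularized_best_response(payoffs, ref_policy, beta):
    """Closed-form maximizer of <p, payoffs> - beta * KL(p || ref_policy).

    payoffs    : expected payoff of each action against the opponent
    ref_policy : fixed reference policy with strictly positive entries
    beta       : regularization strength (larger keeps p closer to ref)
    """
    logits = np.log(ref_policy) + payoffs / beta
    logits -= logits.max()  # numerical stability
    policy = np.exp(logits)
    return policy / policy.sum()

# Row player's payoffs in rock-paper-scissors (rows/cols: rock, paper, scissors).
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
ref = np.array([0.5, 0.3, 0.2])        # hypothetical reference policy
opponent = np.array([1.0, 0.0, 0.0])   # opponent always plays rock

soft = kl_regularized_best_response(A @ opponent, ref, beta=1.0)
near_ref = kl_regularized_best_response(A @ opponent, ref, beta=1e6)
greedy = kl_regularized_best_response(A @ opponent, ref, beta=1e-3)
```

With a very large `beta` the response stays at the reference policy, and with a very small `beta` it concentrates on the unregularized best response (paper, in this example).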

298. ❌ Extraction of linearized models from pre-trained networks via knowledge distillation

Authors: Fumito Kimura, Jun Ohkubo Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06732v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

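Scoring rationale: The paper studies the extraction of linearized models from pre-trained neural networks. It has only a moderate connection (5 points) to the keyword "Pre-training OR Continual Pre-training OR Domain Adaptation", because it applies knowledge distillation to a pre-trained network. The remaining keywords do not apply: the paper focuses on linearizing conventional neural networks and does not involve LLMs, MoE, reasoning, alignment, compression, or other large-model techniques.

!!! tip deepseek-chat TL;DR

The paper proposes a framework that combines Koopman operator theory with knowledge distillation to extract linearized models from pre-trained neural networks. On MNIST and Fashion-MNIST, it achieves higher classification accuracy and better numerical stability than the conventional least-squares Koopman approximation.

Abstract (translated)

Recent advances in hardware, such as photonic integrated circuits and optical devices, are driving demand for research on machine learning architectures tailored to linear operations. It is therefore valuable to explore methods for building learning machines that use only linear operations after a simple nonlinear preprocessing step. This study proposes a framework that integrates Koopman operator theory with knowledge distillation to extract a linearized model from a pre-trained neural network for classification tasks. Numerical experiments on the MNIST and Fashion-MNIST datasets show that the proposed model consistently outperforms the conventional least-squares-based Koopman approximation in both classification accuracy and numerical stability.

Abstract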

Recent developments in hardware, such as photonic integrated circuits and optical devices, are driving demand for research on constructing machine learning architectures tailored for linear operations. Hence, it is valuable to explore methods for constructing learning machines with only linear operations after simple nonlinear preprocessing. In this study, we propose a framework to extract a linearized model from a pre-trained neural network for classification tasks by integrating Koopman operator theory with knowledge distillation. Numerical demonstrations on the MNIST and the Fashion-MNIST datasets reveal that the proposed model consistently outperforms the conventional least-squares-based Koopman approximation in both classification accuracy and numerical stability.

Keywords: knowledge distillation, pre-trained neural networks, linearized models, Koopman operator theory, classification tasks, MNIST, Fashion-MNIST, numerical stability
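
For readers unfamiliar with the baseline named in the abstract, the sketch below shows a least-squares Koopman approximation in its generic form (often called EDMD): snapshot pairs are lifted with a fixed nonlinear dictionary, and a single matrix `K` is fitted by least squares so that `lift(X) @ K` approximates `lift(Y)`. The toy linear system, the monomial dictionary, and the names `lift` and `fit_koopman` are assumptions for illustration; the paper's classification setup and its distillation step are not reproduced here.

```python
import numpy as np

def lift(states):
    """Nonlinear preprocessing: monomials up to degree 2 (an illustrative dictionary)."""
    x1, x2 = states[:, 0], states[:, 1]
    return np.stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2], axis=1)

def fit_koopman(X, Y):
    """Least-squares fit of K minimizing ||lift(X) @ K - lift(Y)||_F."""
    K, *_ = np.linalg.lstsq(lift(X), lift(Y), rcond=None)
    return K

# Toy data: snapshot pairs (x, y) with y = A x.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
X = rng.normal(size=(200, 2))
Y = X @ A.T

K = fit_koopman(X, Y)
prediction = lift(X) @ K  # purely linear operation after the lifting step
```

Because the chosen dictionary is closed under this linear map, the fit is exact here and the top-left 2x2 block of `K` recovers `A.T`. For a general nonlinear system the fit is only an approximation whose quality depends on the dictionary.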

299. ❌ Bi-level Heterogeneous Learning for Time Series Foundation Models: A Federated Learning Approach

Authors: Shengchao Chen, Guodong Long, Dikai Liu, Jing Jiang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06727v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

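Scoring rationale: The paper focuses on training methods for time series foundation models (TSFMs), which fall under Foundation Models, so it is highly relevant to the first keyword (8 points). It proposes a federated learning method to handle heterogeneous data, which touches on domain adaptation and pre-training, so it is relevant to the fifth keyword (8 points). It is also an application of AI to a scientific domain (time series analysis), which gives it some relevance to the last keyword (5 points). The other keywords (such as MoE, SFT, RLHF, RAG, and inference acceleration) are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a federated, bi-level heterogeneous learning method for training time series foundation models, addressing the gradient conflicts and degraded representation quality caused by heterogeneous time series data. Experiments show that it outperforms centralized and federated baselines in point and probabilistic forecasting and achieves competitive zero-shot performance.

Abstract (translated)

Heterogeneity in time series data is more pronounced than in vision or language, because temporal dynamics vary substantially across domains and tasks. Existing work on training time series foundation models (TSFMs) from scratch usually adopts mixed-batch strategies that merge large-scale datasets, which can cause gradient conflicts and degrade representation quality. To address this, we propose a fine-grained learning method that distills invariant knowledge from heterogeneous series while reducing cross-domain interference. We characterize heterogeneity at two levels: inter-domain and intra-domain. To tackle this bi-level heterogeneity, we design a federated learning method that mitigates intra-domain conflicts by enforcing domain-invariant and semantically consistent representations through local regularization, and that handles inter-domain discrepancies by strengthening cross-domain collaboration through domain-aware aggregation. Experiments on diverse benchmarks show that TSFMs trained with this method consistently outperform both centralized and federated TSFM baselines in point and probabilistic forecasting, while also achieving competitive zero-shot forecasting performance at scale, offering a flexible path for training TSFMs from scratch in heterogeneous environments.

Abstract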

Heterogeneity in time series data is more pronounced than in vision or language, as temporal dynamics vary substantially across domains and tasks. Existing efforts on training time series foundation models (TSFMs) from scratch are often trained with mixed-batch strategies that merge large-scale datasets, which can cause gradient conflicts and degrade representation quality. To address this, we propose a fine-grained learning method that distills invariant knowledge from heterogeneous series while reducing cross-domain interference. We characterize heterogeneity at two levels: inter-domain and intra-domain. To tackle this bi-level heterogeneity, we design a federated learning method that mitigates intra-domain conflicts by enforcing domain-invariant and semantically consistent representations through local regularization, and addresses inter-domain discrepancies by enhancing cross-domain collaboration via domain-aware aggregation. Experiments across diverse benchmarks show that TSFMs trained with our method consistently outperform both centralized and federated TSFM baselines in point and probabilistic forecasting, while also achieving competitive zero-shot performance at scale, offering a flexible pathway for training TSFMs from scratch in heterogeneous environments.

Keywords: Time Series Foundation Models, Federated Learning, Heterogeneous Data, Domain Adaptation, Representation Learning, Zero-shot Performance, Forecasting
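
The abstract mentions server-side "domain-aware aggregation" without specifying how the weights are chosen. The sketch below shows only the generic building block such a scheme would rest on: a weighted average of client parameters. The per-client weights, the parameter name `"w"`, and the function name `aggregate` are hypothetical placeholders, not the paper's aggregation rule.

```python
import numpy as np

def aggregate(client_params, client_weights):
    """Weighted average of client parameter dictionaries.

    client_params  : list of dicts mapping parameter name -> ndarray
    client_weights : one non-negative weight per client (normalized here);
                     how a domain-aware scheme would set them is not
                     specified in the abstract, so they are inputs.
    """
    weights = np.asarray(client_weights, dtype=float)
    weights = weights / weights.sum()
    names = client_params[0].keys()
    return {
        name: sum(w * params[name] for w, params in zip(weights, client_params))
        for name in names
    }

# Two hypothetical clients (for example, two time series domains).
client_a = {"w": np.array([0.0, 2.0])}
client_b = {"w": np.array([4.0, 6.0])}
global_params = aggregate([client_a, client_b], client_weights=[1.0, 3.0])
```

Uniform weights reduce this to plain parameter averaging; a domain-aware variant would presumably derive the weights from some measure of domain similarity.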

300. ❌ CASE: Cadence-Aware Set Encoding for Large-Scale Next Basket Repurchase Recommendation

Authors: Yanan Cao, Ashish Ranjan, Sinduja Subramaniam, Evren Korpeoglu, Kaushiki Nag, Kannan Achan Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06718v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

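Scoring rationale: The paper focuses on temporal modeling and set encoding for retail recommender systems. It studies a recommendation model (CASE) and does not involve large language models, innovations in the technical principles of deep learning, AI for Science, or any other keyword topic. Its content has no direct connection to any scoring keyword, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes CASE for cadence-based next basket repurchase prediction in large-scale retail recommendation. Using calendar-time modeling and multi-scale temporal convolutions, it achieves up to 8.6% relative Precision lift and 9.9% Recall lift at top-5 in a production-scale evaluation.

Abstract (translated)

Repurchase behavior is a core signal in large-scale retail recommendation, especially in categories with frequent replenishment: many items in a user's next basket have been purchased before, and their timing follows stable, item-specific cadences. However, most next basket repurchase recommendation models represent history as a sequence of discrete basket events indexed by visit order, which cannot explicitly model elapsed calendar time or update item rankings as days pass between purchases. This paper presents CASE (Cadence-Aware Set Encoding for next basket repurchase recommendation), which decouples item-level cadence learning from cross-item interaction, enabling explicit calendar-time modeling while remaining production-scalable. CASE represents each item's purchase history as a calendar-time signal over a fixed horizon, applies shared multi-scale temporal convolutions to capture recurring rhythms, and uses induced set attention to model cross-item dependencies with sub-quadratic complexity, which supports efficient batch inference at scale. On three public benchmarks and one proprietary dataset, CASE consistently improves Precision, Recall, and NDCG at multiple cutoffs over strong next basket prediction baselines. In a production-scale evaluation with tens of millions of users and a large item catalog, CASE achieves up to 8.6% relative Precision lift and 9.9% Recall lift at top-5, showing that scalable cadence-aware modeling yields measurable gains in both benchmark and industrial settings.

Abstract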

Repurchase behavior is a primary signal in large-scale retail recommendation, particularly in categories with frequent replenishment: many items in a user’s next basket were previously purchased and their timing follows stable, item-specific cadences. Yet most next basket repurchase recommendation models represent history as a sequence of discrete basket events indexed by visit order, which cannot explicitly model elapsed calendar time or update item rankings as days pass between purchases. We present CASE (Cadence-Aware Set Encoding for next basket repurchase recommendation), which decouples item-level cadence learning from cross-item interaction, enabling explicit calendar-time modeling while remaining production-scalable. CASE represents each item’s purchase history as a calendar-time signal over a fixed horizon, applies shared multi-scale temporal convolutions to capture recurring rhythms, and uses induced set attention to model cross-item dependencies with sub-quadratic complexity, allowing efficient batch inference at scale. Across three public benchmarks and a proprietary dataset, CASE consistently improves Precision, Recall, and NDCG at multiple cutoffs compared to strong next basket prediction baselines. In a production-scale evaluation with tens of millions of users and a large item catalog, CASE achieves up to 8.6% relative Precision and 9.9% Recall lift at top-5, demonstrating that scalable cadence-aware modeling yields measurable gains in both benchmark and industrial settings.

Keywords: next basket recommendation, repurchase behavior, cadence-aware modeling, temporal convolutions, set encoding, large-scale recommendation, retail recommendation, production-scalable
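
The two item-level steps described in the abstract, a calendar-time signal over a fixed horizon followed by multi-scale temporal convolutions, can be sketched as follows. The 28-day horizon, the fixed box-filter kernels of width 3, 7, and 14, and the function names are assumptions for illustration; CASE learns its shared convolution kernels, and its induced set attention across items is not shown.

```python
import numpy as np

def calendar_signal(days_ago, horizon):
    """Binary daily signal: entry d is 1 if the item was bought d days ago."""
    signal = np.zeros(horizon)
    for d in days_ago:
        if 0 <= d < horizon:
            signal[d] = 1.0
    return signal

def multi_scale_features(signal, kernel_sizes=(3, 7, 14)):
    """Apply one averaging kernel per scale (stand-ins for learned kernels)."""
    features = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k
        features.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(features)

# A hypothetical item bought on a weekly cadence: 2, 9, 16, and 23 days ago.
signal = calendar_signal([2, 9, 16, 23], horizon=28)
features = multi_scale_features(signal)  # shape: (number of scales, horizon)
```

For this weekly item the 7-day scale is flat at 1/7 away from the edges of the horizon, since every 7-day window contains exactly one purchase, while the 3-day scale still shows the individual purchase days. This is the kind of cadence information that a visit-order index cannot express.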

301. ❌ Bi-Lipschitz Autoencoder With Injectivity Guarantee

Authors: Qipeng Zhan, Zhuoping Zhou, Zexuan Wang, Qi Long, Li Shen Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06701v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

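Scoring rationale: The paper studies an improvement to autoencoders. It addresses encoder non-injectivity and geometry preservation and proposes the Bi-Lipschitz Autoencoder (BLAE). The content is entirely about regularization, geometry preservation, and robustness of conventional autoencoders, and it does not involve large language models (LLMs), innovations in the technical principles of deep learning, or applications of large models in other domains. All scoring keywords concern large models, deep learning techniques, or scientific applications of AI, whereas this paper belongs to classical machine learning and representation learning and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper addresses non-injective mappings and insufficient geometry preservation in autoencoder-based dimensionality reduction. It proposes the Bi-Lipschitz Autoencoder (BLAE), which uses injective regularization and a bi-Lipschitz relaxation to ensure injectivity and improve robustness to data distribution drift, and experiments show that it outperforms existing methods on multiple datasets.

Abstract (translated)

Autoencoders are widely used for dimensionality reduction, under the assumption that high-dimensional data lies on low-dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non-injective mappings and overly rigid constraints, which limit their effectiveness and robustness. This work identifies encoder non-injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for it to hold. We propose the Bi-Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi-Lipschitz relaxation that preserves geometry while remaining robust to data distribution drift. Experimental results on multiple datasets show that BLAE consistently outperforms existing methods in preserving manifold structure and is resilient to sampling sparsity and distribution shifts. Code is available at https://github.com/qipengz/BLAE.

Abstract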

Autoencoders are widely used for dimensionality reduction, based on the assumption that high-dimensional data lies on low-dimensional manifolds. Regularized autoencoders aim to preserve manifold geometry during dimensionality reduction, but existing approaches often suffer from non-injective mappings and overly rigid constraints that limit their effectiveness and robustness. In this work, we identify encoder non-injectivity as a core bottleneck that leads to poor convergence and distorted latent representations. To ensure robustness across data distributions, we formalize the concept of admissible regularization and provide sufficient conditions for its satisfaction. In this work, we propose the Bi-Lipschitz Autoencoder (BLAE), which introduces two key innovations: (1) an injective regularization scheme based on a separation criterion to eliminate pathological local minima, and (2) a bi-Lipschitz relaxation that preserves geometry and exhibits robustness to data distribution drift. Empirical results on diverse datasets show that BLAE consistently outperforms existing methods in preserving manifold structure while remaining resilient to sampling sparsity and distribution shifts. Code is available at https://github.com/qipengz/BLAE.

Keywords: Autoencoder, Dimensionality Reduction, Manifold Learning, Bi-Lipschitz, Injective Regularization, Robustness, Data Distribution Shift, Geometry Preservation
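
As background for the terminology, an encoder $f$ is bi-Lipschitz if there are constants $0 < m \le M$ with $m\lVert x - y\rVert \le \lVert f(x) - f(y)\rVert \le M\lVert x - y\rVert$ for all inputs; a positive lower constant $m$ implies injectivity. The sketch below estimates these constants empirically over sample pairs. It is only a diagnostic under assumed toy encoders (`scale`, `project`) and is not the BLAE regularizer, whose form the abstract does not give.

```python
import numpy as np

def empirical_bilipschitz_bounds(encoder, points):
    """Min and max of ||f(x) - f(y)|| / ||x - y|| over all pairs of distinct points."""
    codes = np.array([encoder(p) for p in points])
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            input_dist = np.linalg.norm(points[i] - points[j])
            code_dist = np.linalg.norm(codes[i] - codes[j])
            ratios.append(code_dist / input_dist)
    return min(ratios), max(ratios)

def scale(p):
    """Toy injective encoder: every distance is multiplied by 2."""
    return 2.0 * p

def project(p):
    """Toy non-injective encoder: drops the third coordinate."""
    return p[:2]

rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 3))
scale_lower, scale_upper = empirical_bilipschitz_bounds(scale, samples)

# Two of these points differ only in the dropped coordinate, so they collide.
colliding = np.array([[0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 0.0, 0.0]])
proj_lower, proj_upper = empirical_bilipschitz_bounds(project, colliding)
```

The scaling encoder has both bounds equal to 2, while the projection has a lower bound of 0 on the colliding pair, which is the non-injectivity failure the abstract identifies as a bottleneck.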

302. ❌ Towards Accurate and Calibrated Classification: Regularizing Cross-Entropy From A Generative Perspective

Authors: Qipeng Zhan, Zhuoping Zhou, Li Shen Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06689v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

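Scoring rationale: The paper focuses on the accuracy and calibration of deep learning classifiers. It proposes a new loss function (Generative Cross-Entropy) and validates it on image classification benchmarks (CIFAR-10/100, Tiny-ImageNet) and a medical imaging benchmark. The content has no direct connection to any scoring keyword, all of which concern large-model techniques, training methods, inference optimization, or application domains. It does not address large models, language models, MoE, scaling laws, pre-training or post-training, alignment, RAG, context extension, attention optimization, reasoning techniques, agents, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for science.

!!! tip deepseek-chat TL;DR

The paper addresses the trade-off between accuracy and calibration in deep neural network classification. It proposes a new loss function (GCE) that regularizes cross-entropy from a generative perspective and improves both classification accuracy and confidence calibration on multiple benchmarks.

Abstract (translated)

Accurate classification requires not only high predictive accuracy but also well-calibrated confidence estimates. However, modern deep neural networks (DNNs) are often overconfident, primarily because they overfit the negative log-likelihood (NLL). Although focal loss variants alleviate this problem, they typically reduce accuracy, which reveals a persistent trade-off between calibration and predictive performance. Motivated by the complementary strengths of generative and discriminative classifiers, we propose Generative Cross-Entropy (GCE), which maximizes $p(x|y)$ and is equivalent to cross-entropy augmented with a class-level confidence regularizer. Under mild conditions, GCE is strictly proper. On CIFAR-10/100, Tiny-ImageNet, and a medical imaging benchmark, GCE improves both accuracy and calibration over cross-entropy, especially in long-tailed scenarios. Combined with adaptive piecewise temperature scaling (ATS), GCE attains calibration competitive with focal-loss variants without sacrificing accuracy.

Abstract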

Accurate classification requires not only high predictive accuracy but also well-calibrated confidence estimates. Yet, modern deep neural networks (DNNs) are often overconfident, primarily due to overfitting on the negative log-likelihood (NLL). While focal loss variants alleviate this issue, they typically reduce accuracy, revealing a persistent trade-off between calibration and predictive performance. Motivated by the complementary strengths of generative and discriminative classifiers, we propose Generative Cross-Entropy (GCE), which maximizes $p(x|y)$ and is equivalent to cross-entropy augmented with a class-level confidence regularizer. Under mild conditions, GCE is strictly proper. Across CIFAR-10/100, Tiny-ImageNet, and a medical imaging benchmark, GCE improves both accuracy and calibration over cross-entropy, especially in the long-tailed scenario. Combined with adaptive piecewise temperature scaling (ATS), GCE attains calibration competitive with focal-loss variants without sacrificing accuracy.

Keywords: classification, calibration, cross-entropy, generative classifier, confidence estimation, deep neural networks, loss function, medical imaging
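
One way to read the claim that maximizing $p(x|y)$ is "equivalent to cross-entropy augmented with a class-level confidence regularizer" is through Bayes' rule: $-\log p(x|y) = -\log p(y|x) + \log p(y) - \log p(x)$, where $\log p(x)$ does not depend on the classifier. The sketch below implements that reading, estimating $p(y)$ by the batch mean of the predicted probabilities. This estimator and the name `gce_like_loss` are assumptions; the paper's actual GCE formulation may differ.

```python
import numpy as np

def gce_like_loss(probs, labels):
    """Cross-entropy plus log of the mean predicted confidence per class.

    probs  : (n, num_classes) predicted probabilities p(y | x_i)
    labels : (n,) integer class labels
    Returns (total, cross_entropy, regularizer), each averaged over the batch.
    """
    n = len(labels)
    cross_entropy = -np.log(probs[np.arange(n), labels])
    class_confidence = probs.mean(axis=0)        # assumed batch estimate of p(y)
    regularizer = np.log(class_confidence[labels])
    total = cross_entropy + regularizer
    return float(total.mean()), float(cross_entropy.mean()), float(regularizer.mean())

# Hypothetical predictions for three samples and two classes.
probs = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 0, 1])
total, ce, reg = gce_like_loss(probs, labels)
```

Under this reading the regularizer is never positive, and the total depends on how a sample's confidence compares with the average confidence assigned to its class across the batch rather than on the raw confidence alone.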

303. ❌ GraphWalker: Graph-Guided In-Context Learning for Clinical Reasoning on Electronic Health Records

Authors: Yue Fang, Weibin Liao, Yuxin Guo, Jiaran Gao, Hongxin Ding, Jinyang Zhang, Xinke Jiang, Zhibang Yang, Junfeng Zhao, Yasha Wang, Liantao Ma Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

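Scoring rationale: The paper's core topic is in-context learning (ICL) with large language models (LLMs) for clinical reasoning on electronic health records (EHRs), an innovative application of large models in the biomedical domain. It is therefore highly relevant (10 points) to "Large Language Models OR LLMs OR Foundation Models", "In-context Learning OR Many-shot Learning", and "AI for Science OR Bioinformatics OR Cheminformatics". The paper does not address the specific techniques behind the other keywords (such as MoE, SFT, RAG, or quantization), so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper targets three challenges that in-context learning with large language models faces in clinical reasoning on electronic health records: perspective limitation, cohort awareness, and information aggregation. It proposes the GraphWalker framework, which integrates data-driven and model-driven perspectives, introduces cohort discovery, and uses a lazy greedy search algorithm, and it reports substantial improvements in clinical reasoning performance.

Abstract (translated)

Clinical reasoning on electronic health records (EHRs) is a fundamental yet challenging task in modern healthcare. Although in-context learning (ICL) offers a promising inference-time adaptation paradigm for large language models (LLMs) in EHR reasoning, existing methods face three fundamental challenges: (1) perspective limitation, where data-driven similarity fails to align with LLM reasoning needs and model-driven signals are constrained by limited clinical competence; (2) lack of cohort awareness, where demonstrations are selected independently without modeling population-level structure; and (3) insufficient information aggregation, where redundancy and interaction effects among demonstrations are ignored, leading to diminishing marginal gains. To address these challenges, we propose GraphWalker, a principled demonstration selection framework for EHR-oriented ICL. GraphWalker (i) jointly models patient clinical information and LLM-estimated information gain by integrating data-driven and model-driven perspectives, (ii) introduces a cohort discovery mechanism to avoid noisy local optima, and (iii) employs a lazy greedy search algorithm with frontier expansion to mitigate diminishing marginal returns in information aggregation. Extensive experiments on multiple real-world EHR benchmarks show that GraphWalker consistently outperforms state-of-the-art ICL baselines and yields substantial improvements in clinical reasoning performance. Our code is open-sourced at https://github.com/PuppyKnightUniversity/GraphWalker

Abstract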

Clinical Reasoning on Electronic Health Records (EHRs) is a fundamental yet challenging task in modern healthcare. While in-context learning (ICL) offers a promising inference-time adaptation paradigm for large language models (LLMs) in EHR reasoning, existing methods face three fundamental challenges: (1) Perspective Limitation, where data-driven similarity fails to align with LLM reasoning needs and model-driven signals are constrained by limited clinical competence; (2) Cohort Awareness, as demonstrations are selected independently without modeling population-level structure; and (3) Information Aggregation, where redundancy and interaction effects among demonstrations are ignored, leading to diminishing marginal gains. To address these challenges, we propose GraphWalker, a principled demonstration selection framework for EHR-oriented ICL. GraphWalker (i) jointly models patient clinical information and LLM-estimated information gain by integrating data-driven and model-driven perspectives, (ii) incorporates Cohort Discovery to avoid noisy local optima, and (iii) employs a Lazy Greedy Search with Frontier Expansion algorithm to mitigate diminishing marginal returns in information aggregation. Extensive experiments on multiple real-world EHR benchmarks demonstrate that GraphWalker consistently outperforms state-of-the-art ICL baselines, yielding substantial improvements in clinical reasoning performance. Our code is open-sourced at https://github.com/PuppyKnightUniversity/GraphWalker

Keywords: Clinical Reasoning, Electronic Health Records, In-context Learning, Large Language Models, Demonstration Selection, GraphWalker, EHR, LLMs
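
The "lazy greedy search" named in the abstract can be illustrated with the classic lazy-evaluation trick for greedy subset selection. The sketch below is a generic illustration, not GraphWalker's algorithm: the coverage-based gain function, the toy `covers` data, and the name `lazy_greedy` are all assumptions made for this example, and frontier expansion and cohort discovery are omitted entirely. It assumes the gain is submodular, so a previously computed gain is an upper bound on the current one.

```python
import heapq

def lazy_greedy(candidates, gain, k):
    """Select up to k items, re-evaluating marginal gains lazily.

    Assumes gain(item, selected) is submodular (marginal gains never grow
    as `selected` grows), so a stale heap entry is an upper bound.
    """
    selected = []
    # Max-heap via negated gains; each entry stores the selection size
    # at which its gain was computed.
    heap = [(-gain(c, selected), i, c, 0) for i, c in enumerate(candidates)]
    heapq.heapify(heap)
    while heap and len(selected) < k:
        neg_g, i, c, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # Gain is fresh for the current selection, so c is the true argmax.
            if -neg_g <= 0:
                break  # no remaining candidate adds information
            selected.append(c)
        else:
            # Stale bound: recompute against the current selection and push back.
            heapq.heappush(heap, (-gain(c, selected), i, c, len(selected)))
    return selected

# Hypothetical toy data: each demonstration "covers" a set of clinical concepts.
covers = {
    "d1": {"a", "b", "c"},
    "d2": {"a", "b"},  # fully redundant once d1 is chosen
    "d3": {"d", "e"},
    "d4": {"c", "f"},
}

def coverage_gain(item, selected):
    # Marginal gain = number of concepts not yet covered by the selection.
    already = set().union(*(covers[s] for s in selected))
    return len(covers[item] - already)

picked = lazy_greedy(list(covers), coverage_gain, k=3)
print(picked)
```

The redundant demonstration `d2` is never selected, because its marginal gain drops to zero once `d1` is in the set; this is the diminishing-returns effect that the abstract says independent selection ignores.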

304. ❌ Foundry: Template-Based CUDA Graph Context Materialization for Fast LLM Serving Cold Start

Authors: Xueshen Liu, Yongji Wu, Yuncheng Yao, Danyang Zhuo, Ion Stoica, Z. Morley Mao Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	8.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

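Scoring rationale: The paper focuses on cold-start optimization for LLM serving systems; its Foundry system sharply reduces CUDA graph capture time, so it is highly relevant to 'Large Language Models' (10 points). The paper explicitly states support for MoE models (such as Qwen3-235B-A22B), which relates to 'Mixture of Experts' (8 points). The system speeds up graph reconstruction through templating, an inference acceleration technique, which relates to 'Speculative Decoding OR Inference Acceleration' (8 points). The remaining keywords concern model training, alignment, reasoning methods, scientific applications, and so on, none of which the paper covers, so they all receive 0 points.

!!! tip deepseek-chat TL;DR

The paper addresses the cold-start latency that CUDA graph capture causes in LLM serving and proposes Foundry, a system that uses template-based context materialization to cut the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds.

Abstract translation

Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to cope with rapidly changing workloads, but cold-start latency remains a major bottleneck. Although recent systems have cut model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. CUDA graphs, however, cannot be serialized directly: beyond the graph topology, they are tightly coupled to the execution context, including device addresses embedded in kernel arguments and kernel code loaded lazily during warmup. Existing approaches rely either on brittle kernel-specific patching or on heavyweight process-level checkpoint/restore, and adapt poorly to dynamic parallelism switching. This paper presents Foundry, a template-based CUDA graph context materialization system that persists the graph topology and execution context in an offline processing stage and reconstructs executable graphs online with negligible overhead. Foundry enforces deterministic memory layouts, automatically extracts and reloads the kernel binaries that captured graphs require, and lowers online reconstruction cost through topology-based templating. For distributed serving, Foundry further allows a single-GPU offline capture to generate templates for multi-GPU deployments, patching only rank-dependent communication state. Across dense models and mixture-of-experts models of up to 235B parameters, Foundry reduces cold-start latency by up to 99%, cutting the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds while preserving the throughput benefits of CUDA graphs.

Abstract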

Modern LLM service providers increasingly rely on autoscaling and parallelism reconfiguration to respond to rapidly changing workloads, but cold-start latency remains a major bottleneck. While recent systems have reduced model weight loading to seconds, CUDA graph capture still takes tens of seconds to minutes and often dominates startup. Unfortunately, CUDA graphs cannot be naively serialized: beyond graph topology, they are tightly coupled to execution context, including device addresses embedded in kernel arguments and kernel code lazily loaded during warmup. Existing approaches either rely on brittle kernel-specific patching or heavyweight process-level checkpoint/restore that are inflexible to dynamic parallelism switching. We present Foundry, a template-based CUDA graph context materialization system that persists both graph topology and execution context during an offline processing stage, and reconstructs executable graphs online with negligible overhead. Foundry enforces deterministic memory layouts, automatically extracts and reloads kernel binaries required by captured graphs, and reduces online reconstruction costs through topology-based templating. For distributed serving, Foundry further enables a single-GPU offline capture to generate templates for multi-GPU deployments by patching only rank-dependent communication state. Across dense and MoE models up to 235B parameters, Foundry reduces cold-start latency by up to 99%, cutting the initialization time of Qwen3-235B-A22B from 10 minutes to 3.9 seconds while preserving the throughput gains of CUDA graphs.

Keywords: LLM serving, cold-start latency, CUDA graph, template-based materialization, MoE models, inference acceleration, distributed serving, Qwen3-235B-A22B
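
The abstract notes that captured graphs embed device addresses in kernel arguments and that Foundry enforces deterministic memory layouts. The toy model below is only meant to show why those two facts go together; it does not use CUDA and is not Foundry's implementation. The `KernelNode` and `BumpAllocator` classes, the `capture` function, and the `BASE` address are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelNode:
    name: str
    arg_addrs: tuple  # absolute "device addresses" baked into the node

class BumpAllocator:
    """Toy allocator: same base + same request sequence => same addresses."""
    def __init__(self, base, align=256):
        self.base, self.align, self.offset = base, align, 0

    def alloc(self, nbytes):
        addr = self.base + self.offset
        # Advance by the request size rounded up to the alignment boundary.
        self.offset += -(-nbytes // self.align) * self.align
        return addr

def capture(allocator):
    """Stand-in for warmup + capture: allocate buffers, record a two-node graph."""
    x = allocator.alloc(1000)
    w = allocator.alloc(4096)
    y = allocator.alloc(1000)
    return [KernelNode("matmul", (x, w, y)), KernelNode("relu", (y,))]

BASE = 0x7F00_0000_0000  # hypothetical fixed virtual base address

offline_graph = capture(BumpAllocator(BASE))          # persisted in the offline stage
online_live = capture(BumpAllocator(BASE))            # fresh process, layout reproduced
shifted_live = capture(BumpAllocator(BASE + 0x1000))  # layout not reproduced

print(offline_graph == online_live)   # saved addresses still point at live buffers
print(offline_graph == shifted_live)  # saved addresses would dangle
```

When the online process reproduces the allocation sequence from the same base, the addresses stored in the persisted graph remain valid without patching each kernel argument; when the layout shifts, every stored address is wrong, which is the failure mode that naive serialization runs into.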

305. ❌ Generation time in a discrete epidemic model with asymptomatic carriers: beyond geometric waiting times

Authors: Jordi Ripoll, Joan Saldaña Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07309v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

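Scoring rationale: The paper studies the generation-time distribution in infectious-disease transmission chains with asymptomatic carriers, which belongs to traditional mathematical epidemiology. It uses a discrete-time non-Markovian model to analyze random waiting times in the latent, asymptomatic, and symptomatic stages and derives the probability distribution and moments of the generation time. All scoring keywords concern large models, deep learning, and related techniques (such as MoE, RLHF, RAG, and quantization), whereas this paper involves no artificial intelligence, machine learning, or large-model techniques at all; it is purely a mathematical epidemiology study. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper uses a discrete-time non-Markovian model to study the generation-time distribution in infectious-disease transmission chains with symptomatic and asymptomatic cases, derives its probability distribution and moments, and applies the results to data-driven epidemic scenarios.

Abstract translation

We study the random time intervals between successive cases in infectious-disease transmission chains with asymptomatic carriers. From a discrete-time epidemic model in which infectiousness varies both with elapsed time and across infection phases, we derive the probability distribution of this generation time (in days). The non-Markovian model introduced is a compact recursive system characterized by random waiting times in each of the three infected stages (latent, asymptomatic, and symptomatic). By rearranging the expression for the basic reproduction number, which represents the expected number of secondary cases produced by an asymptomatic primary case who may eventually develop symptoms, we obtain the generation-time probabilities. The expected generation time is a convex combination of the expected generation times before and after symptom onset. In addition, our analysis shows that the n-th moment of the generation time is related to the moments up to order n of the weighted forward recurrence time at each phase and to the moments up to order n of the latent period and the incubation period. The weights are the infectiousness along the elapsed time in each transmission phase. Finally, assuming that infectiousness varies only across phases and that waiting times follow discrete Weibull distributions, we simulate several data-driven epidemic scenarios. Every disease analyzed except measles shows moderate variability in its generation-time distribution.

Abstract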

We study the random times between successive cases in a transmission chain of infectious diseases with asymptomatic carriers. We derive the probability distribution of this generation time (in days) from a discrete-time epidemic model with variable infectiousness both along elapsed times and across phases. The introduced non-Markovian model is a compact recursive system featuring random waiting times at each of the three infected stages: latent, asymptomatic, and symptomatic. By rearranging the terms of the basic reproduction number, which represents the expected number of secondary cases produced by an asymptomatic primary case who may eventually develop symptoms, we get to the generation-time probabilities. The expected generation time is a convex combination of the expected generation times before and after the onset of symptoms. Additionally, our analysis reveals that the n-th moment of the generation time is related to the moments up to n-th order of the weighted forward recurrence time at each phase and the moments up to n-th order of the latent period and the incubation period. These weights are the infectiousness along the elapsed times for each transmission phase. Finally, we illustrate several data-driven epidemic scenarios, assuming that infectiousness varies only across phases and discrete Weibull distributions for the waiting times. Each disease analyzed, except measles, exhibits moderate variability in its respective generation time distribution.

Keywords: generation time, asymptomatic carriers, discrete-time epidemic model, non-Markovian model, infectious diseases, transmission chain, waiting times, basic reproduction number
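
To make the link between waiting-time distributions and the generation time concrete, the sketch below computes a generation-time distribution for a deliberately simplified setting. It is not the paper's three-stage model: it assumes a single infectious stage after the latent stage, constant infectiousness within that stage, a discrete Weibull shifted to start at day 1, and a finite `horizon` over which the result is normalized. All function names and parameter values are hypothetical. With shape `beta = 1` the discrete Weibull reduces to the geometric case.

```python
def dweibull_sf(k, q, beta):
    """P(X >= k) for a discrete Weibull on k = 1, 2, ... (shifted form, assumed)."""
    return 1.0 if k <= 1 else q ** ((k - 1) ** beta)

def dweibull_pmf(k, q, beta):
    """P(X = k) as the difference of consecutive survival values."""
    return dweibull_sf(k, q, beta) - dweibull_sf(k + 1, q, beta)

def generation_time_pmf(latent, infectious, horizon):
    """Return {tau: P(G = tau)} for tau = 1..horizon.

    latent, infectious: (q, beta) parameter pairs.
    A transmission tau days after infection gets weight
      sum over latent durations l of P(L = l) * P(infectious period >= tau - l),
    i.e. the case must still be infectious on day tau (constant infectiousness).
    """
    weights = {}
    for tau in range(1, horizon + 1):
        w = 0.0
        for l in range(1, tau):  # infectious from day l + 1 onward
            w += dweibull_pmf(l, *latent) * dweibull_sf(tau - l, *infectious)
        weights[tau] = w
    total = sum(weights.values())
    return {tau: w / total for tau, w in weights.items()}

# Geometric special case (beta = 1) with hypothetical parameters.
pmf = generation_time_pmf(latent=(0.5, 1.0), infectious=(0.5, 1.0), horizon=200)
mean_g = sum(tau * p for tau, p in pmf.items())
print(mean_g)
```

In this simplified setting the generation time is the latent period plus a forward-recurrence-type time within the infectious period, which mirrors, in a much reduced form, the decomposition the abstract describes.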

306. ❌ FlowAdam: Implicit Regularization via Geometry-Aware Soft Momentum Injection

Authors: Devender Singh, Tarun Sheel Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06652v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

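Scoring rationale: The paper focuses on an optimization algorithm (FlowAdam, an improved variant of Adam) and covers gradient flow, implicit regularization, and coupled-parameter optimization. All scoring keywords relate directly to large models, deep learning technical principles, or AI-for-science applications, whereas the core of this paper is a general-purpose optimization method that involves no specific large-model architecture, training technique, inference acceleration, alignment method, agent system, or scientific application. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

To address Adam's poor performance on coupled parameters, the paper proposes FlowAdam, a hybrid optimizer that combines a gradient-flow ODE with Adam through soft momentum injection. On coupled optimization tasks it reduces held-out error by 10-22% on low-rank matrix/tensor recovery and by 6% on the Jester collaborative-filtering dataset, while matching Adam on standard tasks.

Abstract translation

Adaptive moment estimation methods such as Adam use a diagonal, coordinate-wise preconditioner based on exponential moving averages of squared gradients. This diagonal scaling depends on the coordinate system and, because it treats each parameter independently, can struggle with dense or rotated parameter couplings, including those in matrix factorization, tensor decomposition, and graph neural networks. We propose FlowAdam, a hybrid optimizer that augments Adam with continuous gradient-flow integration through an ordinary differential equation (ODE). When statistics based on exponential moving averages detect a difficult optimization landscape, FlowAdam switches to clipped ODE integration. Our core contribution is soft momentum injection, which blends the ODE velocity with Adam's momentum during mode transitions and thereby avoids the training collapse observed with naive hybrid approaches. On coupled optimization benchmarks, ODE integration provides implicit regularization, reducing held-out error by 10-22% on low-rank matrix/tensor recovery and by 6% on the Jester (real-world collaborative filtering) dataset, and surpassing tuned Lion and AdaBelief optimizers, while matching Adam on well-conditioned workloads (CIFAR-10). MovieLens-100K experiments confirm that these benefits stem from coupled parameter interactions rather than bias estimation. Ablation studies show that soft injection is essential: hard replacement lowers accuracy from 100% to 82.5%.

Abstract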

Adaptive moment methods such as Adam use a diagonal, coordinate-wise preconditioner based on exponential moving averages of squared gradients. This diagonal scaling is coordinate-system dependent and can struggle with dense or rotated parameter couplings, including those in matrix factorization, tensor decomposition, and graph neural networks, because it treats each parameter independently. We introduce FlowAdam, a hybrid optimizer that augments Adam with continuous gradient-flow integration via an ordinary differential equation (ODE). When EMA-based statistics detect landscape difficulty, FlowAdam switches to clipped ODE integration. Our central contribution is Soft Momentum Injection, which blends ODE velocity with Adam’s momentum during mode transitions. This prevents the training collapse observed with naive hybrid approaches. Across coupled optimization benchmarks, the ODE integration provides implicit regularization, reducing held-out error by 10-22% on low-rank matrix/tensor recovery and 6% on Jester (real-world collaborative filtering), also surpassing tuned Lion and AdaBelief, while matching Adam on well-conditioned workloads (CIFAR-10). MovieLens-100K confirms benefits arise specifically from coupled parameter interactions rather than bias estimation. Ablation studies show that soft injection is essential, as hard replacement reduces accuracy from 100% to 82.5%.

Keywords: FlowAdam, optimizer, Adam, gradient flow, implicit regularization, coupled parameters, soft momentum injection, ODE integration
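
The diagonal, coordinate-wise preconditioning that the abstract attributes to Adam can be seen in a few lines. The `adam_step` function below is the standard Adam update written on plain Python lists. The `soft_inject` function is only a guess at what "blending" ODE velocity with Adam's momentum could look like: a convex mix with a hypothetical coefficient `alpha`. The abstract does not give FlowAdam's actual rule, so that part should be read as an assumption.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update; returns (theta, m, v). t is the 1-based step."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]       # EMA of gradients
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]   # EMA of squared gradients
    m_hat = [mi / (1 - b1 ** t) for mi in m]                     # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    # Each coordinate is scaled by its own sqrt(v_hat): a diagonal preconditioner.
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

def soft_inject(m, ode_velocity, alpha):
    """Hypothetical blend: convex mix of Adam momentum and an ODE velocity."""
    return [(1 - alpha) * mi + alpha * vi for mi, vi in zip(m, ode_velocity)]

theta0 = [0.0, 0.0]
grad = [100.0, -0.01]  # gradient magnitudes differ by a factor of 10^4
theta1, m1, v1 = adam_step(theta0, grad, [0.0, 0.0], [0.0, 0.0], t=1)
print(theta1)  # both coordinates move by roughly lr, in opposite directions
```

Because each coordinate is normalized independently, the first step has nearly the same size in both coordinates regardless of gradient magnitude. That per-coordinate treatment is what the abstract says can struggle when parameters are densely coupled or rotated.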

307. ❌ MAT-Cell: A Multi-Agent Tree-Structured Reasoning Framework for Batch-Level Single-Cell Annotation

Authors: Yehui Yang, Zelin Zang, Changxi Chi, Jingbo Zhou, Xienan Zheng, Yuzhe Jia, Chang Yu, Jinlin Wu, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06269v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

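Scoring rationale: MAT-Cell proposes a multi-agent tree-structured reasoning framework for single-cell annotation. Its core innovation is to combine neuro-symbolic reasoning with retrieval-augmented generation (RAG) and to use a multi-agent system for dialectic verification. The work is highly relevant (10 points) to the following keywords: 1) 'Large Language Models' (the paper explicitly uses LLMs for single-cell analysis); 2) 'Retrieval-Augmented Generation' (it injects symbolic constraints through adaptive RAG); 3) 'Chain of Thought' and 'System 2 Thinking' (the framework builds verifiable proof generation and deep reasoning trees); 4) 'LLM Agents' and 'Multi-agent Systems' (it uses homogeneous rebuttal agents for auditing and pruning); 5) 'AI for Science' (it is applied to single-cell analysis in bioinformatics). Other keywords such as MoE, quantization, and RLHF are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the poor generalization of supervised methods and the noisy associations produced by LLMs in single-cell annotation, the paper proposes the MAT-Cell framework, which combines neuro-symbolic reasoning and retrieval-augmented generation with multi-agent dialectic verification, significantly improving performance on cross-species benchmarks while remaining robust.

Abstract translation

Automated cellular reasoning faces a core dichotomy: supervised methods fall into a "Reference Trap" and struggle to generalize to out-of-distribution cell states, while large language models that lack grounded biological priors suffer from a "Signal-to-Noise Paradox" and tend to produce spurious associations. We propose MAT-Cell, a neuro-symbolic reasoning framework that recasts single-cell analysis from black-box classification as a constructive, verifiable proof-generation process. The framework injects symbolic constraints through adaptive retrieval-augmented generation, grounding neural reasoning in biological axioms and reducing transcriptomic noise. It further applies a dialectic verification process involving homogeneous rebuttal agents to audit and prune reasoning paths, forming syllogistic derivation trees that enforce logical consistency. On large-scale cross-species benchmarks, MAT-Cell significantly outperforms current state-of-the-art models and remains robust in challenging scenarios where baseline methods fail badly. Code is available at https://github.com/jiangliu91/MAT-Cell-A-Multi-Agent-Tree-Structured-Reasoning-Framework-for-Batch-Level-Single-Cell-Annotation.

Abstract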

Automated cellular reasoning faces a core dichotomy: supervised methods fall into the Reference Trap and fail to generalize to out-of-distribution cell states, while large language models (LLMs), without grounded biological priors, suffer from a Signal-to-Noise Paradox that produces spurious associations. We propose MAT-Cell, a neuro-symbolic reasoning framework that reframes single-cell analysis from black-box classification into constructive, verifiable proof generation. MAT-Cell injects symbolic constraints through adaptive Retrieval-Augmented Generation (RAG) to ground neural reasoning in biological axioms and reduce transcriptomic noise. It further employs a dialectic verification process with homogeneous rebuttal agents to audit and prune reasoning paths, forming syllogistic derivation trees that enforce logical consistency. Across large-scale and cross-species benchmarks, MAT-Cell significantly outperforms state-of-the-art (SOTA) models and maintains robust performance in challenging scenarios where baseline methods severely degrade. Code is available at https://github.com/jiangliu91/MAT-Cell-A-Multi-Agent-Tree-Structured-Reasoning-Framework-for-Batch-Level-Single-Cell-Annotation.

Keywords: Multi-Agent Systems, Tree-Structured Reasoning, Retrieval-Augmented Generation, Single-Cell Annotation, Neuro-Symbolic Reasoning, LLM Agents, Biological Axioms, Dialectic Verification

308. ❌ ToxReason: A Benchmark for Mechanistic Chemical Toxicity Reasoning via Adverse Outcome Pathway

Authors: Jueon Park, Wonjune Jang, Chanhwi Kim, Yein Park, Jaewoo Kang Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06264v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

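Scoring rationale: The paper's core topic is the mechanistic reasoning ability of LLMs in chemical toxicity prediction, which is highly relevant to 'Large Language Models' (10 points); it involves the reasoning processes of 'Chain of Thought' and 'System 2 Thinking' (10 points each), addresses 'Mechanistic Interpretability' (10 points), and is an 'AI for Science' application (10 points). 'Hallucination Mitigation' has some relevance because the paper is concerned with biological faithfulness (5 points). Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

Because existing evaluations of LLMs for chemical toxicity prediction lack an assessment of mechanistic reasoning, the paper proposes ToxReason, a benchmark based on the Adverse Outcome Pathway. It finds that predictive performance does not necessarily track reasoning reliability, and that reasoning-aware training improves both mechanistic reasoning and toxicity prediction.

Abstract translation

Recent advances in large language models (LLMs) have made property prediction based on molecular reasoning possible. Toxicity, however, arises from complex biological mechanisms beyond chemical structure, so reliable prediction requires mechanistic reasoning. Although this capability is critical, existing benchmarks do not evaluate it systematically. LLMs can produce fluent explanations that lack biological faithfulness, which makes it hard to assess whether predicted toxicity rests on valid mechanisms. To fill this gap, we propose ToxReason, a benchmark built on the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels and requires models to infer both toxic outcomes and their underlying mechanisms, from the Molecular Initiating Event (MIE) to the Adverse Outcome (AO). Using ToxReason, we evaluate a range of LLMs on toxicity prediction performance and reasoning quality. We find that strong predictive performance does not necessarily imply reliable reasoning. We also show that reasoning-aware training improves mechanistic reasoning and, in turn, toxicity prediction performance. Together, these results underscore the need to integrate reasoning into both evaluation and training in order to build trustworthy toxicity models.

Abstract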

Recent advances in large language models (LLMs) have enabled molecular reasoning for property prediction. However, toxicity arises from complex biological mechanisms beyond chemical structure, necessitating mechanistic reasoning for reliable prediction. Despite its importance, current benchmarks fail to systematically evaluate this capability. LLMs can generate fluent but biologically unfaithful explanations, making it difficult to assess whether predicted toxicities are grounded in valid mechanisms. To bridge this gap, we introduce ToxReason, a benchmark grounded in the Adverse Outcome Pathway (AOP) that evaluates organ-level toxicity reasoning across multiple organs. ToxReason integrates experimental drug-target interaction evidence with toxicity labels, requiring models to infer both toxic outcomes and their underlying mechanisms from Molecular Initiating Event (MIE) to Adverse Outcome (AO). Using ToxReason, we evaluate toxicity prediction performance and reasoning quality across diverse LLMs. We find that strong predictive performance does not necessarily imply reliable reasoning. Furthermore, we show that reasoning-aware training improves mechanistic reasoning and, consequently, toxicity prediction performance. Together, these results underscore the necessity of integrating reasoning into both evaluation and training for trustworthy toxicity modeling.

Keywords: Large Language Models, Mechanistic Reasoning, Chemical Toxicity, Adverse Outcome Pathway, Benchmark Evaluation, Toxicity Prediction, Explainable AI, AI for Science

309. ❌ From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

Authors: Chuang Zhao, Hongke Zhao, Xiaofang Zhou, Xiaomeng Li Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06262v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

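Scoring rationale: The paper proposes the Dual-Stream Calibration (DSC) framework, which targets clinical reasoning tasks and centers on in-context learning (ICL) and retrieval-augmented generation (RAG), so these keywords are highly relevant (10-15 points). The paper mentions surpassing state-of-the-art fine-tuning, which relates to Post-training/SFT (8 points). The framework emphasizes deliberative reflection and internalization, which relates to Chain of Thought, System 2 Thinking, and Self-Correction (8-10 points). The application domain is clinical AI, which relates to AI for Science (10 points). The paper builds on large models but does not specify the type, so Large Language Models receives a base score (8 points). Other keywords such as MoE, Scaling Laws, and RLHF are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty models have in internalizing complex context during clinical reasoning, the paper proposes the Dual-Stream Calibration framework, which uses semantic and structural calibration streams to adjust the model's internal representations dynamically at test time, and outperforms existing methods on 13 clinical datasets.

Abstract translation

Contextual clinical reasoning requires robust inference grounded in complex, heterogeneous clinical records. Although state-of-the-art fine-tuning, in-context learning (ICL), and retrieval-augmented generation (RAG) achieve knowledge exposure, they often fail to achieve genuine contextual internalization, that is, dynamically adjusting the model's internal representations to the nuances of each individual case at inference time. To address this, we propose Dual-Stream Calibration (DSC), a test-time training framework that goes beyond surface-level knowledge exposure to achieve deep internalization during inference. DSC promotes input internalization by synergistically aligning two calibration streams. Unlike passive context exposure, the semantic calibration stream enforces deliberative reflection on core evidence, internalizing semantic anchors by minimizing entropy and thereby stabilizing generation trajectories. Meanwhile, the structural calibration stream absorbs latent inferential dependencies through an iterative meta-learning objective. By training on specialized support sets at test time, this stream lets the model bridge the gap between external evidence and internal logic and synthesize fragmented data into a coherent response. Our approach shifts the reasoning paradigm from passive attention-based matching to active refinement of the latent inferential space. Validation on thirteen clinical datasets shows that DSC is superior across three distinct task paradigms, consistently surpassing a range of state-of-the-art baselines from training-dependent models to test-time learning frameworks.

Abstract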

Contextual clinical reasoning demands robust inference grounded in complex, heterogeneous clinical records. While state-of-the-art fine-tuning, in-context learning (ICL), and retrieval-augmented generation (RAG) enable knowledge exposure, they often fall short of genuine contextual internalization: dynamically adjusting a model’s internal representations to the subtle nuances of individual cases at inference time. To address this, we propose Dual-Stream Calibration (DSC), a test-time training framework that transcends superficial knowledge exposure to achieve deep internalization during inference. DSC facilitates input internalization by synergistically aligning two calibration streams. Unlike passive context exposure, the Semantic Calibration Stream enforces a deliberative reflection on core evidence, internalizing semantic anchors by minimizing entropy to stabilize generative trajectories. Simultaneously, the Structural Calibration Stream assimilates latent inferential dependencies through an iterative meta-learning objective. By training on specialized support sets at test-time, this stream enables the model to bridge the gap between external evidence and internal logic, synthesizing fragmented data into a coherent response. Our approach shifts the reasoning paradigm from passive attention-based matching to an active refinement of the latent inferential space. Validated against thirteen clinical datasets, DSC demonstrates superiority across three distinct task paradigms, consistently outstripping state-of-the-art baselines ranging from training-dependent models to test-time learning frameworks.

Keywords: clinical reasoning, in-context learning, retrieval-augmented generation, test-time training, dual-stream calibration, internalization, semantic calibration, structural calibration
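
The abstract says the Semantic Calibration Stream minimizes entropy to stabilize generation. As a rough picture of what entropy minimization does to a predictive distribution, the sketch below runs gradient descent on the Shannon entropy of a softmax over a toy logit vector. This is not DSC's objective or training loop: a real test-time training method would update model parameters rather than raw logits, and the logits, the step size of 0.5, and the ten iterations are arbitrary assumptions.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def entropy_grad(z):
    """Gradient of H(softmax(z)) w.r.t. z: dH/dz_j = -p_j * (log p_j + H)."""
    p = softmax(z)
    h = entropy(p)
    return [-pj * (math.log(pj) + h) for pj in p]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
before = entropy(softmax(logits))
for _ in range(10):
    g = entropy_grad(logits)
    logits = [z - 0.5 * gj for z, gj in zip(logits, g)]  # descend on entropy
after = entropy(softmax(logits))
print(before, after)
```

Descending on entropy sharpens the distribution around the option that was already most likely, which is one way to read the abstract's claim that entropy minimization stabilizes generative trajectories.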

310. ❌ Explicit Electric Potential-Embedded Machine Learning Framework: A Unified Description from Atomic to Electronic Scales

Authors: Jingwen Zhou, Yawen Yu, Xuwei Liu, Chungen Liu Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

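Scoring rationale: The paper focuses on machine-learning simulation of electrochemical interfaces, using a graph neural network based on the MACE architecture (PE-MACE) and an electron density prediction model (PE-EDP). It belongs to the AI for Science field, specifically electrochemical simulation and materials science, and is highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (8 points). However, the paper does not involve large language models (LLMs), innovations in deep learning technical principles (such as MoE, Scaling Laws, or fine-tuning methods), reasoning techniques (such as CoT or MCTS), agent systems, model optimization (such as quantization or inference acceleration), or general world models, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an explicit electric potential-embedded machine learning framework that simultaneously predicts atomic forces and electron density distributions at electrochemical interfaces, and validates its accuracy and simulation capability on a Pt(111)/water interface model.

Abstract translation

To further develop accurate, large-scale simulations of electrochemical interfaces, we propose a unified explicit electric potential framework that simultaneously predicts atomic forces and electron density distributions. The framework has three components: data generation, model training, and application. The data generation module, implemented in Hy-DFT, efficiently regulates the potential during constant-potential ab initio molecular dynamics (CP-AIMD), reducing the number of single-point calculations needed for convergence. The model training component contains two modules: Potential-Embedded MACE (PE-MACE) and Potential-Embedded Electron Density Prediction (PE-EDP). PE-MACE implements an explicit electric potential machine learning force field (EEP-MLFF) based on the MACE architecture. We developed the PE-EDP module to overcome the limitation of EEP-MLFF in describing atomic forces. PE-EDP, likewise based on equivariant graph neural networks, predicts electron density distributions at arbitrary potentials. With the Pt(111)/water interface as the model system, PE-MACE and PE-EDP both show high accuracy on the training and test sets. Radial distribution functions from CP-MLMD agree closely with the CP-AIMD results, and long-timescale simulations reveal potential-induced reorganization of interfacial water molecules. The planar-integrated charge profiles and Bader analysis computed by PE-EDP are both consistent with density functional theory (DFT) results. These results show that the framework can describe atomic dynamics and electron density distributions at arbitrary potentials simultaneously, providing a powerful tool for studying electrochemical interfaces.

Abstract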

To further develop accurate and large-scale simulations of electrochemical interfaces, we propose a unified explicit electric potential framework to simultaneously predict atomic forces and electron density distributions. The framework consists of three components: data generation, model training, and application. The data generation component, implemented in Hy-DFT, efficiently regulates the potential during constant-potential ab initio molecular dynamics (CP-AIMD), reducing the number of single-point calculations required for convergence. The model training component includes two modules: Potential-Embedded MACE (PE-MACE) and Potential-Embedded Electron Density Prediction (PE-EDP). PE-MACE implements an explicit electric potential machine learning force field (EEP-MLFF) based on the MACE architecture. We develop PE-EDP to overcome the limitation of EEP-MLFF in describing atom forces. PE-EDP, also based on equivariant graph neural networks, predicts electron density distributions under arbitrary potentials. Using the Pt(111)/water interface as a model system, both PE-MACE and PE-EDP show high accuracy on training and test sets. Radial distribution functions from CP-MLMD agree well with CP-AIMD, and long-timescale simulations reveal potential-induced reorganization of interfacial water. Planar-integrated charge profiles and Bader analysis from PE-EDP are consistent with DFT results. These results demonstrate that the framework can simultaneously describe atomic dynamics and electron density distributions under arbitrary potentials, providing a useful tool for studying electrochemical interfaces.

Keywords: explicit electric potential, machine learning force field, electron density prediction, electrochemical interfaces, MACE architecture, equivariant graph neural networks, CP-AIMD, Pt(111)/water interface

311. ❌ Self-consistent Hessian-level meta-generalized gradient approximation

Authors: Pooria Dabbaghi, Juan Maria García Lastra, Piotr de Silva Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.07046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

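Scoring rationale: The paper studies Hessian-level meta-generalized gradient approximations (HL-MGGAs) in density functional theory. It belongs to computational chemistry and materials science and has no connection to deep learning or large-model techniques. The only keyword that could apply is 'AI for Science OR Bioinformatics OR Cheminformatics', since the work is a computational-science application; however, the paper uses no AI or machine-learning methods and relies on conventional physical models, so it receives a relevance of 5 (some relevance). All other keywords concern large models and deep learning, whereas this paper focuses on quantum-chemical calculations, so each receives 0.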
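As a rough illustration of why a relevance of 5.0/10 still yields a total of 0.0, the following sketch recomputes the pass mark from the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and a paper total. The per-keyword rule `weight * relevance` is an assumption that is merely consistent with the rows above; the report does not state its exact formula, and the function names are hypothetical.

```python
def pass_threshold(total_weight: float) -> float:
    """Pass mark from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum per-keyword scores; weight * relevance is an ASSUMED rule."""
    return sum(weight * relevance for weight, relevance in rows)


# 26 keywords with relevance 0.0 and one with relevance 5.0, all at weight 0.0,
# mirroring the table for this paper.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
threshold = pass_threshold(27.0)
total = paper_score(rows)
passed = total >= threshold
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under this assumed rule, a zero weight removes the keyword's contribution regardless of its relevance, which matches the 0.0 shown in the last column.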

!!! tip deepseek-chat TL;DR

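The paper reformulates the ϑ-MGGA class of density functionals as Hessian-level meta-generalized gradient approximations (HL-MGGAs) and proposes the non-empirical ϑ-PBE functional. Through a self-consistent implementation, it demonstrates the physical utility and feasibility of orbital-independent, Hessian-based exchange-correlation functionals, whose descriptor distinguishes between distinct one-electron density limits. On molecular and solid-state datasets the functional gives accurate chemisorption energies and molecular properties, but predicting bulk lattice constants remains challenging.

Abstract translation (Chinese)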

$\vartheta$-MGGA 类密度泛函被形式化重构为基于海森矩阵的元广义梯度近似（HL-MGGAs）。与依赖轨道相关动能密度或密度拉普拉斯算符的标准元广义梯度近似不同，HL-MGGAs 利用了完整的密度海森矩阵。我们引入了一种简化的非经验性泛函 $\vartheta$-PBE，并提出了在投影缀加波（PAW）方法中实现其自洽计算的路线图。通过利用完整的空间二阶密度导数集合，该泛函的底层描述符成功区分了不同的单电子密度极限（例如单中心原子密度与双中心化学键），而标准的等轨道指示符常将这些情形混淆。对分子和固态数据集的基准测试表明，虽然 $\vartheta$-PBE 能提供准确的化学吸附能与分子性质，但在预测体相晶格常数方面仍存在挑战。最终，本工作证明了设计不依赖于轨道、基于海森矩阵的交换相关泛函的物理实用性与可行性。

Abstract

The $\vartheta$-MGGA class of density functionals is formally reformulated as Hessian-level meta-generalized gradient approximations (HL-MGGAs). In contrast to standard meta-GGAs that rely on the orbital-dependent kinetic-energy density or the density Laplacian, HL-MGGAs utilize the full density Hessian. We introduce a simplified, non-empirical functional, $\vartheta$-PBE, and present a roadmap for its self-consistent implementation within the projector augmented-wave (PAW) method. By utilizing the complete set of spatial second-order density derivatives, the functional’s underlying descriptor successfully distinguishes between distinct one-electron density limits, such as single-center atomic densities and two-center bonds, that standard iso-orbital indicators often conflate. Benchmarks across molecular and solid-state datasets reveal that while $\vartheta$-PBE delivers accurate chemisorption energies and molecular properties, challenges remain in predicting bulk lattice constants. Ultimately, this work demonstrates the physical utility and feasibility of designing orbital-independent, Hessian-based exchange-correlation functionals.

Keywords: density functional theory, meta-generalized gradient approximations, Hessian-level MGGAs, ϑ-PBE functional, self-consistent implementation, orbital-independent functionals, density Hessian, exchange-correlation functionals

312. ❌ Development of ab initio Hubbard parameter calculation schemes in the k-point sampling real-time TDDFT program in CP2K

Authors: Kota Hanasaki, Sandra Luber Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06927v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

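Scoring rationale: The paper belongs to computational chemistry: it implements ab initio Hubbard parameter calculation schemes in the CP2K software, in particular a linear-response-based method. The content is unrelated to deep learning or large-model techniques, and none of the technical keywords (LLMs, MoE, RLHF, and so on) apply. The only keyword that could apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work is computational chemistry (which can be regarded as a subfield of AI applications in scientific computing); however, the paper does not explicitly use AI or machine-learning methods and is based on conventional density functional theory (DFT) and linear-response theory, so it receives a relevance of 5 (some relevance). All other keywords receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper implements ab initio Hubbard parameter calculation schemes in the k-point sampling real-time TDDFT program in CP2K and proposes a new linear-response-based scheme for energy-dependent Hubbard parameters, extending the minimum-tracking linear-response method so that the parameters reflect exchange-correlation effects.

Abstract translation (Chinese)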

我们在CP2K软件的k点采样实时含时密度泛函理论（RT-TDDFT）程序中实现了从头算哈伯德参数计算方案。我们提出了一种新的基于线性响应的能量依赖哈伯德参数计算方案。该方案扩展了[Moynihan等人，arXiv预印本arXiv:1704.08076(2017)；E. B. Linscott等人，Phys. Rev. B 98, 235157 (2018)]中提出的最小追踪线性响应方法，以实现计算能量依赖的哈伯德参数，这些参数反映了交换关联（xc）泛函中所包含的交换关联效应。
我们讨论了最小追踪线性响应方法的特性，并与另一种有前景的方案ACBN0 [Agapito等人，Phys. Rev. X, 5, 011006 (2015)]进行了比较。研究表明，尽管在静态性质计算的准确性上两者没有明显优劣之分，但根据各自的理论框架，每种方法都有其独特的动力学应用场景。

Abstract

We implemented ab initio Hubbard parameter calculation schemes in the k-point sampling real-time TDDFT (RT-TDDFT) program in CP2K. We propose a new linear-response-based calculation scheme for energy-dependent Hubbard parameters. Our scheme extends the minimum-tracking linear-response method proposed in [Moynihan et al., arXiv preprint arXiv:1704.08076(2017); E. B. Linscott et al., Phys. Rev. B 98, 235157 (2018)] to realize the calculation of energy-dependent Hubbard parameters that reflect the exchange-correlation (xc) effects included in the xc-functional. We discuss the properties of the minimum-tracking linear-response method in comparison to another promising scheme, ACBN0 [Agapito et al., Phys. Rev. X, 5, 011006 (2015)]. We show that, while neither clearly outperforms the other in the accuracy of static property calculations, each has a distinct dynamical application depending on its theoretical formulation.

Keywords: ab initio Hubbard parameters, real-time TDDFT, CP2K, linear-response method, k-point sampling, exchange-correlation effects, minimum-tracking, energy-dependent parameters

313. ❌ Spin-adapted neural network backflow for strongly correlated electrons

Authors: Yunzhi Li, Zibo Wu, Bohan Zhang, Wei-Hai Fang, Zhendong Li Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06841v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

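Scoring rationale: The paper concerns neural-network variational wavefunction methods for quantum chemistry and strongly correlated electron systems. Its core contribution is a spin-adapted neural network backflow (SA-NNBF) method for accurately simulating systems such as transition metal complexes. The content is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which refer specifically to large language model technology in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to scientific computing (specifically quantum chemistry); it is not a typical bioinformatics or cheminformatics application, however, so it receives a relevance of 5 (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a spin-adapted neural network backflow (SA-NNBF) method that resolves the spin contamination of neural-network wavefunctions in strongly correlated electron systems. For molecular systems including the FeMo-cofactor of nitrogenase, it achieves higher accuracy at lower computational cost than existing methods.

Abstract translation (Chinese)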

要精确描述过渡金属配合物等体系中的强关联电子，必须严格遵循自旋对称性，而这一特性在现代基于神经网络的变分波函数中基本缺失。这种缺陷可能导致在模拟近简并自旋态体系时产生严重的自旋污染。为解决这一局限，我们提出了一种自旋适配的神经网络回流（SA-NNBF）拟设，该拟设以二次量子化形式表述，适用于费米子晶格模型和从头算量子化学。我们的方法通过将神经网络回流空间分量与以乘积和形式表达的自旋本征函数相结合，构建出完全反对称的波函数。为应对自旋适配的计算复杂性，我们引入了针对自旋本征函数的张量压缩算法，以及一种基于二次量子化中粒子-空穴对偶性的更紧凑波函数表示。这些进展使得利用SA-NNBF进行变分蒙特卡洛计算成为可能，可应用于具有超过一百个电子的挑战性分子体系，包括固氮酶中的铁钼辅因子（FeMoco）。在典型强关联分子上的应用表明，在参数数量相近的情况下，SA-NNBF始终优于标准神经网络回流（NNBF）。此外，对于FeMoco体系，SA-NNBF在显著减少计算资源的同时，其精度超越了当前最先进的自旋适配密度矩阵重正化群（SA-DMRG）算法。我们的工作为探索相互作用费米子问题中完全保持对称性的神经网络量子态奠定了理论基础。

Abstract

Accurately describing strongly correlated electrons in systems such as transition metal complexes requires strict adherence to spin symmetry, a feature largely absent in modern neural-network-based variational wavefunctions. This deficiency can lead to severe spin contamination in simulating systems with near-degenerate spin states. To resolve this limitation, we present a spin-adapted neural network backflow (SA-NNBF) ansatz, formulated in second quantization for fermionic lattice models and ab initio quantum chemistry. Our approach constructs a fully antisymmetric wavefunction by combining a neural-network backflow spatial component with a spin eigenfunction expressed in a sum-of-products form. To address the computational complexity of spin adaptation, we introduce a tensor compression algorithm for spin eigenfunctions, and a more compact wavefunction representation based on the particle-hole duality in second quantization. These advancements enable variational Monte Carlo calculations using SA-NNBF for challenging molecular systems with more than one hundred electrons, including the FeMo-cofactor (FeMoco) in nitrogenase. Applications to prototypical strongly correlated molecules demonstrate that SA-NNBF consistently outperforms standard NNBF with a similar number of parameters. Furthermore, it surpasses the accuracy of the state-of-the-art spin-adapted density matrix renormalization group (SA-DMRG) algorithm for FeMoco with a significantly reduced computational resource. Our work establishes a foundational framework for exploring fully symmetry-preserving neural-network quantum states for interacting fermion problems.

Keywords: spin-adapted neural network backflow, strongly correlated electrons, variational Monte Carlo, quantum chemistry, spin symmetry, FeMo-cofactor, neural-network quantum states, fermionic lattice models
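To give a feel for why spin adaptation is computationally demanding, the number of linearly independent spin eigenfunctions for N singly occupied orbitals with total spin S follows the standard branching-diagram formula f(N, S) = C(N, N/2 − S) − C(N, N/2 − S − 1). The sketch below evaluates it; this is textbook spin algebra rather than anything taken from the paper, and the function name is hypothetical.

```python
from math import comb


def num_spin_eigenfunctions(n_electrons: int, twice_s: int) -> int:
    """Count spin eigenfunctions for n singly occupied orbitals with S = twice_s / 2."""
    if twice_s < 0 or twice_s > n_electrons or (n_electrons - twice_s) % 2:
        return 0  # this (N, S) combination cannot occur
    n_down = (n_electrons - twice_s) // 2  # down spins in the M_S = S component
    lower = comb(n_electrons, n_down - 1) if n_down > 0 else 0
    return comb(n_electrons, n_down) - lower


# Singlet (S = 0) counts for a growing number of open-shell electrons
for n in (2, 4, 8, 16):
    print(n, num_spin_eigenfunctions(n, 0))
```

The singlet counts grow rapidly with the number of open-shell electrons, which is one way to see why a compressed representation of spin eigenfunctions becomes attractive for large systems.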

314. ❌ A Massively Scalable Ligand-Protein Dissociation Dynamic Database Derived from Atomistic Molecular Modelling

Authors: Maodong Li, Dechin Chen, Zhijun Pan, Zhe Wang, Yi Isaac Yang Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

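Scoring rationale: The paper belongs to computational biology and drug design. Using large-scale molecular dynamics simulations, it builds a ligand-protein dissociation dynamics database (DD-03B) intended to support the training and benchmarking of next-generation generative AI models. The content is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agent systems, and so on), which describe technical details of large language models and deep learning that the paper does not address. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper clearly falls within bioinformatics and AI for science, and the end of the abstract states that the database provides a critical foundation for training and benchmarking next-generation generative AI models, so it receives a relevance of 10 (highly relevant, core application area).

!!! tip deepseek-chat TL;DR

The study addresses the lack of large-scale dynamic data for understanding the kinetics of drug-protein interactions in drug design. It builds a large-scale database (DD-03B) of dissociation trajectories for 19,037 ligand-protein complexes, providing a key resource for training and benchmarking next-generation generative AI models that predict and optimize drug dissociation kinetics.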

Abstract translation (Chinese)

理解药物-蛋白质相互作用的动力学对于药物设计至关重要，然而该领域缺乏大规模动态数据以超越静态结构分析。本文介绍DD-03B——一个高度可扩展的数据库，为广泛的配体-蛋白质复合物提供动态全原子解离轨迹。通过运用并扩展经过验证的计算流程，我们对源自PDBbind+v2020R1的19,037个配体-蛋白质复合物生成了解离轨迹，构建了包含约3亿模拟帧、总容量达40 TB的数据仓库。针对这些具有实验结合亲和力（kd）但通常缺乏实测解离速率（koff）的体系，我们通过轨迹重加权计算并分配了解离速率常数。分析表明，蛋白质-配体复合物可分为三种机制类型（通路主导型、开放口袋型和熵口袋型系统），每种类型都需要采用不同的策略进行精确动力学表征。结合先前发布的DD-13M，DD-03B构成了可扩展解离动力学数据库（DDD）项目的核心，该数据库将持续通过新轨迹进行扩充。这一大规模公开资源为训练和基准测试新一代生成式人工智能模型奠定了关键基础，以预测和优化药物-蛋白质解离动力学。

Abstract

Understanding the kinetics of drug-protein interactions is paramount for drug design, yet the field lacks large-scale, dynamic data to move beyond static structural analysis. Here, we present DD-03B, a massively scalable database providing dynamic, all-atom dissociation trajectories for a broad set of ligand-protein complexes. Utilising and extending a validated computational pipeline, we generated dissociation trajectories for 19,037 ligand-protein complexes sourced from PDBbind+v2020R1, resulting in a repository of approximately 0.3 billion simulation frames totalling 40 TB in size. For these systems, which possess experimental binding affinities (kd) but typically lack measured koff rates, we computed and assigned dissociation rate constants through trajectory reweighting. Our analysis reveals that protein-ligand complexes can be categorised into three mechanistic types (pathway-dominant, open-pocket, and entropy-pocket systems), each requiring distinct strategies for accurate kinetic characterisation. Together with our previously released DD-13M, DD-03B forms the core of the expandable Dissociation Dynamic Database (DDD) project, which will be continuously augmented with new trajectories. This large-scale, publicly available resource establishes a critical foundation for training and benchmarking next-generation generative AI models to predict and optimise drug-protein dissociation kinetics.

Keywords: ligand-protein dissociation, molecular dynamics simulation, drug design, kinetics, large-scale database, generative AI models, binding affinity, dissociation rate constants
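The relation between the quantities named in the abstract can be made concrete with a small calculation. For a simple one-step 1:1 binding model, K_d = k_off / k_on, the residence time is 1 / k_off, and the standard binding free energy is ΔG° = RT ln(K_d / 1 M). The sketch below applies these textbook relations to made-up numbers; the values and the function name are illustrative assumptions and are not taken from DD-03B.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol*K)


def binding_kinetics(kd_molar: float, koff_per_s: float, temperature_k: float = 298.15) -> dict:
    """Derive k_on, residence time and binding free energy for one-step 1:1 binding."""
    return {
        # from K_d = k_off / k_on
        "kon_per_molar_s": koff_per_s / kd_molar,
        # mean lifetime of the bound state
        "residence_time_s": 1.0 / koff_per_s,
        # RT ln(K_d) with a 1 M standard state
        "delta_g_kj_per_mol": R_KJ_PER_MOL_K * temperature_k * math.log(kd_molar),
    }


# Hypothetical ligand: K_d = 1 nM, k_off = 1e-3 s^-1
result = binding_kinetics(kd_molar=1e-9, koff_per_s=1e-3)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

Two ligands with the same K_d can differ widely in k_off, which is why an affinity measurement alone does not fix the dissociation rate.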

315. ❌ Projector, Neural, and Tensor-Network Representations of $\mathbb{Z}_N$ Cluster and Dipolar-cluster SPT States

Authors: Seungho Lee, Daesik Kim, Hyun-Yong Lee, Jung Hoon Han Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

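Scoring rationale: The paper studies representations of symmetry-protected topological (SPT) states, including the projector representation, neural quantum states (NQS), matrix product states (MPS), and tensor product states (TPS), and belongs to condensed matter physics and quantum computing. All keywords concern large language models, deep learning principles, or AI applications, whereas the paper involves no large models and no trained deep-learning system: its neural quantum state is built on the restricted Boltzmann machine form with weights derived in closed form rather than learned, so the link to deep learning is limited to the shared network structure. The only keyword that could apply is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is basic scientific research (physics), but it does not train or apply an AI model, so it receives a relevance of 5 (some relevance). All other keywords are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper studies efficient representations of $\mathbb{Z}_N$ cluster states and dipolar cluster states, presenting several equivalent representations (the projector-based $P$-representation, neural quantum states, matrix product states, and tensor product states) and generalizing the construction of the Kramers-Wannier operator.

Abstract translation (Chinese)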

$\mathbb{Z}_N$簇态波函数是具有$\mathbb{Z}_N \times \mathbb{Z}_N$对称性的对称保护拓扑（SPT）序的一个范例，它可以通过多种等价方式表达。我们确认基于投影子的方案——即$P$表示——是表达簇态与偶极簇态波函数的有效方法。采用受限玻尔兹曼机方案，将$P$表示中的相互作用矩阵用神经权重矩阵重写，使我们能够构建同一状态的神经量子态（NQS）与矩阵乘积态（MPS）表示。NQS与MPS表示的区别仅在于权重矩阵在矩阵乘积中被拆分和组合的方式。对于$\mathbb{Z}_N$簇态与偶极簇态，我们以闭合形式推导了耦合物理自旋$s$与隐变量$h$的权重函数$W(s,h)$，从而将先前针对$Z_2$簇态的构造推广至$\mathbb{Z}_N$情形。对于受两个电荷对称性与两个偶极对称性保护的偶极簇态，我们所发展的方法导出了波函数的张量乘积态（TPS）表示，其中每个局域张量携带三个虚拟指标，分别连接给定格点与其两个最近邻和一个次近邻。我们通过密度矩阵重整化群模拟将所得TPS构造与常规MPS表示进行基准比较，并论证TPS可能为某些调制的SPT态提供更高效的表示。作为研究的一个副产品，我们将先前$Z_2$情形下的Kramers-Wannier（KW）算符的矩阵乘积算符构造推广至$\mathbb{Z}_N$，并将其解释为$\mathbb{Z}_N$变量上离散傅里叶变换的偶极化推广。这一新解释自然地说明了KW映射为何不可逆。

Abstract

The $\mathbb{Z}_N$ cluster-state wavefunction, a paradigmatic example of symmetry-protected topological (SPT) order with $\mathbb{Z}_N \times \mathbb{Z}_N$ symmetry, is expressed in various equivalent ways. We identify the projector-based scheme called the $P$-representation as the efficient way to express cluster and dipolar cluster state’s wavefunctions. Employing the restricted Boltzmann machine scheme to re-write the interaction matrix in the $P$-representation in terms of neural weight matrices allows us to develop the neural quantum state (NQS) and the matrix product state (MPS) representations of the same state. The NQS and MPS representations differ only in the way the weight matrices are split and grouped together in a matrix product. For both $\mathbb{Z}_N$ cluster and dipolar cluster states, we derive in closed form the weight function $W(s,h)$ that couples physical spins $s$ to hidden variables $h$, generalizing the previous construction for $Z_2$ cluster states to $\mathbb{Z}_N$. For the dipolar cluster state protected by two charge and two dipole symmetries, the procedure we have developed leads to the tensor product state (TPS) representation of the wavefunction where each local tensor carries three virtual indices connecting a given site to two nearest neighbors and one further neighbor. We benchmark the resulting TPS construction against conventional MPS representation using density-matrix renormalization group simulations and argue that the TPS could offer a more efficient representation for some modulated SPT states. As a by-product of the investigation, we generalize the previous $Z_2$ matrix product operator construction of the Kramers-Wannier (KW) operator to $\mathbb{Z}_N$ and interpret it as the dipolar generalization of the discrete Fourier transform on $\mathbb{Z}_N$ variables. The new interpretation naturally explains why the KW map is non-invertible.

Keywords: symmetry-protected topological order, cluster state, neural quantum state, matrix product state, tensor product state, Z_N symmetry, dipolar cluster state, Kramers-Wannier operator
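For readers unfamiliar with the restricted Boltzmann machine (RBM) form behind neural quantum states, the sketch below evaluates a generic RBM amplitude with ±1 hidden units in two ways: by brute-force summation over the hidden variables, and by the closed-form product of hyperbolic cosines. It uses arbitrary illustrative weights with binary spins and does not reproduce the paper's $\mathbb{Z}_N$ weight function $W(s,h)$; all names are hypothetical.

```python
import math
from itertools import product


def rbm_amplitude_bruteforce(spins, weights, hidden_bias):
    """psi(s) = sum over h in {-1,+1}^M of exp(sum_j h_j * (b_j + sum_i W[i][j] * s_i))."""
    total = 0.0
    for hidden in product((-1, 1), repeat=len(hidden_bias)):
        exponent = 0.0
        for j, h in enumerate(hidden):
            field = hidden_bias[j] + sum(weights[i][j] * s for i, s in enumerate(spins))
            exponent += h * field
        total += math.exp(exponent)
    return total


def rbm_amplitude_closed_form(spins, weights, hidden_bias):
    """Same amplitude with hidden units summed analytically: product of 2*cosh(field_j)."""
    amplitude = 1.0
    for j, bias in enumerate(hidden_bias):
        field = bias + sum(weights[i][j] * s for i, s in enumerate(spins))
        amplitude *= 2.0 * math.cosh(field)
    return amplitude


# Arbitrary example: 3 visible spins, 2 hidden units
weights = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.1]]
hidden_bias = [0.05, -0.2]
spins = (1, -1, 1)
print(rbm_amplitude_bruteforce(spins, weights, hidden_bias))
print(rbm_amplitude_closed_form(spins, weights, hidden_bias))
```

Because each hidden unit couples only to visible spins, the sum over hidden variables factorizes, which is what makes closed-form weight constructions of this kind tractable.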

316. ❌ The effects of dispersion damping and three-body interactions for accurate layered-material exfoliation energies

Authors: Adrian F. Rumson, Kyle R. Bryenton, Erin R. Johnson Journal/Source: arxiv Published: 2026-04-08 arXiv link: http://arxiv.org/abs/2604.06539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

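Scoring rationale: The paper belongs to computational materials science. It studies how dispersion corrections in density-functional theory (the XDM model) and their damping functions affect the accuracy of predicted exfoliation energies of layered materials, and it evaluates the improvement from including three-body interactions (the ATM term). The content is unrelated to the vast majority of keywords (large models, deep learning, training techniques, inference optimization, alignment, agents, and so on), which all belong to AI and machine learning, whereas this is purely computational chemistry and physics research. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the work can be placed under a broad reading of "AI for Science" as computational science (using computational methods to solve scientific problems); the paper itself neither uses nor mentions any AI or machine-learning model and is based on first-principles calculations, so the relevance is weak and it receives 5 (some relevance).

!!! tip deepseek-chat TL;DR

The study evaluates how accurately the XDM(Z) damping function, within density-functional theory, predicts exfoliation energies and lattice constants of layered materials. It finds that adding three-body interactions (the ATM term) further improves both XDM(BJ) and XDM(Z), giving the best results to date on the LM26 benchmark among semi-local functionals.

Abstract translation (Chinese)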

对层状材料剥离能与晶格常数的精确预测，取决于对伦敦色散物理的正确描述。密度泛函理论（DFT）中的现代后验色散修正方法，例如交换空穴偶极矩（XDM）模型，在长程区域捕捉了正确的渐近行为，同时利用阻尼函数来防止短程区域出现非物理发散。在联合原子极限下，色散能既可通过经典的Becke-Johnson（BJ）阻尼函数，也可通过新型Z阻尼函数被阻尼至一个有限且非零的值。XDM(BJ)方法先前已在层状材料建模中展现出卓越的准确性，例如在包含石墨、六方氮化硼、氧化铅(II)和过渡金属二硫族化物的LM26基准测试中。本研究首次在同一基准上评估了XDM(Z)方法的性能。我们还表明，通过引入Axilrod-Teller-Muto（ATM）项来包含三体相互作用，能进一步改善XDM(BJ)和XDM(Z)计算得到的剥离能，从而在使用半局域泛函的条件下，取得了迄今为止在LM26基准上的最佳性能。

Abstract

Accurate predictions of exfoliation energies and lattice constants of layered materials hinge on a correct description of London dispersion physics. Modern a posteriori dispersion corrections in density-functional theory (DFT), such as the exchange-hole dipole moment (XDM) model, capture the proper asymptotic behaviour at long range while making use of damping functions to prevent unphysical divergence at short range. In the united-atom limit, the dispersion energy is damped to a finite, non-zero value by both the canonical Becke–Johnson (BJ) damping function and the new Z-damping function. XDM(BJ) has previously demonstrated exceptional accuracy for modelling layered materials, such as in the LM26 benchmark, which includes graphite, hexagonal boron nitride, lead(II) oxide, and transition-metal dichalcogenides. This work presents the first assessment of XDM(Z) on the same benchmark. We also show that inclusion of three-body interactions via the Axilrod–Teller–Muto (ATM) term further improves the computed exfoliation energies for both XDM(BJ) and XDM(Z), yielding the best performance achieved on LM26 using semi-local functionals to date.

Keywords: exfoliation energies, layered materials, density-functional theory, dispersion corrections, XDM model, damping functions, three-body interactions, LM26 benchmark
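The three-body term mentioned in the abstract has a compact standard form. For a triplet of atoms with interior angles θ_a, θ_b, θ_c and side lengths r_ab, r_bc, r_ca, the Axilrod-Teller-Muto energy is E = C9 (1 + 3 cos θ_a cos θ_b cos θ_c) / (r_ab r_bc r_ca)^3. The sketch below evaluates this undamped expression for a single triplet; the C9 value and the geometries are illustrative, and any damping or parameterization used in the paper is not reproduced.

```python
import math


def atm_energy(pos_a, pos_b, pos_c, c9=1.0):
    """Undamped Axilrod-Teller-Muto energy for one atom triplet (arbitrary units)."""
    r_ab = math.dist(pos_a, pos_b)
    r_bc = math.dist(pos_b, pos_c)
    r_ca = math.dist(pos_c, pos_a)
    # Interior-angle cosines from the law of cosines
    cos_a = (r_ab**2 + r_ca**2 - r_bc**2) / (2 * r_ab * r_ca)
    cos_b = (r_ab**2 + r_bc**2 - r_ca**2) / (2 * r_ab * r_bc)
    cos_c = (r_bc**2 + r_ca**2 - r_ab**2) / (2 * r_bc * r_ca)
    return c9 * (1 + 3 * cos_a * cos_b * cos_c) / (r_ab * r_bc * r_ca) ** 3


# Equilateral triangle with unit sides, and three collinear atoms with unit spacing
equilateral = atm_energy((0, 0, 0), (1, 0, 0), (0.5, math.sqrt(3) / 2, 0))
linear = atm_energy((0, 0, 0), (1, 0, 0), (2, 0, 0))
print(equilateral, linear)
```

The sign depends on geometry: with a positive C9 the term is repulsive for the equilateral arrangement and attractive for the collinear one, so its net effect on a layered solid depends on how triplets are distributed within and between layers.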

317. ❌ Coupled-Cluster Imaginary-Time Evolution and the Coupled-Cluster Energy Variance

Authors: Yuhang Ai, Huanchen Zhai, Garnet Kin-Lic Chan Journal/Source: arxiv Published: 2026-04-07 arXiv link: http://arxiv.org/abs/2604.06429v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

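Scoring rationale: The paper studies coupled-cluster theory in quantum chemistry, specifically computational-chemistry concepts such as imaginary-time evolution and the energy variance. All scoring keywords concern large language models, deep learning techniques, or AI applications, whereas this paper is purely theoretical chemical physics and is unrelated to AI, machine learning, or large-model techniques, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents a coupled-cluster formalism based on imaginary-time evolution and introduces the coupled-cluster energy variance to identify physically regularized amplitudes, addressing cases where the standard coupled-cluster amplitude equations have no finite or reasonable solution.

Abstract translation (Chinese)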

本文探讨了一种基于任意参考态进行虚时间演化的耦合簇形式理论，并研究了由此产生的演化轨迹特性。若存在有限值极限，该演化在长时间极限下会收敛至标准耦合簇振幅方程的解；但当此类极限不存在时，演化轨迹仍能提供超越标准解的额外信息。我们引入了耦合簇能量方差的概念，当振幅方程的解不合理时，该方差可通过其极小值点识别出经物理正则化的耦合簇振幅。我们通过单参考态与多参考态耦合簇框架下的若干探索性算例，验证了该形式理论的价值。

Abstract

We discuss a coupled-cluster formalism for carrying out imaginary-time evolution from an arbitrary reference, and study the properties of the resulting evolution trajectories. The evolution converges to a solution of the standard coupled-cluster amplitude equations in the long-time limit if a finite valued limit exists, but when such a limit does not exist, the trajectories still contain additional information beyond the standard solutions. We introduce the coupled-cluster energy variance which through its minima identifies physically regularized coupled-cluster amplitudes when the solutions of the amplitude equations are unreasonable. We demonstrate the value of this formalism in several exploratory examples within single- and multi-reference coupled-cluster formulations.

Keywords: coupled-cluster, imaginary-time evolution, energy variance, amplitude equations, single-reference, multi-reference, quantum chemistry, theoretical chemistry
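The idea of imaginary-time evolution can be illustrated on a toy problem. The sketch below repeatedly applies a first-order step ψ ← ψ − Δτ·Hψ, followed by normalization, to a 2×2 Hamiltonian and then evaluates the energy and the ordinary Hermitian variance ⟨H²⟩ − ⟨H⟩², which vanishes at an eigenstate. This is plain linear algebra on an assumed model matrix, not the coupled-cluster formalism or the coupled-cluster energy variance defined in the paper.

```python
import math

H = [[0.0, -1.0], [-1.0, 1.0]]  # toy Hamiltonian (assumed for illustration)


def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(u, v):
    """Real inner product."""
    return sum(a * b for a, b in zip(u, v))


def evolve(psi, d_tau=0.1, steps=200):
    """First-order imaginary-time steps with renormalization after each step."""
    for _ in range(steps):
        h_psi = matvec(H, psi)
        psi = [p - d_tau * hp for p, hp in zip(psi, h_psi)]
        norm = math.sqrt(dot(psi, psi))
        psi = [p / norm for p in psi]
    return psi


psi = evolve([1.0, 0.0])
h_psi = matvec(H, psi)
energy = dot(psi, h_psi)
variance = dot(h_psi, h_psi) - energy**2  # <H^2> - <H>^2 for real symmetric H
print(f"energy = {energy:.6f}, variance = {variance:.2e}")
```

In this Hermitian toy case the evolution always has a finite limit; the situation discussed in the abstract, where no finite limit exists, is specific to the coupled-cluster parameterization and is not captured here.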

Token usage statistics

Total: 1,011,032 tokens (input 696,330 / output 314,702)

Model	Input	Output	Total
deepseek-chat	570,087	314,702	884,789
glm-4.7	126,243	0	126,243

📋 All papers

1. ✅ Argus: Reorchestrating Static Analysis via a Multi-Agent Ensemble for Full-Chain Security Vulnerability Detection

2. ✅ Chatbot-Based Assessment of Code Understanding in Automated Programming Assessment Systems

3. ✅ Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

4. ✅ TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

5. ✅ TurboAgent: An LLM-Driven Autonomous Multi-Agent Framework for Turbomachinery Aerodynamic Design

6. ✅ ReDAct: Uncertainty-Aware Deferral for LLM Agents

7. ✅ When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don’t

8. ✅ Does a Global Perspective Help Prune Sparse MoEs Elegantly?

9. ✅ A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

10. ✅ Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models

11. ✅ StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

12. ✅ Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing

13. ✅ An empirical study of LoRA-based fine-tuning of large language models for automated test case generation

14. ✅ SentinelSphere: Integrating AI-Powered Real-Time Threat Detection with Cybersecurity Awareness Training

15. ✅ Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

16. ✅ Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

17. ❌ Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

18. ❌ Steering the Verifiability of Multimodal AI Hallucinations

19. ❌ FedDetox: Robust Federated SLM Alignment via On-Device Data Sanitization

20. ❌ Restoring Heterogeneity in LLM-based Social Simulation: An Audience Segmentation Approach

21. ❌ Continuous Interpretive Steering for Scalar Diversity

22. ❌ STQuant: Spatio-Temporal Adaptive Framework for Optimizer Quantization in Large Multimodal Model Training

23. ❌ Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

24. ❌ On the Price of Privacy for Language Identification and Generation

25. ❌ SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

26. ❌ EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling

27. ❌ From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

28. ❌ Sparse-Aware Neural Networks for Nonlinear Functionals: Mitigating the Exponential Dependence on Dimension

29. ❌ Fast Spatial Memory with Elastic Test-Time Training

30. ❌ How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

31. ❌ Toward a Tractability Frontier for Exact Relevance Certification

32. ❌ RoSHI: A Versatile Robot-oriented Suit for Human Data In-the-Wild

33. ❌ MoRight: Motion Control Done Right

34. ❌ Syntax Is Easy, Semantics Is Hard: Evaluating LLMs for LTL Translation

35. ❌ Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

36. ❌ Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

37. ❌ CADENCE: Context-Adaptive Depth Estimation for Navigation and Computational Efficiency

38. ❌ Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

39. ❌ Making Room for AI: Multi-GPU Molecular Dynamics with Deep Potentials in GROMACS

40. ❌ Validated Intent Compilation for Constrained Routing in LEO Mega-Constellations

41. ❌ Designing Safe and Accountable GenAI as a Learning Companion with Women Banned from Formal Education

42. ❌ $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

43. ❌ How Much LLM Does a Self-Revising Agent Actually Need?

44. ❌ Mixture Proportion Estimation and Weakly-supervised Kernel Test for Conditional Independence

45. ❌ The ATOM Report: Measuring the Open Language Model Ecosystem

46. ❌ TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification

47. ❌ Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization

48. ❌ Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

49. ❌ Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations

50. ❌ Dynamic Context Evolution for Scalable Synthetic Data Generation

51. ❌ Energy Saving for Cell-Free Massive MIMO Networks: A Multi-Agent Deep Reinforcement Learning Approach

52. ❌ CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research

53. ❌ Self-Discovered Intention-aware Transformer for Multi-modal Vehicle Trajectory Prediction

54. ❌ Mixed-Initiative Context: Structuring and Managing Context for Human-AI Collaboration

55. ❌ Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment

56. ❌ Information as Structural Alignment: A Dynamical Theory of Continual Learning

57. ❌ The Impact of Steering Large Language Models with Persona Vectors in Educational Applications