📊 ArXiv Research Report (2026-04-17)

Generated: 2026-04-17 09:36:17 Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
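
As a rough illustration of the passing-score formula in the scoring settings, the following Python sketch recomputes the threshold from the keyword weights. The function name `pass_threshold` and its parameter names are illustrative and do not come from the report generator; only the constants 5.0 and 0.8 and the 27 weights of 1.0 are taken from the report.

```python
def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing score as given in the report: base + factor * total weight."""
    return base + factor * total_weight


# The report lists 27 keywords, each with weight 1.0.
weights = [1.0] * 27
total_weight = sum(weights)

threshold = pass_threshold(total_weight)
print(f"total weight = {total_weight}, passing score = {threshold:.1f}")
```

With a total weight of 27.0 this gives 5.0 + 21.6 = 26.6, matching the current passing score. Adding or re-weighting keywords would shift the threshold linearly.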

📈 Paper statistics

Total fetched: 288 papers
Passing papers: 11 (3.8%)

⭐ Detailed analysis of passing papers

1. Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Mode

Authors: Dhruv Sahnan, Subhabrata Dutta, Tanmoy Chakraborty, Preslav Nakov, Iryna Gurevych Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13706v1

Score: 64.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

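Scoring rationale: The paper proposes the Co-FactChecker framework, which focuses on human-AI collaborative claim verification using large language models (LLMs) and large reasoning models (LRMs). The core relevant keywords are: LLMs (mentioned directly); Chain of Thought/System 2 Thinking (the model's thinking trace serves as a shared scratchpad for multi-step reasoning); Hallucination Mitigation/Factuality (the focus is claim verification and factuality); Self-Correction (expert feedback steers the model's reasoning); LLM Agents (a human-AI collaboration framework); and Explainable AI (it produces more interpretable thinking traces). The remaining keywords, such as MoE, SLMs, training techniques, optimization methods, and specific application domains, are not covered by the paper and therefore score 0.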
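
The total for this paper can be cross-checked against its breakdown table. The sketch below assumes that each keyword contributes weight × relevance, capped at the per-keyword maximum of 15 from the scoring settings. The report does not state this formula explicitly (every weight is 1.0, so the product equals the relevance), and the `>=` comparison against the passing score is also an assumption. The names `paper_score` and `MAX_PER_KEYWORD` are hypothetical.

```python
MAX_PER_KEYWORD = 15.0  # "Maximum score per keyword" from the scoring settings
PASSING_SCORE = 26.6    # current passing score from the report


def paper_score(rows):
    """Sum per-keyword scores; each row is (weight, relevance).

    Assumption: a keyword contributes weight * relevance, capped at the maximum.
    """
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Non-zero rows of the Co-FactChecker table, in table order:
# LLMs, Chain of Thought, System 2 Thinking, Self-Correction,
# LLM Agents, Hallucination Mitigation, Explainable AI.
nonzero_rows = [(1.0, 10.0), (1.0, 10.0), (1.0, 10.0), (1.0, 8.0),
                (1.0, 8.0), (1.0, 10.0), (1.0, 8.0)]
rows = nonzero_rows + [(1.0, 0.0)] * 20  # the other 20 keywords score 0

total = paper_score(rows)
passed = total >= PASSING_SCORE
print(f"score = {total:.1f}, passed = {passed}")
```

Under these assumptions the seven non-zero rows sum to 64.0, which agrees with the reported score.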

!!! tip deepseek-chat TL;DR

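To address LLMs' lack of domain knowledge and deep contextual understanding in claim verification, the paper proposes Co-FactChecker, a human-AI collaboration framework that translates expert feedback into thinking-trace edits to guide the model's reasoning. Experiments show that it outperforms existing autonomous and human-AI collaboration approaches and yields higher-quality, more interpretable reasoning and verdicts.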

Abstract

Professional fact-checkers rely on domain knowledge and deep contextual understanding to verify claims. Large language models (LLMs) and large reasoning models (LRMs) lack such grounding and primarily reason from available evidence alone, creating a mismatch between expert-led and fully automated claim verification. To mitigate this gap, we posit human-AI collaboration as a more promising path forward, where expert feedback, grounded in real-world knowledge and domain expertise, guides the model’s reasoning. However, existing LRMs are hard to calibrate to natural language feedback, particularly in a multi-turn interaction setup. We propose Co-FactChecker, a framework for human-AI collaborative claim verification. We introduce a new interaction paradigm that treats the model’s thinking trace as a shared scratchpad. Co-FactChecker translates expert feedback into trace-edits that introduce targeted modifications to the trace, sidestepping the shortcomings of dialogue-based interaction. We provide theoretical results showing that trace-editing offers advantages over multi-turn dialogue, and our automatic evaluations demonstrate that Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations further show that Co-FactChecker is preferred over multi-turn dialogue, producing higher quality reasoning and verdicts along with relatively easier to interpret and more useful thinking traces.

Keywords: human-AI collaboration, claim verification, large language models, large reasoning models, thinking trace, trace-editing, fact-checking, expert feedback

2. MIND: AI Co-Scientist for Material Research

Authors: Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na, Hyunsoo Cho, Sookyung Kim Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13699v1

Score: 55.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

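Scoring rationale: The paper's core contribution is MIND, an LLM-driven multi-agent framework for automated hypothesis validation in materials research, so it is highly relevant to “Large Language Models”, “LLM Agents”, “Multi-agent Systems”, and “AI for Science” (10 points each). The system involves hypothesis refinement and debate-based validation, which gives it some relevance to the reasoning keywords (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.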

!!! tip deepseek-chat TL;DR

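The paper proposes MIND, an LLM-driven multi-agent framework that integrates machine learning interatomic potentials for automated hypothesis validation, addressing the lack of automated experimental verification in materials research, and provides a web-based user interface.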

Abstract

Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to text-based reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS-Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.

Keywords: Large Language Models, LLM Agents, Multi-agent Systems, AI for Science, Materials Research, Automated Hypothesis Validation, Machine Learning Interatomic Potentials, Scientific Discovery

3. GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

Authors: Bo Yu, Cheng Yang, Dongyang Hou, Chengfu Liu, Jiayao Liu, Chi Wang, Zhiming Zhang, Haifeng Li, Wentao Yang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13888v1

Score: 54.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

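Scoring rationale: The paper studies LLM-based agents applied to GIS, so it is highly relevant to “Large Language Models”, “LLM Agents”, and “Tool Use” (10 points each). It involves multi-step reasoning and expert cognitive workflows, which relates it to “Chain of Thought” and “System 2 Thinking” (8 points each). As an application of AI in a scientific domain, it relates to “AI for Science” (8 points). Other keywords such as MoE, quantization, and alignment are not covered (0 points).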

!!! tip deepseek-chat TL;DR

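To address the difficulty of evaluating LLM-based GIS agents on complex spatial analysis, the paper proposes the dynamic evaluation benchmark GeoAgentBench and the Plan-and-React architecture, which significantly improves multi-step reasoning and error recovery.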

Abstract

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a “Last-Attempt Alignment” strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
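
The abstract does not define how Parameter Execution Accuracy (PEA) is computed, so the following Python sketch is only one possible reading of a “Last-Attempt Alignment” strategy: when an agent retries a step, only the final attempt's parameters are compared with the reference parameters. The data layout (`step` and `params` keys), the exact-match comparison, and both function names are assumptions made for illustration and are not taken from the paper.

```python
def last_attempts(calls):
    """Keep only the final attempt for each step (later calls overwrite earlier ones)."""
    final = {}
    for call in calls:
        final[call["step"]] = call["params"]
    return final


def parameter_execution_accuracy(calls, reference):
    """Fraction of reference steps whose last-attempt parameters match exactly.

    Hypothetical metric definition; the paper's actual PEA may differ.
    """
    if not reference:
        return 0.0
    final = last_attempts(calls)
    matched = sum(1 for step, params in reference.items() if final.get(step) == params)
    return matched / len(reference)


# Toy trace: the agent retries the "buffer" step after a failed first attempt.
calls = [
    {"step": "buffer", "params": {"distance": 50}},    # first attempt
    {"step": "buffer", "params": {"distance": 500}},   # retry, the one that counts
    {"step": "clip", "params": {"layer": "roads"}},
]
reference = {"buffer": {"distance": 500}, "clip": {"layer": "rivers"}}

pea = parameter_execution_accuracy(calls, reference)
print(f"PEA = {pea:.2f}")
```

In this toy trace the retried `buffer` call matches the reference and the `clip` call does not, so one of two steps aligns. Scoring only the last attempt avoids penalizing an agent for errors it recovers from at runtime.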

Keywords: Large Language Models, LLM-based agents, GIS, spatial analysis, tool-augmented agents, multi-step reasoning, Plan-and-React, GeoAgentBench

4. LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Authors: Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr, Natasha Jaques, Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14140v1

Score: 50.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	15.0/10	15.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

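Scoring rationale: The paper studies long-horizon chain-of-thought reasoning in large language models (LLMs), so it is highly relevant to “Chain of Thought” (15 points) and “Large Language Models” (10 points). It involves long-context reasoning, which gives it some relevance to “Context Window Extension” (5 points). It tests model reasoning in complex autonomous tasks, which gives it some relevance to “LLM Agents” (5 points). The benchmark problems cover chemistry and other fields, which gives it some relevance to “AI for Science” (5 points). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.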

!!! tip deepseek-chat TL;DR

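The paper introduces the LongCoT benchmark for evaluating long-horizon chain-of-thought reasoning in large language models and finds that current frontier models (such as GPT 5.2 and Gemini 3 Pro) score below 10% accuracy, revealing a substantial gap in long-horizon reasoning.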

Abstract

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.

Keywords: LongCoT, Chain-of-Thought, Long-horizon reasoning, Language models, Benchmark, Autonomous tasks, Reasoning capabilities, Frontier models

5. Foresight Optimization for Strategic Reasoning in Large Language Models

Authors: Jiashuo Wang, Jiawen Duan, Jian Wang, Kaitao Song, Chunpu Xu, Johnny K. W. Ho, Fenggang Yu, Wenjie Li, Johan F. Hoorn Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13592v1

Score: 46.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

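Scoring rationale: The paper studies how to strengthen the strategic reasoning and decision-making of large language models (LLMs), particularly in multi-agent environments. It is therefore highly relevant to “Large Language Models”, “LLM Agents”, and “Multi-agent Systems” (10 points each). Because it concerns improving reasoning ability, it has some relevance to “Chain of Thought” and “System 2 Thinking” (8 points each). Other keywords, such as model architecture, training techniques, optimization methods, and specific application domains, are not directly covered and score 0.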

!!! tip deepseek-chat TL;DR

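To address the weak strategic reasoning of large language models in multi-agent environments, the paper proposes Foresight Policy Optimization (FoPO), which integrates opponent modeling principles to strengthen LLMs' strategic decision-making. Experiments show that FoPO significantly improves strategic reasoning across LLMs of varying sizes and origins and generalizes strongly.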

Abstract

Reasoning capabilities in large language models (LLMs) have generally advanced significantly. However, it is still challenging for existing reasoning-based LLMs to perform effective decision-making abilities in multi-agent environments, due to the absence of explicit foresight modeling. To this end, strategic reasoning, the most fundamental capability to anticipate the counterpart’s behaviors and foresee its possible future actions, has been introduced to alleviate the above issues. Strategic reasoning is fundamental to effective decision-making in multi-agent environments, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self-interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well-designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self-play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.

Keywords: Large Language Models, Strategic Reasoning, Multi-agent Environments, Foresight Policy Optimization, Decision-making, Opponent Modeling, Self-play Framework, Generalization

6. Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Sel

Authors: Qin Zhou, Guoyan Liang, Qianyi Yang, Jingyuan Chen, Sai Wu, Chang Yao, Zhe Wang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13598v1

Score: 46.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

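Scoring rationale: The paper proposes ESC-RL for radiology report generation; its core is reinforcement learning combined with LLM-assisted self-correcting preference learning. It relates to “Large Language Models” (8 points) because an LLM synthesizes the refined reports; it is highly relevant to “Self-Correction” (10 points) because self-correcting preference learning is central; it relates to “Hallucination Mitigation” (8 points) because it targets clinical faithfulness and suppresses unsupported content; and it is highly relevant to “AI for Science” (10 points) as a biomedical AI application. It has some relevance to “Alignment” and “RLHF” (5 points each) because it involves preference alignment and reinforcement learning optimization. Other keywords such as MoE, quantization, and inference acceleration are not covered and score 0.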

!!! tip deepseek-chat TL;DR

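To address rewards that lack evidence grounding and the absence of a self-improvement mechanism in radiology report generation, the study proposes Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL). Using a group-wise evidence-aware alignment reward and a self-correcting preference learning strategy, it achieves consistent gains and state-of-the-art results on public chest X-ray datasets.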

Abstract

Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.

Keywords: Reinforcement Learning, Radiology Report Generation, Evidence-aware Reward, Self-correcting Preference Learning, Clinical Faithfulness, LLM Synthesis, Chest X-ray, State-of-the-art Performance

7. DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Of

Authors: Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13902v1

Score: 45.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

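Scoring rationale: The paper proposes DiPO, a new method for optimizing the exploration-exploitation trade-off in reinforcement learning for large language models (LLMs). The content is directly about LLMs (10 points), falls within the scope of reinforcement-learning-based alignment techniques such as RLHF/DPO (10 points), and is evaluated on function calling tasks (10 points). It concerns improved reasoning, which gives it some relevance to multi-step reasoning (5 points) and in-depth reasoning (5 points), and LLM agents (5 points) are one of its application scenarios. Other keywords such as MoE, quantization, and RAG are not mentioned in the abstract and score 0.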

!!! tip deepseek-chat TL;DR

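To address the exploration-exploitation trade-off that large language models face in reinforcement learning training, the paper proposes DiPO, a fine-grained optimization method based on perplexity-space disentangling and bidirectional reward allocation, and validates it on mathematical reasoning and function calling tasks.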

摘要翻译

可验证奖励强化学习（RLVR）显著推动了大型语言模型（LLM）推理能力的发展。然而，如何有效管理探索与利用之间的权衡仍是一个关键挑战。本文深入分析了训练过程中极难与极易样本所引发的探索与利用困境，并提出了一种新的细粒度权衡机制。具体而言，我们引入了一种困惑度空间解耦策略，将样本空间划分为独立的探索（高困惑度）与利用（低困惑度）子空间，从而挖掘出需要探索-利用权衡的细粒度样本。随后，我们提出了一种对验证奖励影响最小的双向奖励分配机制，以实现困惑度引导的探索与利用，从而进行更稳定的策略优化。最后，我们在数学推理和函数调用两项主流任务上评估了所提方法，实验结果证明了该方法的优越性，并证实了其通过细粒度探索-利用权衡来提升LLM性能的有效性。

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.

Keywords: Reinforcement Learning, Large Language Models, Exploration-Exploitation Trade-Off, Perplexity Space, Policy Optimization, Mathematical Reasoning, Function Calling, Verifiable Rewards
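
The abstract describes dividing samples into a high-perplexity (exploration) subspace and a low-perplexity (exploitation) subspace. A minimal sketch of that partition follows, using the standard definition of perplexity as the exponential of the negative mean token log-probability. The function names, the sample data, and the `threshold` cut-off are assumptions for illustration; the abstract does not say how DiPO chooses the boundary or how the bidirectional reward allocation is computed, so neither is modeled here.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def split_by_perplexity(samples, threshold):
    """Partition (name, token_logprobs) pairs into two lists.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    Returns (exploration, exploitation): high- and low-perplexity samples.
    """
    explore, exploit = [], []
    for name, logprobs in samples:
        if perplexity(logprobs) > threshold:
            explore.append(name)
        else:
            exploit.append(name)
    return explore, exploit

# Made-up per-token log-probabilities for two sampled responses.
samples = [
    ("confident", [-0.1, -0.2, -0.1]),
    ("uncertain", [-2.0, -1.5, -2.5]),
]
explore, exploit = split_by_perplexity(samples, threshold=2.0)
print("exploration:", explore)
print("exploitation:", exploit)
```

With these made-up numbers the second response has a mean log-probability of -2.0, so its perplexity is e^2 and it lands in the exploration group.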

8. Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large

Authors: Xiaohe Li, Jiahao Li, Kaixin Zhang, Yuqiang Fang, Leilei Lin, Hong Wang, Haohua Wu, Zide Fan Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14044v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

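Scoring rationale: The paper's core contribution is Delta-LLaVA, a multimodal large language model (MLLM) framework designed for multi-temporal remote sensing interpretation, which applies large models to a scientific domain (remote sensing / earth observation). It is therefore highly relevant to “Large Language Models” and “AI for Science” (10 points each). The paper touches on model training (pre-training / fine-tuning) and complex reasoning (multi-step and in-depth reasoning), but these are not its core, so each gets 5 points. Other keywords (MoE, quantization, RAG, etc.) are not mentioned in the abstract and get 0.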

!!! tip deepseek-chat TL;DR

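Addressing the “temporal blindness” of multimodal large language models in remote sensing change understanding, the paper proposes the Delta-LLaVA framework, which adds modules such as Change-Enhanced Attention and significantly outperforms existing generalist MLLMs and specialized segmentation models in complex change reasoning and high-precision boundary localization.

Abstract translation (Chinese)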

尽管多模态大语言模型（MLLMs）在通用视觉-语言任务中表现出色，但其在遥感变化理解中的应用却受到一种根本性的“时间盲区”的阻碍。现有架构缺乏内在的多时相对比推理机制，且难以实现精确的空间定位。为解决这一问题，我们首先引入了Delta-QA，这是一个包含18万个视觉问答样本的综合基准。Delta-QA统一了双时相和三时相场景下的像素级分割与视觉问答，将变化解译构建为四个递进的认知维度。在方法论上，我们提出了Delta-LLaVA，这是一个专为多时相遥感解译设计的新型MLLM框架。它通过三项核心创新克服了简单特征拼接的局限：一个系统性地隔离并增强视觉差异的“变化增强注意力”模块；一个利用变化先验嵌入（Change Prior Embedding）来提取可微分差异特征作为大语言模型（LLM）输入的Change-SEG模块；以及用于防止跨时相上下文泄漏的局部因果注意力（Local Causal Attention）。大量实验表明，Delta-LLaVA在复杂变化推理和高精度边界定位方面，显著超越了领先的通用MLLMs和专用分割模型，从而为地球观测智能建立了一个统一的框架。

Abstract

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental “temporal blindness”. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

Keywords: Multimodal Large Language Models, Remote Sensing, Change Detection, Temporal Blindness, Delta-LLaVA, Change-Enhanced Attention, Visual Question Answering, Earth Observation

9. Young people’s perceptions and recommendations for conversational generative artificial intelligence

Authors: Adam Poulsen, Ian B. Hickie, Carla Gorban, Zsofi de Haan, William Capon, Ebenezer Eyeson-Annan, Jalal Radwan, Elizabeth M. Scott, Frank Iorfino, Haley M. LaMonica Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13381v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

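Scoring rationale: The paper studies conversational generative AI (genAI chatbots) applied to youth mental health, which is application research on large models in a specific domain (healthcare / mental health). It is moderately related to general large-model technology such as LLMs (5 points), since genAI chatbots are typically built on LLMs. It concerns AI in a science/health setting (AI for Science), hence 5 points. It discusses users' trust in AI, transparency, ethics, and design requirements, which relate to “Alignment” (value alignment), “Explainable AI”, “Hallucination Mitigation”, and “LLM Agents”; these are application-level considerations rather than the paper's technical core, so each gets 5 points. The remaining keywords concern specific technical principles, training methods, or optimization techniques that the paper does not address, so they get 0.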

!!! tip deepseek-chat TL;DR

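Through co-design workshops, the study explores young people's views of and needs for generative AI chatbots in mental health services, and offers key recommendations for redesigning the Mia chatbot and integrating it into services.

Abstract translation (Chinese)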

对话式生成人工智能代理（或称生成式AI聊天机器人）可能对青少年心理健康有益，但年轻人的视角仍未得到充分探索。本研究以最初为澳大利亚青少年服务机构专业人员设计的心理健康智能代理（Mia）为例展开调查。通过协同设计后，32名年轻人参与了线上研讨会，探讨他们对生成式AI聊天机器人应用于青少年心理健康的看法，并就如何为消费者重新设计Mia以及将其整合到服务中提出建议。研究归纳出四个主题：（1）在人性化AI的同时不使关怀去人性化；（2）我需要了解其运作原理；（3）合适的工具、场景与时机？（4）在安全环境中实现个性化定制。本研究揭示了年轻人对生成式AI聊天机器人应用于青少年心理健康的态度、需求及要求，对服务整合具有重要启示。此外，通过协同设计系统需求，这项工作为青少年心理健康领域生成式AI聊天机器人的伦理规范、设计开发、实施应用与治理监管提供了参考依据。

Abstract

Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people’s perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. Following co-design, 32 young people participated in online workshops exploring their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what’s under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people’s attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.

Keywords: conversational generative AI, youth mental health, genAI chatbot, co-design, service integration, ethics, user perceptions, Mental health Intelligence Agent

10. DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

Authors: Hengye Lyu, Zisu Li, Yue Hong, Yueting Weng, Jiaxin Shi, Hanwang Zhang, Chen Liang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13509v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

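Scoring rationale: The paper proposes RTR-DiT, a real-time video stylization framework built on Diffusion Transformer. Its core innovations are: (1) distilling a teacher model into an autoregressive model through post-training (including SFT); (2) a KV cache update strategy that enables stable long-video processing and real-time style switching. It is therefore highly relevant to “Post-training” OR “Supervised Fine-tuning” OR “SFT” (10 points), because the paper explicitly uses post-training and SFT for distillation; highly relevant to “KV Cache Compression” OR “Linear Attention” OR “FlashAttention” (10 points), because it introduces a KV cache update strategy to improve long-sequence processing; and highly relevant to “Speculative Decoding” OR “Inference Acceleration” (10 points), because its goal of real-time video stylization is essentially inference acceleration. Other keywords (LLMs, MoE, Scaling Laws, etc.) are unrelated to the paper's topic of diffusion-based video generation and score 0. The weighted total is 10×1 + 10×1 + 10×1 = 30 points.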
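
The weighted total above, and the pass mark from the report's configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each), can be reproduced with a short sketch. The function names are illustrative and are not taken from the tool that generated this report; only the three non-zero rows for this paper are spelled out, and the remaining 24 keywords are filled in with zero relevance.

```python
def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight

def weighted_total(rows):
    """Sum weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in rows)

# Three keywords score 10 for this paper; the other 24 score 0.
rows = [(1.0, 10.0)] * 3 + [(1.0, 0.0)] * 24

total = weighted_total(rows)
threshold = pass_mark(sum(weight for weight, _ in rows))
print(f"total={total:.1f} threshold={threshold:.1f} passed={total >= threshold}")
```

The same two functions apply to any other score table in this report by swapping in that paper's (weight, relevance) rows.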

!!! tip deepseek-chat TL;DR

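The paper addresses the poor stability and high computational cost of existing diffusion-based video stylization methods on long videos. It proposes the RTR-DiT framework, which uses post-training distillation and a KV cache strategy to achieve real-time, high-quality video stylization and outperforms existing methods in experiments.

Abstract translation (Chinese)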

视频生成模型的最新进展显著加速了视频生成及相关下游任务的发展。其中，视频风格化在沉浸式应用与艺术创作等领域具有重要的研究价值，受到广泛关注。然而，现有的基于扩散模型的视频风格化方法在处理长视频时难以保持稳定性和一致性，且其高计算成本与多步去噪过程使其难以应用于实际场景。本工作提出RTR-DiT（以DiT作为实时重渲染器），这是一个基于扩散Transformer的流式视频风格化框架。我们首先在精选的视频风格化数据集上微调了一个双向教师模型，该模型支持文本引导与参考图引导的视频风格化任务，随后通过结合自强制与分布匹配蒸馏的后训练方法，将其蒸馏为少步自回归模型。此外，我们提出了一种参考保持的KV缓存更新策略，该策略不仅能够稳定、一致地处理长视频，还支持在文本提示与参考图像之间实时切换风格。实验结果表明，RTR-DiT在文本引导和参考图引导的视频风格化任务中，在量化指标与视觉质量上均优于现有方法，并在实时长视频风格化与交互式风格切换应用中展现出优异性能。

Abstract

Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.

Keywords: video stylization, Diffusion Transformer, real-time, KV cache, autoregressive model, post-training, inference acceleration, long video processing

11. Don’t Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Model

Authors: Ami Baid, Zihui Xue, Kristen Grauman Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14129v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

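Scoring rationale: The paper proposes Audio-Contrastive Preference Optimization (ACPO), which targets cross-modal hallucination in Audio-Visual Language Models (AVLMs), in particular video-driven audio hallucination. The method is a preference optimization technique and, as a new preference learning framework, is highly relevant to “RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO” (10 points). Its central goal is mitigating hallucination, so it is highly relevant to “Hallucination Mitigation” OR “Factuality” OR “Truthfulness” (10 points). AVLMs apply large language models to the audio-visual domain, which gives a fairly strong link to “Large Language Models” OR “LLMs” OR “Foundation Models” (8 points). The other keywords (MoE, SLMs, Scaling Laws, Pre-training, SFT, Instruction Tuning, PEFT, RAG, Context Window, KV Cache, Reasoning, Agents, Tool Use, Multi-agent, Quantization, Speculative Decoding, Interpretability, World Models, Model Merging, In-context Learning, AI for Science) are not mentioned in the title or abstract and are unrelated to the paper (0 points).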

!!! tip deepseek-chat TL;DR

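Targeting video-driven audio hallucination in audio-visual language models, the paper proposes Audio-Contrastive Preference Optimization (ACPO), which improves audio faithfulness and mitigates cross-modal hallucination.

Abstract translation (Chinese)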

尽管视听语言模型近年来取得了显著进展，但其可靠性受限于跨模态幻觉问题。其中一种尤为普遍的表现形式是视频驱动的音频幻觉：模型常常利用视觉捷径来幻觉预期声音，而丢弃真实的听觉证据。为应对这种根深蒂固的视觉主导倾向，我们提出了音频对比偏好优化方法。这一双轴偏好学习框架引入了输出对比目标，以惩罚伪装成音频事实的视觉描述；同时设计了输入对比目标，通过替换音轨来显式惩罚对真实听觉信号不敏感的生成行为。大量实验表明，ACPO能够建立高度可靠的音频基础，在保持整体多模态能力的同时有效缓解音频幻觉现象。

Abstract

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.

Keywords: Audio-Visual Language Models, AVLMs, cross-modal hallucination, audio hallucination, preference optimization, Audio-Contrastive Preference Optimization, ACPO, audio grounding
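
For readers unfamiliar with preference optimization, the sketch below shows the standard Direct Preference Optimization (DPO) loss for a single preference pair. It is not ACPO: the abstract does not specify how ACPO builds its output-contrastive and input-contrastive pairs or how it combines the two objectives. The log-probability values and `beta` are made-up illustration inputs.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Arguments are sequence log-probabilities under the trained policy and a
    frozen reference model. The loss is -log(sigmoid(beta * margin)), where
    margin compares how much more the policy favors the chosen response over
    the rejected one, relative to the reference model.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# The policy matches the reference exactly, so the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# The policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-3.0, -7.0, -5.0, -5.0)
print(neutral, improved)
```

In a setting like the one the abstract describes, the chosen response would be the audio-grounded answer and the rejected one a visually driven description, but that mapping is an assumption rather than something stated in the source.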

📋 All papers

1. ✅ Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Models

Authors: Dhruv Sahnan, Subhabrata Dutta, Tanmoy Chakraborty, Preslav Nakov, Iryna Gurevych Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13706v1

Score: 64.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	8.0/10	8.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	8.0/10	8.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	8.0/10	8.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

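To address the lack of domain knowledge and deep contextual understanding in LLM-based claim verification, the paper proposes Co-FactChecker, a human-AI collaboration framework that turns expert feedback into edits of the thinking trace to guide model reasoning. Experiments show that it outperforms existing autonomous and human-AI collaborative methods and yields higher-quality, more interpretable reasoning and verdicts.

Abstract translation (Chinese)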

专业事实核查人员依赖领域知识和深层语境理解来验证主张。大型语言模型（LLMs）与大型推理模型（LRMs）缺乏这种基础，主要仅基于现有证据进行推理，这导致了专家主导与全自动化主张验证之间的不匹配。为弥合这一差距，我们提出人机协作是一条更具前景的路径，其中基于现实世界知识和领域专长的专家反馈可指导模型的推理过程。然而，现有的LRMs难以根据自然语言反馈进行校准，尤其是在多轮交互场景中。我们提出Co-FactChecker——一个用于人机协作主张验证的框架。我们引入了一种新的交互范式，将模型的思维轨迹视为共享草稿板。Co-FactChecker将专家反馈转化为轨迹编辑，从而对思维轨迹进行针对性修改，规避了基于对话的交互方式的缺陷。我们提供了理论分析，表明轨迹编辑相较于多轮对话具有优势；自动评估结果证明，Co-FactChecker在性能上超越了现有的自主验证及人机协作方法。人工评估进一步显示，相较于多轮对话，用户更倾向于选择Co-FactChecker，因其能产生更高质量的推理与判定结论，同时提供相对更易解读、更有价值的思维轨迹。

Abstract

Professional fact-checkers rely on domain knowledge and deep contextual understanding to verify claims. Large language models (LLMs) and large reasoning models (LRMs) lack such grounding and primarily reason from available evidence alone, creating a mismatch between expert-led and fully automated claim verification. To mitigate this gap, we posit human-AI collaboration as a more promising path forward, where expert feedback, grounded in real-world knowledge and domain expertise, guides the model’s reasoning. However, existing LRMs are hard to calibrate to natural language feedback, particularly in a multi-turn interaction setup. We propose Co-FactChecker, a framework for human-AI collaborative claim verification. We introduce a new interaction paradigm that treats the model’s thinking trace as a shared scratchpad. Co-FactChecker translates expert feedback into trace-edits that introduce targeted modifications to the trace, sidestepping the shortcomings of dialogue-based interaction. We provide theoretical results showing that trace-editing offers advantages over multi-turn dialogue, and our automatic evaluations demonstrate that Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations further show that Co-FactChecker is preferred over multi-turn dialogue, producing higher quality reasoning and verdicts along with relatively easier to interpret and more useful thinking traces.

Keywords: human-AI collaboration, claim verification, large language models, large reasoning models, thinking trace, trace-editing, fact-checking, expert feedback

2. ✅ MIND: AI Co-Scientist for Material Research

Authors: Geonhee Ahn, Donghyun Lee, Hayoung Doo, Jonggeol Na, Hyunsoo Cho, Sookyung Kim Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13699v1

Score: 55.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	5.0/10	5.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

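Scoring rationale: The paper's core is MIND, an LLM-driven multi-agent framework for automated hypothesis validation in materials research, so it is highly relevant to “Large Language Models”, “LLM Agents”, “Multi-agent Systems”, and “AI for Science” (10 points each). The system involves hypothesis refinement and debate-based validation, which gives some relation to the reasoning keywords (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.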

!!! tip deepseek-chat TL;DR

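The paper proposes MIND, an LLM-driven multi-agent framework that integrates machine learning interatomic potentials for automated hypothesis validation, addressing the lack of automated experimental verification in materials research, and provides an extensible web interface.

Abstract translation (Chinese)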

大语言模型（LLMs）已赋能面向科学发现的智能体AI系统，但多数方法仍局限于基于文本的推理，缺乏自动化实验验证。我们提出MIND，一个用于材料研究中自动化假设验证的LLM驱动框架。MIND将科学发现过程组织为多智能体流程中的假设细化、实验执行和基于辩论的验证。为实现实验验证，该系统集成了机器学习原子间势（Machine Learning Interatomic Potentials），特别是SevenNet-Omni，以实现可扩展的计算机模拟实验。我们还提供了基于网络的用户界面以支持自动化假设测试。其模块化设计允许集成更多实验模块，使该框架能适应更广泛的科学工作流程。代码发布于：https://github.com/IMMS-Ewha/MIND，演示视频可见：https://youtu.be/lqiFe1OQzN4。

Abstract

Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to text-based reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. The code is available at: https://github.com/IMMS-Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.

3. ✅ GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis

Score: 54.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	8.0/10	8.0

!!! tip deepseek-chat TL;DR

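To address the difficulty of evaluating LLM-based GIS agents on complex spatial analysis, the paper proposes the dynamic evaluation benchmark GeoAgentBench and the Plan-and-React architecture, which significantly improves multi-step reasoning and error recovery.

Abstract translation (Chinese)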

将大型语言模型（LLMs）集成到地理信息系统（GIS）中，标志着空间分析向自主化发展的范式转变。然而，由于地理空间工作流程具有复杂、多步骤的特性，评估这些基于LLM的智能体仍然面临挑战。现有基准测试主要依赖静态文本或代码匹配，忽略了动态的运行时反馈以及空间输出的多模态特性。为弥补这一不足，我们提出了GeoAgentBench（GABench），这是一个为工具增强型GIS智能体量身定制的动态交互式评估基准。GABench提供了一个真实的执行沙箱，集成了117个原子级GIS工具，覆盖了6个核心GIS领域的53类典型空间分析任务。我们认识到，在动态GIS环境中，精确的参数配置是执行成功的主要决定因素，因此设计了参数执行准确度（Parameter Execution Accuracy, PEA）指标，该指标采用“末次尝试对齐”策略来量化隐式参数推断的保真度。作为补充，我们提出了一种基于视觉语言模型（Vision-Language Model, VLM）的验证方法，用于评估数据空间准确性和制图风格的符合度。此外，针对因参数错配和运行时异常导致的频繁任务失败，我们开发了一种新颖的智能体架构——规划与反应（Plan-and-React），该架构通过将全局编排与逐步反应式执行解耦，模拟了专家的认知工作流程。对七个代表性LLM的大量实验表明，规划与反应范式显著优于传统框架，在逻辑严谨性和执行鲁棒性之间实现了最佳平衡，尤其是在多步骤推理和错误恢复方面。我们的研究结果揭示了当前的能力边界，并为评估和推进下一代自主地理人工智能（GeoAI）建立了坚实的标准。

Abstract

The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a “Last-Attempt Alignment” strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.

Keywords: Large Language Models, LLM-based agents, GIS, spatial analysis, tool-augmented agents, multi-step reasoning, Plan-and-React, GeoAgentBench
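
The Plan-and-React idea described in the abstract, a global plan executed with step-wise reactions to runtime errors, can be pictured with a small loop. The sketch below is only an illustration under assumptions: `plan_and_react`, `buffer`, and `fix_sign` are hypothetical names, the `repair` callback stands in for whatever an LLM would do after an error, and none of it is taken from the GABench code.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, dict]  # (tool name, keyword parameters)


def plan_and_react(plan: List[Step],
                   tools: Dict[str, Callable[..., object]],
                   repair: Callable[[Step, Exception], Step],
                   max_retries: int = 2) -> List[object]:
    """Run a fixed global plan; react to runtime errors one step at a time."""
    results = []
    for step in plan:
        for attempt in range(max_retries + 1):
            name, params = step
            try:
                results.append(tools[name](**params))
                break  # step succeeded, move on to the next planned step
            except Exception as err:
                if attempt == max_retries:
                    raise  # out of retries: surface the failure
                # Reactive fix of this step only; the plan order is unchanged.
                step = repair(step, err)
    return results


def buffer(distance: float) -> float:
    """Hypothetical GIS tool that rejects non-positive distances."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return distance * 2


def fix_sign(step: Step, err: Exception) -> Step:
    """Hypothetical repair: retry with the absolute value of the distance."""
    name, params = step
    return name, {**params, "distance": abs(params["distance"])}


print(plan_and_react([("buffer", {"distance": -5.0})], {"buffer": buffer}, fix_sign))
```

The point of the separation is that a bad parameter triggers a local retry rather than a full replan, which is the kind of error recovery the abstract highlights.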

4. ✅ LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Score: 50.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	15.0/10	15.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	10.0/10	10.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0
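
The totals in these score tables can be checked by hand. The sketch below assumes that each row's score is weight × relevance, capped at the per-keyword maximum of 15 stated in the report configuration, and that the pass mark follows the configured formula 5.0 + 0.8 × total weight. The function names are hypothetical, and only the non-zero rows of the LongCoT table are listed.

```python
from typing import List, Tuple


def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as configured in the report: base + factor * total weight."""
    return base + factor * total_weight


def total_score(rows: List[Tuple[float, float]], cap: float = 15.0) -> float:
    """Sum of weight * relevance per keyword, capped per keyword (assumed rule)."""
    return sum(min(weight * relevance, cap) for weight, relevance in rows)


# Non-zero (weight, relevance) rows from the LongCoT table above.
longcot_rows = [(1.0, 10.0), (1.0, 5.0), (1.0, 15.0), (1.0, 10.0), (1.0, 5.0), (1.0, 5.0)]

score = total_score(longcot_rows)
threshold = pass_mark(27.0)  # 27 keywords, each with weight 1.0
print(score, round(threshold, 1), score >= threshold)
```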

!!! tip deepseek-chat TL;DR

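The paper proposes the LongCoT benchmark to evaluate the long chain-of-thought reasoning ability of large language models. It finds that current frontier models (GPT 5.2 and Gemini 3 Pro) score below 10% accuracy, exposing a significant shortfall in long-horizon reasoning.

Abstract translation

As language models are increasingly applied to complex autonomous tasks, their ability to reason accurately over long horizons becomes critical. A core component of this ability is planning and managing long, complex chains of thought. This paper proposes LongCoT, a scalable benchmark of 2,500 expert-designed problems covering chemistry, mathematics, computer science, chess, and logic, intended to isolate and directly measure the long-horizon chain-of-thought reasoning ability of frontier models. Each problem consists of a short input and a verifiable answer; solving it requires navigating a graph of interdependent steps involving tens of thousands to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect limitations in long-horizon reasoning. At release, the best models achieve below 10% accuracy on LongCoT (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%), revealing a significant gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning and can track frontier models’ ability to reason reliably over extended periods.

Abstract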

As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.

Keywords: LongCoT, Chain-of-Thought, Long-horizon reasoning, Language models, Benchmark, Autonomous tasks, Reasoning capabilities, Frontier models
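
One way to see why individually tractable steps can still give low end-to-end accuracy is a simple compounding model. The sketch below assumes every step succeeds independently with the same probability. That is a simplification introduced here for illustration, not a model from the LongCoT paper, and `chain_success` is a hypothetical helper.

```python
def chain_success(step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent, identical steps."""
    return step_accuracy ** num_steps


# Even a 99%-reliable step compounds quickly as the chain grows.
for num_steps in (10, 100, 1000):
    print(num_steps, chain_success(0.99, num_steps))
```

Under this assumption the success rate decays geometrically with chain length, so a benchmark built from long dependency graphs can separate models that look similar on short problems.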

5. ✅ Foresight Optimization for Strategic Reasoning in Large Language Models

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	8.0/10	8.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

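To address the weak strategic reasoning of large language models in multi-agent environments, the paper proposes Foresight Policy Optimization (FoPO), which incorporates opponent-modeling principles to strengthen LLMs’ strategic decision-making. Experiments show that FoPO significantly improves strategic reasoning across LLMs of different sizes and origins and generalizes strongly.

Abstract translation

The reasoning capabilities of large language models (LLMs) have generally advanced significantly. However, because they lack explicit foresight modeling, existing reasoning-based LLMs still struggle to make effective decisions in multi-agent environments. Strategic reasoning, the most fundamental capability to anticipate a counterpart’s behavior and foresee its possible future actions, has therefore been introduced to alleviate this problem. Strategic reasoning underpins effective decision-making in multi-agent environments, yet existing reasoning-enhancement methods for LLMs do not explicitly capture its foresight nature. This work proposes Foresight Policy Optimization (FoPO) to strengthen strategic reasoning in LLMs. The method integrates opponent-modeling principles into policy optimization, so that both self-interest and the counterpart’s influence are considered explicitly. Specifically, we construct two curated datasets, Cooperative RSA and Competitive Taboo, with well-designed rules and moderate difficulty, to support a systematic study of FoPO in a self-play framework. Experiments show that FoPO significantly improves strategic reasoning across LLMs of different sizes and origins. Moreover, models trained with FoPO generalize strongly to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.

Abstract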

Reasoning capabilities in large language models (LLMs) have generally advanced significantly. However, it is still challenging for existing reasoning-based LLMs to perform effective decision-making abilities in multi-agent environments, due to the absence of explicit foresight modeling. To this end, strategic reasoning, the most fundamental capability to anticipate the counterpart’s behaviors and foresee its possible future actions, has been introduced to alleviate the above issues. Strategic reasoning is fundamental to effective decision-making in multi-agent environments, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self-interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well-designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self-play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.

Keywords: Large Language Models, Strategic Reasoning, Multi-agent Environments, Foresight Policy Optimization, Decision-making, Opponent Modeling, Self-play Framework, Generalization

6. ✅ Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference Learning

Authors: Qin Zhou, Guoyan Liang, Qianyi Yang, Jingyuan Chen, Sai Wu, Chang Yao, Zhe Wang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13598v1

Score: 46.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	5.0/10	5.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	10.0/10	10.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	8.0/10	8.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

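To address rewards that lack evidence grounding and the absence of a self-improvement mechanism in radiology report generation, the study proposes Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL). Using a group-wise evidence-aware alignment reward and a self-correcting preference learning strategy, it reports consistent gains and state-of-the-art results on two public chest X-ray datasets.

Abstract translation

Recent reinforcement learning methods have advanced radiology report generation, but two core limitations remain: (1) report-level rewards provide limited evidence-grounded guidance for clinical faithfulness; and (2) existing methods lack an explicit self-improvement mechanism aligned with clinical preference. We propose clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), which has two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) provides group-wise, evidence-aware feedback: it reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically builds a reliable, disease-aware preference dataset from multiple noisy observations and uses an LLM to synthesize refined reports without human supervision. The framework promotes clinically faithful, disease-aligned rewards and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets show consistent gains and state-of-the-art performance.

Abstract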

Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
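
The three groups that GEAR distinguishes (true positives, false negatives, false positives) can be made concrete with set operations on findings. The sketch below is a toy illustration only: the reward formula, the weights, and the names `evidence_groups` and `groupwise_reward` are assumptions introduced here, and the abstract does not specify how GEAR is actually computed.

```python
from typing import Dict, Set


def evidence_groups(predicted: Set[str], reference: Set[str]) -> Dict[str, Set[str]]:
    """Partition findings into true positives, false negatives, false positives."""
    return {
        "tp": predicted & reference,   # reported and present
        "fn": reference - predicted,   # present but missed
        "fp": predicted - reference,   # reported but unsupported
    }


def groupwise_reward(predicted: Set[str], reference: Set[str],
                     w_tp: float = 1.0, w_fn: float = 1.0, w_fp: float = 1.0) -> float:
    """Toy reward: credit true positives, penalize misses and unsupported content."""
    groups = evidence_groups(predicted, reference)
    total = len(predicted | reference)
    if total == 0:
        return 0.0  # nothing to report and nothing reported
    raw = (w_tp * len(groups["tp"])
           - w_fn * len(groups["fn"])
           - w_fp * len(groups["fp"]))
    return raw / total


print(groupwise_reward({"edema", "effusion"}, {"effusion", "pneumothorax"}))
```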

7. ✅ DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Score: 45.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	10.0/10	10.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

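To address the exploration-exploitation trade-off that large language models face in reinforcement learning training, the paper proposes DiPO, a fine-grained optimization method based on perplexity-space disentangling and bidirectional reward allocation, and validates it on mathematical reasoning and function calling tasks.

Abstract translation

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of large language models (LLMs). However, managing the trade-off between exploration and exploitation effectively remains a key challenge. This paper analyzes in depth the exploration-exploitation dilemma caused by extremely hard and extremely easy samples during training and proposes a new fine-grained trade-off mechanism. Specifically, we introduce a perplexity-space disentangling strategy that divides the sample space into separate exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining the fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism with minimal impact on verification rewards to realize perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we evaluate the method on two mainstream tasks, mathematical reasoning and function calling; the results demonstrate its superiority and confirm that a fine-grained exploration-exploitation trade-off improves LLM performance.

Abstract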

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.

Keywords: Reinforcement Learning, Large Language Models, Exploration-Exploitation Trade-Off, Perplexity Space, Policy Optimization, Mathematical Reasoning, Function Calling, Verifiable Rewards
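
The perplexity-space split described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming perplexity is computed from per-token log-probabilities of a sampled response and that a single fixed threshold separates the two subspaces. The threshold, the sample data, and the function names are hypothetical, and the paper's bidirectional reward allocation is not reproduced here.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logprobs: List[float]) -> float:
    """exp of the mean negative log-likelihood of one sampled response."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(samples: Dict[str, List[float]],
                        threshold: float) -> Tuple[List[str], List[str]]:
    """Return (exploration ids, exploitation ids) using an assumed fixed threshold."""
    explore, exploit = [], []
    for sample_id, logprobs in samples.items():
        if perplexity(logprobs) > threshold:
            explore.append(sample_id)   # high perplexity: exploration subspace
        else:
            exploit.append(sample_id)   # low perplexity: exploitation subspace
    return explore, exploit


samples = {
    "confident": [math.log(0.5)] * 4,   # every token sampled with probability 0.5
    "uncertain": [math.log(0.1)] * 4,   # every token sampled with probability 0.1
}
explore, exploit = split_by_perplexity(samples, threshold=3.0)
print(explore, exploit)
```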

8. ✅ Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	5.0/10	5.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

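To address the “temporal blindness” of multimodal large language models in remote sensing change understanding, the paper proposes the Delta-LLaVA framework. By introducing modules such as Change-Enhanced Attention, it significantly outperforms existing generalist MLLMs and specialized segmentation models in complex change reasoning and high-precision boundary localization.

Abstract translation

Although Multimodal Large Language Models (MLLMs) perform well on general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental “temporal blindness”. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle to achieve precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark of 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi-temporal and tri-temporal scenarios and structures change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a new MLLM framework designed for multi-temporal remote sensing interpretation. It overcomes the limits of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences; a Change-SEG module that uses Change Prior Embedding to extract differentiable difference features as input to the LLM; and Local Causal Attention, which prevents cross-temporal context leakage. Extensive experiments show that Delta-LLaVA significantly outperforms leading generalist MLLMs and specialized segmentation models in complex change reasoning and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

Abstract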

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental “temporal blindness”. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

Keywords: Multimodal Large Language Models, Remote Sensing, Change Detection, Temporal Blindness, Delta-LLaVA, Change-Enhanced Attention, Visual Question Answering, Earth Observation

9. ✅ Young people’s perceptions and recommendations for conversational generative artificial intelligence in youth mental health

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	5.0/10	5.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	5.0/10	5.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	5.0/10	5.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

!!! tip deepseek-chat TL;DR

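Through co-design workshops, the study explores young people’s views and needs regarding generative AI chatbots in mental health services and offers key recommendations for redesigning the Mia chatbot and integrating it into services.

Abstract translation

Conversational generative AI agents (genAI chatbots) could benefit youth mental health, but young people’s perspectives remain underexplored. This study examines the Mental health Intelligence Agent (Mia), originally designed for professionals in Australian youth services. Following co-design, 32 young people took part in online workshops to explore their perceptions of genAI chatbots in youth mental health and to recommend how Mia could be redesigned for consumers and integrated into services. Four themes emerged: (1) Humanising AI without dehumanising care; (2) I need to know what’s under the hood; (3) Right tool, right place, right time?; and (4) Making it mine on safe ground. The study reveals young people’s attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with important implications for service integration. In addition, by co-designing system requirements, the work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health.

Abstract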

Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people’s perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. Following co-design, 32 young people participated in online workshops exploring their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what’s under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people’s attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.

Keywords: conversational generative AI, youth mental health, genAI chatbot, co-design, service integration, ethics, user perceptions, Mental health Intelligence Agent

10. ✅ DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

Authors: Hengye Lyu, Zisu Li, Yue Hong, Yueting Weng, Jiaxin Shi, Hanwang Zhang, Chen Liang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13509v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	10.0/10	10.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	10.0/10	10.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

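The paper addresses the poor stability and high computational cost of existing diffusion-based video stylization methods on long videos. It proposes the RTR-DiT framework, which uses post-training distillation and a KV cache strategy to achieve real-time, high-quality video stylization and outperforms existing methods in experiments.

Abstract translation

Recent advances in video generation models have significantly accelerated progress in video generation and related downstream tasks. Among these, video stylization has important research value in areas such as immersive applications and artistic creation and has attracted wide attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency on long videos, and their high computational cost and multi-step denoising make them hard to apply in practice. This work proposes RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built on the Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided stylization, and then distill it into a few-step autoregressive model through post-training that combines Self Forcing and Distribution Matching Distillation. We also propose a reference-preserving KV cache update strategy that both enables stable, consistent processing of long videos and supports real-time style switching between text prompts and reference images. Experiments show that RTR-DiT outperforms existing methods on both text-guided and reference-guided video stylization in quantitative metrics and visual quality, and performs well in real-time long-video stylization and interactive style switching.

Abstract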

Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.

Keywords: video stylization, Diffusion Transformer, real-time, KV cache, autoregressive model, post-training, inference acceleration, long video processing
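
A reference-preserving cache update can be pictured as pinned reference entries plus a rolling window of recent frame entries. The sketch below rests on that reading of the abstract and is not the RTR-DiT implementation: the class name, the window size, and the use of plain strings in place of key/value tensors are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, List


class ReferencePreservingCache:
    """Toy KV cache: reference entries stay pinned, frame entries roll over."""

    def __init__(self, reference_entries: Iterable[str], window: int) -> None:
        self.reference: List[str] = list(reference_entries)  # never evicted
        self.recent: deque = deque(maxlen=window)             # oldest frames drop out

    def append(self, frame_entry: str) -> None:
        """Add the newest frame entry; deque evicts the oldest when full."""
        self.recent.append(frame_entry)

    def swap_reference(self, new_entries: Iterable[str]) -> None:
        """Switch style by replacing only the pinned reference entries."""
        self.reference = list(new_entries)

    def context(self) -> List[str]:
        """Entries the next autoregressive step would attend to."""
        return self.reference + list(self.recent)


cache = ReferencePreservingCache(["style_ref_A"], window=2)
for frame in ("frame_1", "frame_2", "frame_3"):
    cache.append(frame)
print(cache.context())

cache.swap_reference(["style_ref_B"])
print(cache.context())
```

Keeping the window bounded holds memory constant on long videos, while swapping only the pinned entries is one plausible way to change style mid-stream without discarding recent frame context.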

11. ✅ Don’t Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

Authors: Ami Baid, Zihui Xue, Kristen Grauman Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14129v1

Score: 28.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

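To address video-driven audio hallucination in audio-visual language models, the paper proposes Audio-Contrastive Preference Optimization (ACPO), which improves audio faithfulness and mitigates cross-modal hallucination.

Abstract translation

Although Audio-Visual Language Models (AVLMs) have made remarkable progress in recent years, their reliability is limited by cross-modal hallucination. A particularly common form is video-driven audio hallucination: models often exploit visual shortcuts to hallucinate the expected sounds while discarding the true auditory evidence. To counter this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective that penalizes visual descriptions masquerading as audio facts, together with an input-contrastive objective that swaps audio tracks to explicitly penalize generation that is insensitive to the true auditory signal. Extensive experiments show that ACPO establishes highly reliable audio grounding and effectively mitigates audio hallucination while preserving overall multimodal capabilities.

Abstract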

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.

Keywords: Audio-Visual Language Models, AVLMs, cross-modal hallucination, audio hallucination, preference optimization, Audio-Contrastive Preference Optimization, ACPO, audio grounding

12. ❌ C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Authors: Akira Kawabata, Saku Sugawara Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13618v1

Score: 26.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	8.0/10	8.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	10.0/10	10.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

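Scoring rationale: The paper focuses on training methods for reward models, which falls under LLM alignment and RLHF/DPO. Its core contribution is the C2 framework, which trains a reward model and a rubric generator from binary preferences through cooperative yet critical collaboration, without external rubric annotations. It is therefore highly relevant to "RLHF/DPO" (10 points), relevant to "Alignment" (8 points), and relevant to "Large Language Models" (8 points, since reward models are typically used for LLM alignment). Other keywords such as MoE, SLMs, Scaling Laws, PEFT, and RAG are not covered and score 0.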
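
The total for this paper can be reproduced from the table. The sketch below assumes each row's score is weight × relevance (which matches the rows shown) and uses the passing-score formula from the report's configuration, 5.0 + 0.8 × total weight. The function names are illustrative, and treating "pass" as `>=` is an assumption.

```python
# Non-zero relevance values from the score table for this paper; all other rows are 0.0.
relevance = {
    "Large Language Models": 8.0,
    "Alignment": 8.0,
    "RLHF / DPO": 10.0,
}
WEIGHT = 1.0        # every keyword in the configuration has weight 1.0
NUM_KEYWORDS = 27   # number of keywords in the configuration

def passing_score(total_weight: float) -> float:
    """Threshold from the report configuration: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight

def paper_score(relevance_by_keyword: dict, weight: float = WEIGHT) -> float:
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * r for r in relevance_by_keyword.values())

total = paper_score(relevance)                    # 8.0 + 8.0 + 10.0
threshold = passing_score(WEIGHT * NUM_KEYWORDS)  # 5.0 + 0.8 * 27
passed = total >= threshold                       # assumption: pass means >= threshold
```

With these inputs the paper totals 26.0 against a threshold of 26.6, so it falls 0.6 points short.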

!!! tip deepseek-chat TL;DR

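This paper proposes C2, a reward modeling framework in which the reward model critically collaborates with a rubric generator trained solely from binary preferences. Without external rubric annotations, it significantly improves the reliability of reward model judgments and outperforms existing methods on multiple benchmarks.

Abstract translation

Rubric-augmented verification guides reward models with explicit evaluation criteria, producing more reliable judgments than single-model verification. However, most existing methods rely on costly rubric annotations, which limits scalability. We also find that rubric generation risks a failure of cooperation: low-quality rubrics actively mislead reward models instead of helping them. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely on binary preferences. In C2, we synthesize pairs of helpful and misleading rubrics by measuring how far each rubric moves the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics and a critical verifier to assess rubric validity before making its judgment; at inference time it follows only the rubrics it deems helpful. C2 outperforms reasoning reward models trained on the same binary preference data, improving by up to 6.5 points on RM-Bench and by 6.0 points in length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B-parameter reward model to match the performance achieved with rubrics from a model 4 times larger. Overall, our work shows that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.

Abstract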

Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4$\times$ larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.
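
The abstract describes labeling rubrics by how they shift the reward model toward or away from the correct preference. A minimal sketch of that labeling step follows, assuming the shift is measured as a change in the reward model's probability of the correct preference. The `RubricEffect` type, the `min_shift` margin, and both function names are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricEffect:
    rubric: str
    p_correct_with: float  # P(correct preference) when the reward model sees this rubric

def label_rubrics(p_correct_without, effects, min_shift=0.05):
    """Split rubrics into helpful and misleading by their effect on the reward model.

    `min_shift` is an assumed margin that ignores rubrics with a negligible effect.
    """
    helpful = [e.rubric for e in effects
               if e.p_correct_with - p_correct_without >= min_shift]
    misleading = [e.rubric for e in effects
                  if p_correct_without - e.p_correct_with >= min_shift]
    return helpful, misleading

def contrastive_pairs(helpful, misleading):
    """Pair every helpful rubric with every misleading one for preference training."""
    return [(h, m) for h in helpful for m in misleading]
```

For example, with a no-rubric baseline of 0.6, a rubric that raises the probability to 0.9 would be labeled helpful and one that lowers it to 0.3 would be labeled misleading.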

Keywords: reward modeling, rubric-augmented verification, binary preferences, cooperative communication, critical collaboration, scalable framework, alignment, RLHF

13. ❌ Rhetorical Questions in LLM Representations: A Linear Probing Study

Authors: Louie Hong Yao, Vishesh Anand, Yuan Zhuang, Tianyu Jiang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14128v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

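Scoring rationale: The paper's core is an analysis of LLM internal representations using linear probing, which is innovative research into the technical principles of large models. It is highly relevant to "Large Language Models" (10 points) because the whole paper revolves around analyzing LLM representations. It is highly relevant to "Mechanistic Interpretability" (10 points) because linear probing is a typical method in explainable AI and mechanistic interpretability, and the paper aims to understand how LLMs encode rhetorical questions. Other keywords such as MoE, SFT, and RAG are neither mentioned in nor related to the abstract, so they score 0.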

!!! tip deepseek-chat TL;DR

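This study uses linear probing to analyze how LLMs internally represent rhetorical questions. It finds that rhetorical signals emerge in early layers and are stably captured by last-token representations, and that probes trained on different datasets capture different rhetorical phenomena, indicating that rhetorical questions are encoded by multiple linear directions rather than a single shared direction.

Abstract translation

Rhetorical questions are asked not to obtain information but to persuade or signal a stance. How large language models represent such questions internally remains unclear. Using linear probes, we analyze rhetorical questions in two social-media datasets with different discourse contexts and find that rhetorical signals emerge early in the model's representations and are captured most stably by last-token representations. Within a dataset, rhetorical questions are linearly separable from information-seeking questions, and they remain detectable under cross-dataset transfer, with AUROC of about 0.7-0.8. However, we find that this transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, and the overlap among their top-ranked instances is often below 0.2. Qualitative analysis shows that these differences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize local, syntax-driven interrogative acts. Taken together, these findings suggest that LLM representations encode rhetorical questions through multiple linear directions that emphasize different cues, rather than through a single shared direction.

Abstract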

Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.
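
The abstract reports two quantities: AUROC under transfer and the overlap among top-ranked instances from different probes. The sketch below shows one way to compute both from probe scores using only the standard library. The abstract does not define the overlap metric, so intersection size divided by k is an assumption, and the function names are illustrative.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the k highest-scored instances shared by two probes.

    Assumed definition: |top_k(a) & top_k(b)| / k.
    """
    def top_k(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return set(order[:k])
    return len(top_k(scores_a) & top_k(scores_b)) / k
```

Two probes can both reach a reasonable AUROC on the same corpus while agreeing on few of their highest-scored instances, which is the pattern the abstract describes.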

Keywords: Large Language Models, LLM representations, linear probing, rhetorical questions, interpretability, transfer learning, AUROC, discourse analysis

14. ❌ Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models

Authors: Aleksandr Rubashevskii, Dzianis Piatrashyn, Preslav Nakov, Maxim Panov Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13991v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

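Scoring rationale: The paper's core is the problem of factual errors in LLMs; it proposes an adaptive conformal prediction method to improve factuality assessment. It is highly relevant to "Large Language Models" and "Hallucination Mitigation" (10 points each). Other keywords, covering MoE, SLMs, Scaling Laws, training methods, reasoning techniques, agent systems, compression and acceleration, and AI for science, are not covered (0 points).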

!!! tip deepseek-chat TL;DR

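To address factual errors in LLM generations, this paper proposes an adaptive conformal prediction method that enables prompt-dependent calibration, significantly improving conditional coverage while retaining marginal coverage guarantees.

Abstract translation

Large language models (LLMs) tend to generate factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing methods are usually not prompt-adaptive, which limits their ability to capture input-dependent variability. As a result, for a given task or prompt they may filter out too few items (over-coverage) or too many (under-coverage). We propose an adaptive conformal prediction method that extends conformal score transformation techniques to LLMs and apply it to long-form generation and multiple-choice question answering. The method enables prompt-dependent calibration, improving conditional coverage while retaining marginal coverage guarantees. It also naturally supports selective prediction, so unreliable claims or answer options can be filtered out in downstream applications. We evaluate the method on multiple white-box models across domains, and the results show that it significantly outperforms existing baselines in conditional coverage.

Abstract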

Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.
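
For context, the sketch below shows standard (non-adaptive) split conformal prediction for multiple-choice answers, the kind of baseline the abstract contrasts with. It is not the paper's adaptive method. It assumes the nonconformity score is 1 minus the model's probability for an option, and the function names are illustrative.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.

    `cal_scores` are nonconformity scores of the true answers on a calibration set.
    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(cal_scores)[rank - 1]

def prediction_set(option_probs, q_hat):
    """Keep every option whose nonconformity score (1 - probability) is within q_hat."""
    return [opt for opt, p in option_probs.items() if 1.0 - p <= q_hat]
```

Here a single threshold `q_hat` is shared by all prompts, which gives marginal coverage under exchangeability but does not adapt to how hard a particular prompt is. That limitation is what prompt-dependent calibration aims to address.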

Keywords: Large language models, Factuality, Conformal prediction, Uncertainty estimation, Selective prediction, Conditional coverage, Long-form generation, Multiple-choice question answering

15. ❌ Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

Authors: Swati Rallapalli, Shannon Gallagher, Ronald Yurko, Tyler Brooks, Chuck Loughin, Michele Sezgin, Violet Turri Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14111v1

Score: 20.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	10.0/10	10.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

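Scoring rationale: The paper's core is an analysis of stylistic differences between LLM-generated and human-written text. It directly involves "Large Language Models" and "Mechanistic Interpretability" (explaining model behavior through linguistic feature analysis). Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and application domains, are neither mentioned in nor related to the abstract.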

!!! tip deepseek-chat TL;DR

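Through a large-scale analysis of text produced by 11 LLMs across 8 genres and 4 decoding strategies, this study finds that model and genre influence the style of machine-generated text more than prompting and decoding strategy, and identifies key linguistic differentiators of LLM-generated text.

Abstract translation

Large language models (LLMs) can now generate highly fluent, human-like text. They have enabled many applications while also raising concerns such as large-scale spam, phishing, and academic misuse. Although much research has focused on detecting LLM-generated text, work on understanding the stylistic differences between human-written and machine-generated text remains limited. Using Douglas Biber's set of lexicogrammatical and functional features, this study performs a large-scale analysis of stylistic variation across human-written text and text generated by 11 LLMs, covering 8 genres and 4 decoding strategies. Our findings offer insights that can guide intentional LLM usage. First, the key linguistic differentiators of LLM-generated text are fairly robust to generation conditions (for example, prompt settings that nudge the model toward human-like text, or the availability of human-written reference text whose style it can continue). Second, genre influences stylistic features more strongly than the text source itself. Third, chat variants of the models generally cluster together in stylistic space. Finally, with a few exceptions, the model affects style more than the decoding strategy does. These results highlight that model and genre matter more than prompting and decoding strategy in shaping the stylistic behavior of machine-generated text.

Abstract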

Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber’s set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space, and finally, model has a larger effect on the style than decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.

Keywords: Large Language Models, LLMs, stylistic variation, human-written text, machine-generated text, decoding strategies, genre, interpretability

16. ❌ MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems

Authors: Yi Ting Shen, Kentaroh Toyoda, Alex Leung Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13849v1

Score: 15.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

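Scoring rationale: The paper focuses on automating security threat intelligence for agentic systems in the Model Context Protocol (MCP) ecosystem and is unrelated to most keywords. It is highly relevant only to "LLM Agents" OR "Autonomous Agents" OR "Agentic Workflow" (10 points), because its core subject is security threats to MCP-based agentic systems. It has some connection to "Large Language Models" OR "LLMs" OR "Foundation Models" (5 points), because MCP and LLM application security frameworks (such as the OWASP Top 10 for LLM Applications) are referenced, but the paper does not directly study LLM technology itself. No other keywords are covered.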

!!! tip deepseek-chat TL;DR

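To address the new security threats facing agentic systems in the Model Context Protocol (MCP) ecosystem, this paper presents MCPThreatHive, an open-source platform that automates the end-to-end threat intelligence lifecycle: data collection, AI-driven threat extraction and classification, knowledge graph storage, and visualization. It fills key gaps in existing tools in compositional attack modeling, continuous threat intelligence, and unified classification.

Abstract translation

The rapid proliferation of agentic systems based on the Model Context Protocol (MCP) has introduced a new class of security threats that existing frameworks cannot yet address effectively. This paper presents MCPThreatHive, an open-source platform that automates the end-to-end lifecycle of MCP threat intelligence: from continuous, multi-source data collection, through AI-driven threat extraction and classification, to structured knowledge graph storage and interactive visualization. The platform operationalizes the MCP-38 threat taxonomy, a curated set of 38 MCP-specific threat patterns mapped to STRIDE, the OWASP Top 10 for LLM Applications, and the OWASP Top 10 for Agentic Applications. A composite risk scoring model provides quantitative prioritization. Through a comparative analysis of representative existing MCP security tools, we identify three key coverage gaps that MCPThreatHive addresses: incomplete compositional attack modeling, the absence of continuous threat intelligence, and the lack of unified multi-framework classification.

Abstract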

The rapid proliferation of Model Context Protocol (MCP)-based agentic systems has introduced a new category of security threats that existing frameworks are inadequately equipped to address. We present MCPThreatHive, an open-source platform that automates the end-to-end lifecycle of MCP threat intelligence: from continuous, multi-source data collection through AI-driven threat extraction and classification, to structured knowledge graph storage and interactive visualization. The platform operationalizes the MCP-38 threat taxonomy, a curated set of 38 MCP-specific threat patterns mapped to STRIDE, OWASP Top 10 for LLM Applications, and OWASP Top 10 for Agentic Applications. A composite risk scoring model provides quantitative prioritization. Through a comparative analysis of representative existing MCP security tools, we identify three critical coverage gaps that MCPThreatHive addresses: incomplete compositional attack modeling, absence of continuous threat intelligence, and lack of unified multi-framework classification.

Keywords: Model Context Protocol, MCP, threat intelligence, agentic systems, security threats, MCP-38 threat taxonomy, AI-driven threat extraction, knowledge graph

17. ❌ Listening Alone, Understanding Together: Collaborative Context Recovery for Privacy-Aware AI

Authors: Tanmay Srivastava, Amartya Basu, Shubham Jain, Vaishnavi Ranganathan Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13348v1

Score: 10.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

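Scoring rationale: The paper studies a privacy-preserving asynchronous assistant-to-assistant collaboration framework (CONCORD). Its core is addressing the privacy risk of always-listening AI, recovering context through real-time speaker verification, spatio-temporal context resolution, information gap detection, and relationship-aware assistant-to-assistant queries. This is unrelated to most keywords (such as LLM techniques, training methods, inference optimization, and AI-for-science applications). The only highly relevant keyword is "Multi-agent Systems" OR "Agent Coordination", because the paper explicitly studies assistant-to-assistant (A2A) collaboration, coordination, and negotiation mechanisms, which are its core contribution. No other keywords are mentioned or implied in the abstract.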

!!! tip deepseek-chat TL;DR

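This paper proposes CONCORD, a privacy-aware asynchronous assistant-to-assistant collaboration framework that recovers missing context through real-time speaker verification and safe negotiation between assistants. It achieves high-accuracy gap detection and relationship classification while preserving privacy, offering a practical path toward socially deployable proactive conversational agents.

Abstract translation

We present CONCORD, a privacy-aware asynchronous assistant-to-assistant collaboration framework that leverages cooperation between proactive speech-based AI assistants. As agents evolve from reactive assistants into always-listening ones, they face a core privacy risk (possibly capturing the speech of speakers who have not consented), which makes deploying them in social settings a challenge. To overcome this, we implement CONCORD, which enforces owner-only speech capture through real-time speaker verification and produces a one-sided transcript. This approach loses some context but preserves privacy. We show that CONCORD can safely recover the necessary context through (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal assistant-to-assistant queries governed by relationship-aware disclosure. Rather than relying on hallucination-prone inference, CONCORD treats context recovery as a safe, negotiated exchange of information between assistants. On a multi-domain dialogue dataset, CONCORD achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and a 97% true negative rate in privacy-sensitive disclosure decisions. By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.

Abstract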

We introduce CONCORD, a privacy-aware asynchronous assistant-to-assistant (A2A) framework that leverages collaboration between proactive speech-based AI. As agents evolve from reactive to always-listening assistants, they face a core privacy risk (of capturing non-consenting speakers), which makes their social deployment a challenge. To overcome this, we implement CONCORD, which enforces owner-only speech capture via real-time speaker verification, producing a one-sided transcript that incurs missing context but preserves privacy. We demonstrate that CONCORD can safely recover necessary context through (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal A2A queries governed by a relationship-aware disclosure. Instead of hallucination-prone inferring, CONCORD treats context recovery as a negotiated safe exchange between assistants. Across a multi-domain dialogue dataset, CONCORD achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and 97% true negative rate in privacy-sensitive disclosure decisions. By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.

Keywords: privacy-aware AI, assistant-to-assistant collaboration, context recovery, speaker verification, multi-agent systems, proactive conversational agents, asynchronous framework, relationship-aware disclosure

18. ❌ Driving Engagement in Daily Fantasy Sports with a Scalable and Urgency-Aware Ranking Engine

Authors: Unmesh Padalkar Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13796v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

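Scoring rationale: The paper focuses on building a scalable, time-sensitive recommendation engine with the Deep Interest Network (DIN) architecture, applied to daily fantasy sports (DFS). Its core contribution is adapting to time-sensitive scenarios through real-time urgency features and temporal positional encodings, optimized with a listwise neuralNDCG loss. The paper is unrelated to most keywords (such as LLMs, MoE, Scaling Laws, Pre-training, RLHF, and RAG), because those concern large language models, model architectures, training methods, and inference optimization, whereas this paper is a conventional deep learning recommender system application. The only relevant keyword is "Small Language Models" OR "SLMs" OR "On-device AI", because the paper mentions that the system is planned for deployment in an on-device (edge) recommendation system, which has some connection to on-device AI; however, the paper does not involve SLMs or language models, so it receives 5 points (some connection). Other keywords such as AI for Science are also irrelevant, because the paper is a commercial application rather than scientific research.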

!!! tip deepseek-chat TL;DR

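To address time-sensitive match recommendation in daily fantasy sports, this paper designs and deploys a recommendation engine based on the Deep Interest Network architecture. By injecting temporal urgency features and temporal positional encodings and combining them with a listwise neuralNDCG loss, it achieves a 9% lift in nDCG@1 over an optimized baseline on an industrial-scale dataset, and is planned for use in an on-device recommendation system.

Abstract translation

In daily fantasy sports (DFS), match participation is highly time-sensitive. Users must act within a short window before a game starts, which makes match recommendation a time-critical task for avoiding missed engagement and lost revenue. Existing recommender systems are usually designed for static item catalogs and struggle with the hard deadlines inherent in such live events. To address this, we designed and deployed a recommendation engine built on the Deep Interest Network (DIN) architecture. We adapt DIN by injecting temporality at two levels: first, real-time urgency features for each candidate match (for example, time-to-round-lock); second, temporal positional encodings that represent the time gap between each historical interaction and the current recommendation request, letting the model dynamically weigh the recency of past behavior. Combined with a listwise neuralNDCG loss, this approach produces highly relevant, urgency-aware rankings. To support industrial scale, we developed a multi-node, multi-GPU training architecture on Ray and PyTorch. The system was validated on a large industrial dataset with over 650k users and over 100B interactions, where it improves nDCG@1 by 9% over a heavily optimized LightGBM baseline with handcrafted features. The model's strong offline performance establishes its viability as a core component of our planned on-device (edge) recommendation system, where online A/B testing will be conducted.

Abstract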

In daily fantasy sports (DFS), match participation is highly time-sensitive. Users must act within a narrow window before a game begins, making match recommendation a time-critical task to prevent missed engagement and revenue loss. Existing recommender systems, typically designed for static item catalogs, are ill-equipped to handle the hard temporal deadlines inherent in these live events. To address this, we designed and deployed a recommendation engine using the Deep Interest Network (DIN) architecture. We adapt the DIN architecture by injecting temporality at two levels: first, through real-time urgency features for each candidate match (e.g., time-to-round-lock), and second, via temporal positional encodings that represent the time-gap between each historical interaction and the current recommendation request, allowing the model to dynamically weigh the recency of past actions. This approach, combined with a listwise neuralNDCG loss function, produces highly relevant and urgency-aware rankings. To support this at industrial scale, we developed a multi-node, multi-GPU training architecture on Ray and PyTorch. Our system, validated on a massive industrial dataset with over 650k users and over 100B interactions, achieves a +9% lift in nDCG@1 over a heavily optimized LightGBM baseline with handcrafted features. The strong offline performance of this model establishes its viability as a core component for our planned on-device (edge) recommendation system, where on-line A/B testing will be conducted.
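
The headline result is a lift in nDCG@1. As a reference for how that metric is computed, here is a minimal sketch of nDCG@k with linear gain (some formulations use 2^rel - 1 instead). It is an evaluation-metric illustration only, not the paper's neuralNDCG training loss, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items, with linear gain."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k):
    """DCG of the model's ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(relevances_in_ranked_order, k) / ideal
```

With binary relevance, nDCG@1 is 1.0 when the top-ranked match is one the user joined and 0.0 otherwise, so its average reflects how often the first recommendation is correct.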

Keywords: daily fantasy sports, recommendation engine, Deep Interest Network, temporal urgency, listwise neuralNDCG, on-device recommendation, industrial scale, time-sensitive ranking

19. ❌ From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Authors: Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14137v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

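Scoring rationale: The paper studies LLM evaluation methodology, specifically how to formalize users' informal "vibe-testing" into a systematic evaluation process. It is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10/10) because it directly addresses how LLMs are evaluated and proposes a proof-of-concept evaluation pipeline. It does not address the model architectures (e.g., MoE, SLMs), training techniques (e.g., pre-training, fine-tuning, alignment, PEFT), inference optimizations (e.g., RAG, attention mechanisms, quantization), advanced capabilities (e.g., reasoning, agents, tool use), or application domains (e.g., AI for science) that the other keywords target. Those keywords have no direct connection to the paper's focus on evaluation methodology, so each is rated 0.

!!! tip deepseek-chat TL;DR

The paper studies how to formalize "vibe-testing", the informal way users evaluate LLMs, and proposes a proof-of-concept pipeline that combines personalized prompts with user-aware evaluation. Experiments show that the approach can change which model is preferred, which helps bridge the gap between benchmark scores and real-world experience.

Abstract (Chinese translation)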

评估大型语言模型（LLM）具有挑战性，因为基准测试分数往往无法反映模型在现实世界中的实际效用。相反，用户常常依赖“感觉测试”（vibe-testing）：这是一种基于经验、非正式的评估方式，例如在与自身工作流程相关的编码任务上比较不同模型。尽管感觉测试普遍存在，但它通常过于临时和非结构化，难以进行大规模分析或复现。在本研究中，我们探讨了感觉测试在实践中的运作方式，并将其形式化以支持系统性分析。我们首先分析了两类实证资源：（1）一项关于用户评估实践的调查，以及（2）从博客和社交媒体收集的真实场景模型比较报告。基于这些资源，我们将感觉测试形式化为一个包含两部分的过程：用户既个性化地选择测试内容，也个性化地判断模型回复。随后，我们引入了一个概念验证评估流程，该流程遵循此形式化框架，通过生成个性化提示（prompts）并采用用户感知的主观标准来比较模型输出。在编码基准测试的实验中，我们发现结合个性化提示和用户感知评估可以改变模型的偏好排序，这反映了感觉测试在实践中的作用。这些结果表明，形式化的感觉测试可以作为一种有效方法，弥合基准测试分数与现实世界经验之间的差距。

摘要 (Abstract)

Evaluating LLMs is challenging, as benchmark scores often fail to capture models’ real-world usefulness. Instead, users often rely on “vibe-testing”: informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.

Keywords: LLM evaluation, vibe-testing, user evaluation practices, personalized prompts, subjective criteria, model comparison, benchmark scores, real-world usefulness

20. ❌ From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Authors: Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14142v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

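Scoring rationale: The paper studies how to strengthen LLM reasoning by optimizing the marginal distribution P(y) in the pre-train space, going beyond the limits of conventional RLVR. It is rated highly relevant (10/10) to "Large Language Models" because the work is built on LLMs; to "Pre-training" because PreRL applies reward-driven updates directly in the pre-train space; to "Chain of Thought" and "System 2 Thinking" because it targets multi-step and in-depth reasoning and reports a marked increase in reflection thoughts; and to "Self-Correction" because the NSR mechanism drives self-reflective behavior. Other keywords, such as MoE, SLMs, and RLHF, are not covered by the paper and are rated 0.

!!! tip deepseek-chat TL;DR

Conventional reinforcement learning for LLM reasoning is bounded by the base model's output distribution. The paper proposes PreRL, which optimizes the marginal distribution P(y) in the pre-train space, and finds that Negative Sample Reinforcement substantially improves reasoning. The resulting Dual Space RL method outperforms existing baselines in the reported experiments.

Abstract (Chinese translation)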

尽管带可验证奖励的强化学习（RLVR）通过优化条件分布P(y|x)显著增强了大型语言模型的推理能力，但其潜力从根本上受限于基础模型已有的输出分布。在预训练空间中优化边缘分布P(y)通过编码推理能力并保留广泛的探索容量，解决了这一瓶颈。然而，传统的预训练依赖静态语料库进行被动学习，导致分布偏移，从而阻碍了针对性的推理增强。本文提出预训练空间强化学习（PreRL），将奖励驱动的在线更新直接应用于P(y)。我们从理论和实证上验证了log P(y)与log P(y|x)之间的强梯度对齐，确立了PreRL作为标准强化学习的可行替代方案。此外，我们发现了一个关键机制：PreRL中的负样本强化（NSR）是推理能力的异常有效驱动力。NSR-PreRL能快速剪除错误推理空间，同时激发内生的反思行为，使转换思维和反思思维分别提升14.89倍和6.54倍。基于这些发现，我们提出双空间强化学习（DSRL）——一种策略重生策略：首先通过NSR-PreRL初始化模型以扩展推理边界，随后转向标准强化学习进行细粒度优化。大量实验表明，DSRL始终优于强基线方法，证明预训练空间剪枝能有效引导策略朝向精炼的正确推理子空间。

摘要 (Abstract)

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model’s existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.

Keywords: Reinforcement Learning, Pre-train Space, Reasoning Enhancement, Marginal Distribution, Negative Sample Reinforcement, Dual Space RL, Policy Reincarnation, LLM Reasoning
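
The idea of Negative Sample Reinforcement can be sketched as an objective that only pushes down the likelihood of samples judged incorrect. This is a simplified illustration rather than the paper's formulation: the function names are hypothetical, and reading "pre-train space" as scoring each sampled sequence y under P(y) without conditioning on the prompt is an assumption drawn from the abstract.

```python
def sequence_logprob(token_logprobs):
    """Sum of per-token log-probabilities, i.e. log P(y) for one sampled sequence."""
    return sum(token_logprobs)

def nsr_loss(samples):
    """Mean log-probability of incorrect samples.

    samples: list of (token_logprobs, reward) pairs with reward in {0, 1}.
    Minimising the returned value lowers the probability of wrong sequences;
    correct samples (reward == 1) contribute nothing in this sketch.
    """
    negatives = [sequence_logprob(lp) for lp, reward in samples if reward == 0]
    if not negatives:
        return 0.0
    return sum(negatives) / len(negatives)

# Toy rollouts with made-up log-probabilities and verifier rewards.
rollouts = [
    ([-0.5, -0.5], 1),     # correct: ignored
    ([-1.0, -2.0], 0),     # incorrect
    ([-0.25, -0.75], 0),   # incorrect
]
loss = nsr_loss(rollouts)
```

In a real training loop the log-probabilities would come from the policy and the loss would be differentiated with respect to its parameters; the sketch only shows which samples enter the objective.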

21. ❌ HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14125v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

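Scoring rationale: HiVLA targets robotic manipulation and proposes a visual-grounded hierarchical framework built around Vision-Language Models (VLMs) and a Diffusion Transformer (DiT) for robot control. Although VLMs are involved, every scoring keyword refers specifically to large language models (LLMs) and related techniques (e.g., MoE, Scaling Laws, RLHF, RAG), whereas the paper concerns a specific application of VLMs to robotic manipulation. It does not address core LLM techniques, training methods, reasoning techniques, alignment, efficiency optimization, or AI-for-science applications. All keywords are therefore judged unrelated and rated 0.

!!! tip deepseek-chat TL;DR

HiVLA addresses the problem that fine-tuning end-to-end vision-language-action models for robotic manipulation degrades the reasoning ability of the underlying VLM. It proposes a visual-grounded hierarchical framework that decouples high-level semantic planning from low-level motor control, and it significantly outperforms existing end-to-end baselines in simulation and real-world experiments.

Abstract (Chinese translation)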

尽管端到端的视觉-语言-动作（VLA）模型为机器人操作提供了一个前景广阔的范式，但在狭窄的控制数据上对其进行微调往往会损害其从基础视觉-语言模型（VLMs）继承的深度推理能力。为解决这一根本性权衡，我们提出了HiVLA，一个以视觉定位为中心的分层框架，它明确地将高层语义规划与低层运动控制解耦。在高层部分，一个VLM规划器首先执行任务分解和视觉定位，生成结构化计划，包括子任务指令和精确的目标边界框。随后，为了将此计划转化为物理动作，我们在低层部分引入了一个基于流匹配的扩散变换器（DiT）动作专家，该专家配备了一种新颖的级联交叉注意力机制。该设计依次融合全局上下文、高分辨率以目标为中心的图像裁剪块以及技能语义，使得DiT能够纯粹专注于鲁棒执行。我们的解耦架构保留了VLM的零样本推理能力，同时允许两个组件独立改进。在仿真和真实世界中进行的大量实验表明，HiVLA显著优于最先进的端到端基线方法，尤其在长时程技能组合以及杂乱场景中小物体的精细操作方面表现卓越。

摘要 (Abstract)

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM’s zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.

Keywords: Vision-Language-Action models, robotic manipulation, hierarchical framework, visual grounding, task decomposition, Diffusion Transformer, cascaded cross-attention, long-horizon skill composition

22. ❌ TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

Authors: Zerun Ma, Guoqiang Wang, Xinchen Xie, Yicheng Chen, He Du, Bowen Li, Yanan Sun, Wenran Liu, Kai Chen, Yining Li Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14116v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

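Scoring rationale: TREX is a multi-agent system that automates the full LLM fine-tuning life-cycle: requirement analysis, literature and data research, training strategy formulation, data preparation, and model training and evaluation. Relevant keywords: (1) "Large Language Models OR LLMs OR Foundation Models" (10/10): the paper focuses on automating LLM training, its core object of study; (2) "Post-training OR Supervised Fine-tuning OR SFT" (10/10): it specifically tackles the automation of LLM fine-tuning; (3) "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10/10): the system is built on AI research agents; (4) "Multi-agent Systems OR Agent Coordination" (10/10): the system is explicitly multi-agent, with Researcher and Executor modules collaborating. Other keywords, such as MoE, quantization, and inference acceleration, are not mentioned in the abstract and are rated 0.

!!! tip deepseek-chat TL;DR

The paper presents TREX, a multi-agent system that automates the whole LLM fine-tuning process through tree-based exploration. Experiments show that the system effectively optimizes model performance on target tasks.

Abstract (Chinese translation)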

尽管大语言模型（LLM）已使AI研究智能体能够执行独立的科学任务，但自动化复杂、真实世界的工作流程（例如LLM训练）仍然是一项重大挑战。本文介绍了TREX，一个能够自动化整个LLM训练生命周期的多智能体系统。该系统通过协调两个核心模块——研究者（Researcher）与执行器（Executor）——之间的协作，无缝执行需求分析、开放域文献与数据研究、训练策略制定、数据配方准备以及模型训练与评估。多轮实验过程被建模为一棵搜索树，使系统能够高效规划探索路径、复用历史结果，并从迭代试验中提炼高层见解。为评估自动化LLM训练的能力，我们构建了FT-Bench基准测试，该基准包含10项源自真实场景的任务，范围从优化基础模型能力到提升特定领域任务性能。实验结果表明，TREX智能体能够持续优化模型在目标任务上的性能。

摘要 (Abstract)

While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.

Keywords: LLM fine-tuning, multi-agent system, automated training, tree-based exploration, training lifecycle, AI research agents, FT-Bench benchmark, model optimization
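
Modelling a multi-round experimental process as a search tree can be sketched roughly as follows. The `TrialNode` structure, the `best_trial` helper, and all plans and scores are hypothetical placeholders chosen for illustration; they are not taken from TREX or FT-Bench.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrialNode:
    """One experiment in the search tree (hypothetical structure, not TREX's own)."""
    plan: str
    score: Optional[float] = None          # None until the trial has been evaluated
    children: List["TrialNode"] = field(default_factory=list)

    def expand(self, plan: str, score: Optional[float] = None) -> "TrialNode":
        """Attach a follow-up experiment that refines this node's plan."""
        child = TrialNode(plan, score)
        self.children.append(child)
        return child

def best_trial(root: TrialNode) -> Optional[TrialNode]:
    """Depth-first walk that returns the highest-scoring evaluated node."""
    best, stack = None, [root]
    while stack:
        node = stack.pop()
        if node.score is not None and (best is None or node.score > best.score):
            best = node
        stack.extend(node.children)
    return best

# Made-up plans and scores, purely illustrative.
root = TrialNode("baseline SFT", score=0.61)
a = root.expand("add synthetic math data", score=0.66)
b = root.expand("lower learning rate", score=0.58)
a.expand("filter low-quality samples", score=0.70)
```

Keeping every trial in the tree means earlier results stay available when deciding which branch to extend next, which is one way to read the abstract's "reuse historical results".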

23. ❌ UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Authors: Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14113v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

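Scoring rationale: The paper studies adaptive zoom-in for GUI grounding. Its core contribution is the UI-Zoomer framework, which uses uncertainty quantification to selectively trigger zoom-in and to size the zoom region. The work sits in computer vision and GUI understanding, covering model inference optimization, uncertainty estimation, and spatial localization, and is judged not to involve large language models, innovations in deep learning principles, or AI for Science. Because all keywords concern large-model techniques, scientific applications, or deep learning innovations, and the paper is treated as a pure computer vision application study, every keyword is rated 0.

!!! tip deepseek-chat TL;DR

Existing zoom-in methods for GUI grounding apply uniform cropping to every instance. The paper proposes UI-Zoomer, which uses uncertainty quantification to selectively trigger zoom-in and adaptively size the zoom region, achieving notable gains on several benchmarks without additional training.

Abstract (Chinese translation)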

GUI定位（grounding）任务旨在根据自然语言查询从屏幕截图中定位界面元素，对于小型图标和密集布局仍具挑战性。测试时放大方法通过裁剪图像并以更高分辨率重新进行推理来提升定位效果，但现有方法对所有实例采用统一的固定尺寸裁剪，忽略了模型对每个案例是否实际存在不确定性。我们提出UI-Zoomer，一种无需训练的自适应放大框架，将放大操作的触发与尺度均视为预测不确定性的量化问题。置信感知门控模块将随机候选框的空间共识与令牌级生成置信度融合，仅在定位不确定时有选择地触发放大。当触发时，不确定性驱动的裁剪尺寸模块将预测方差分解为样本间位置离散度和样本内边界框扩展度，通过全方差定律推导出每个实例的自适应裁剪半径。在ScreenSpot-Pro、UI-Vision和ScreenSpot-v2数据集上的大量实验表明，该方法在多种模型架构上均能持续超越强基线模型，分别实现最高+13.4%、+10.3%和+4.2%的性能提升，且无需额外训练。

摘要 (Abstract)

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.

Keywords: GUI grounding, uncertainty quantification, adaptive zoom-in, localization, test-time adaptation, spatial consensus, prediction variance, training-free framework
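
The variance decomposition mentioned in the abstract can be illustrated with the law of total variance: total variance equals the variance of the per-sample box centres (inter-sample spread) plus the mean within-box variance (intra-sample extent). The sketch below is one possible instantiation under stated assumptions: each box is treated as a uniform distribution (variance of side length squared over 12), and the multiplier `k` is a hypothetical tuning constant. Neither choice is confirmed by the abstract.

```python
import math

def crop_radius(boxes, k=3.0):
    """Return k * sqrt(total variance) for candidate boxes given as (x1, y1, x2, y2).

    Total variance = variance of box centres + mean variance inside each box,
    summed over the x and y axes.
    """
    n = len(boxes)
    cx = [(x1 + x2) / 2 for x1, _, x2, _ in boxes]
    cy = [(y1 + y2) / 2 for _, y1, _, y2 in boxes]
    mx, my = sum(cx) / n, sum(cy) / n
    # Inter-sample term: how far the sampled centres scatter around their mean.
    inter = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in zip(cx, cy)) / n
    # Intra-sample term: uniform-distribution variance over each box (assumption).
    intra = sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) / 12 for x1, y1, x2, y2 in boxes) / n
    return k * math.sqrt(inter + intra)

agreeing = [(0, 0, 12, 12)] * 3                      # all samples predict the same box
scattered = [(0, 0, 12, 12), (100, 0, 112, 12)]      # samples disagree on position
r_small = crop_radius(agreeing)
r_large = crop_radius(scattered)
```

When the sampled boxes agree, only the box extent contributes and the crop stays tight; disagreement between samples widens it.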

24. ❌ UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

Authors: Ziming Wang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14089v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

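Scoring rationale: UMI-3D focuses on hardware-software system design for robotic manipulation, in particular integrating a LiDAR sensor to address the weaknesses of visual SLAM and so improve data-collection robustness and policy performance. Its core content covers robot perception, sensor fusion, data acquisition, and embodied intelligence. It does not involve large language models (LLMs), deep learning principles, model training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, or applications of AI in science. All scoring keywords relate directly to large models or deep learning techniques, whereas the paper belongs to robotics and computer vision, so every keyword is judged unrelated and rated 0.

!!! tip deepseek-chat TL;DR

UMI-3D integrates a low-cost LiDAR sensor into the wrist-mounted interface to overcome the data-collection limits of the original UMI system, whose monocular visual SLAM is vulnerable to occlusions, dynamic scenes, and tracking failures. This improves the reliability of data collection and the resulting policy performance on robotic manipulation tasks.

Abstract (Chinese translation)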

我们提出UMI-3D，这是通用操作接口（UMI）的多模态扩展，旨在实现具身操作中鲁棒且可扩展的数据收集。尽管UMI支持便携式腕戴数据采集，但其对单目视觉SLAM的依赖使其容易受到遮挡、动态场景和跟踪失败的影响，限制了其在真实环境中的适用性。UMI-3D通过将轻量级低成本激光雷达传感器紧密集成到腕戴式接口中来解决这些局限，实现了以激光雷达为核心的SLAM，能够在挑战性条件下进行精确的度量尺度位姿估计。我们进一步开发了硬件同步的多模态感知流程和统一的时空标定框架，将视觉观测与激光雷达点云对齐，生成一致的任务演示三维表征。尽管保留了原始的二维视觉运动策略框架，UMI-3D显著提升了采集数据的质量与可靠性，这直接转化为策略性能的提升。大量真实世界实验表明，UMI-3D不仅在标准操作任务上实现了高成功率，还能学习对原始纯视觉UMI系统具有挑战性或不可行的任务，包括大尺度可变形物体操作和关节化物体操控。该系统支持从数据采集、对齐、训练到部署的端到端流程，同时保持了原始UMI的便携性与易用性。所有硬件与软件组件均已开源，以促进大规模数据收集并加速具身智能研究：https://umi-3d.github.io。

摘要 (Abstract)

We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.

Keywords: UMI-3D, embodied manipulation, LiDAR-centric SLAM, multimodal sensing, 3D spatial perception, data collection, visuomotor policy, open-source hardware

25. ❌ TIP: Token Importance in On-Policy Distillation

Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14084v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

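Scoring rationale: The paper focuses on knowledge distillation for large language models (LLMs), specifically the analysis of token importance in on-policy distillation. Its core contribution, TIP (Token Importance in on-Policy distillation), targets the LLM training process directly and is evaluated with Qwen3, Llama, and Qwen2.5 models, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10/10). The paper does not address the other keywords, such as MoE, SLMs, Scaling Laws, the various training methods (pre-training, fine-tuning, alignment), inference optimization, agents, quantization and compression, or AI-for-science applications, so these are rated 0.

!!! tip deepseek-chat TL;DR

The paper asks which token positions provide the most useful learning signal in on-policy knowledge distillation. It finds that the key signal comes from tokens with high student entropy and from tokens with low student entropy but high teacher-student divergence, where the student is overconfident and wrong. It proposes the TIP taxonomy and type-aware token selection rules, and shows across several LLMs that training on a small fraction of tokens can match or exceed full-token training.

Abstract (Chinese translation)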

在策略知识蒸馏（OPD）中，学生模型在其自身产生的序列上，接受来自教师模型的词元级监督进行训练。并非所有词元位置都同等重要，但现有对词元重要性的理解并不完整。我们提出一个直接的问题：在OPD中，哪些词元携带了最有用的学习信号？我们的答案是：信息丰富的词元来自两个区域：学生模型熵值高的位置，以及学生模型熵值低但师生模型分歧度高的位置——后者对应学生模型过度自信且预测错误的情况。
实证研究表明，学生模型的熵是一个强有力的一阶代理指标：基于熵的采样保留50%的词元进行训练，其效果达到或超过了使用全部词元的训练，同时将峰值内存降低了高达47%。但仅凭熵会遗漏第二个重要区域。当我们单独提取低熵、高分歧的词元时，使用少于全部词元10%的数据进行训练，其效果几乎与全词元基线持平。这表明，尽管这些过度自信的词元在仅考虑熵的规则下几乎不可见，但它们携带了密集的纠正信号。
我们通过TIP（在策略蒸馏中的词元重要性）框架来整合这些发现，这是一个基于学生模型熵和师生模型分歧度的二维分类法，并从理论上解释了为何熵是有用的但在结构上不完整。这一观点启发了结合不确定性与分歧度的、类型感知的词元选择规则。我们在涵盖Qwen3、Llama和Qwen2.5的三个师生模型对上，在MATH-500和AIME 2024/2025数据集上验证了这一观点，并在用于长视野智能体规划的DeepPlanning基准测试中进行了验证。在后者中，仅使用少于20%的词元进行Q3训练，其效果超越了全词元OPD。我们的实验通过扩展OPD代码库（https://github.com/HJSang/OPSD_OnPolicyDistillation）实现，该库支持在有限GPU预算下对更大模型进行内存高效的蒸馏。

摘要 (Abstract)

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher–student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher–student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher–student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.

Keywords: knowledge distillation, on-policy distillation, token importance, student entropy, teacher-student divergence, memory-efficient training, large language models, model compression
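
The two informative regions described in the abstract (high student entropy, and low student entropy with high teacher-student divergence) suggest a simple selection rule, sketched below. The thresholds, the function names, and the use of KL(teacher || student) as the divergence are assumptions for illustration; the abstract does not fix the divergence measure or its direction.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def select_tokens(student, teacher, h_thresh, d_thresh):
    """Return positions worth training on under the two-region view.

    student, teacher: per-position probability distributions over the same vocabulary.
    h_thresh, d_thresh: hypothetical cut-offs for "high" entropy and divergence.
    """
    keep = []
    for i, (s, t) in enumerate(zip(student, teacher)):
        h, d = entropy(s), kl(t, s)
        uncertain = h >= h_thresh                        # student is unsure
        overconfident_wrong = h < h_thresh and d >= d_thresh  # sure, but disagrees with teacher
        if uncertain or overconfident_wrong:
            keep.append(i)
    return keep

# Toy two-word vocabulary, three positions.
student = [[0.5, 0.5], [0.99, 0.01], [0.99, 0.01]]
teacher = [[0.5, 0.5], [0.99, 0.01], [0.01, 0.99]]
kept = select_tokens(student, teacher, h_thresh=0.5, d_thresh=1.0)
```

Position 1 is dropped because the student is confident and agrees with the teacher; position 2 has the same low entropy but is kept because of the large divergence, which an entropy-only rule would miss.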

26. ❌ First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs

Authors: Kavya Gupta, Nektarios Kalampalikis, Christoph Heitz, Isabel Valera Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14035v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

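Scoring rationale: The paper studies a multi-stakeholder framework for fair algorithmic decision-making, grounded in welfare economics and distributive justice, and focuses on performance-fairness trade-offs in the design of decision policies. It does not touch on large models, the technical principles of deep learning, or AI applications in science. Every keyword concerns large-model techniques, training methods, inference optimization, AI agents, or scientific AI applications, none of which relates to the paper's topic of fair algorithmic decision-making.

!!! tip deepseek-chat TL;DR

The paper proposes a multi-stakeholder framework grounded in welfare economics and distributive justice that recasts fair algorithmic decision-making as a trade-off optimization between decision-maker utility and social planner utility, and shows that under certain conditions stochastic policies offer better performance-fairness trade-offs.

Abstract translation

Fairness in algorithmic decision-making is usually defined in the prediction space, where predictive performance, serving as a proxy for decision-maker utility, is traded off against prediction-based fairness notions such as demographic parity or equality of opportunity. This perspective, however, overlooks how predictions translate into decisions, and ultimately into utility and welfare for decision-makers and decision subjects, as well as how that utility and welfare are distributed across socially salient groups.
Building on welfare economics and theories of distributive justice, this paper proposes a multi-stakeholder framework for fair algorithmic decision-making that explicitly models the utilities of decision-makers and decision subjects and defines fairness through a social planner's utility, which captures inequality in decision-subject utility across groups under different justice-based fairness notions (e.g., egalitarian, Rawlsian). We formulate fair decision-making as a post-hoc multi-objective optimization problem and characterize the achievable performance-fairness trade-offs in the two-dimensional utility space spanned by decision-maker utility and social planner utility, under different classes of decision policies (deterministic vs. stochastic, shared vs. group-specific). Using this framework, we further identify the conditions, expressed in terms of stakeholder utilities, under which stochastic policies outperform deterministic ones, and show empirically that simple stochastic policies can exploit outcome uncertainty to achieve better performance-fairness trade-offs. Overall, we argue for a shift from prediction-centric fairness to a transparent, justice-based, multi-stakeholder approach that supports the collaborative design of decision policies.

Abstract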

Fairness in algorithmic decision-making is often defined in the predictive space, where predictive performance - used as a proxy for decision-maker (DM) utility - is traded off against prediction-based fairness notions, such as demographic parity or equality of opportunity. This perspective, however, ignores how predictions translate into decisions and ultimately into utilities and welfare for both DM and decision subjects (DS), as well as their allocation across social-salient groups. In this paper, we propose a multi-stakeholder framework for fair algorithmic decision-making grounded in welfare economics and distributive justice, explicitly modeling the utilities of both the DM and DS, and defining fairness via a social planner’s utility that captures inequalities in DS utilities across groups under different justice-based fairness notions (e.g., Egalitarian, Rawlsian). We formulate fair decision-making as a post-hoc multi-objective optimization problem, characterizing the achievable performance-fairness trade-offs in the two-dimensional utility space of DM utility and the social planner’s utility, under different decision policy classes (deterministic vs. stochastic, shared vs. group-specific). Using the proposed framework, we then identify conditions (in terms of the stakeholders’ utilities) under which stochastic policies are more optimal than deterministic ones, and empirically demonstrate that simple stochastic policies can yield superior performance-fairness trade-offs by leveraging outcome uncertainty. Overall, we advocate a shift from prediction-centric fairness to a transparent, justice-based, multi-stakeholder approach that supports the collaborative design of decision-making policies.

Keywords: fair algorithmic decision-making, multi-stakeholder framework, welfare economics, distributive justice, performance-fairness trade-offs, stochastic policies, social planner utility, decision policy optimization
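
One way to see why randomizing between policies can widen the set of reachable trade-offs is a toy calculation. The sketch below is not the paper's model. The two deterministic policies `A` and `B`, their utility numbers, and the egalitarian-style `planner_utility` (negative gap between two groups' expected utilities) are all invented for illustration.

```python
def mix(p, a, b):
    """Expected per-group utilities when policy a is used with probability p, else b."""
    return tuple(p * x + (1 - p) * y for x, y in zip(a, b))

def planner_utility(group_utils):
    """Hypothetical egalitarian-style score: 0 is perfectly equal, lower is less equal."""
    return -abs(group_utils[0] - group_utils[1])

# Two made-up deterministic policies: DM utility and decision-subject utility per group.
A = {"dm": 10.0, "ds": (8.0, 2.0)}
B = {"dm": 6.0, "ds": (3.0, 5.0)}

p = 0.25  # probability of applying policy A
dm_mixed = p * A["dm"] + (1 - p) * B["dm"]
ds_mixed = mix(p, A["ds"], B["ds"])

print("A:", A["dm"], planner_utility(A["ds"]))
print("B:", B["dm"], planner_utility(B["ds"]))
print("mixed:", dm_mixed, planner_utility(ds_mixed))
```

With these numbers the mixed policy closes the gap in expected group utilities entirely while keeping more decision-maker utility than policy B, a point that neither deterministic policy reaches on its own.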

27. ❌ Large Language Models to Enhance Business Process Modeling: Past, Present, and Future Trends

Authors: João Bettencourt, Sérgio Guerreiro Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

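Scoring rationale: The paper is a literature review on applying LLMs to business process modeling, focused explicitly on the role of LLMs in converting text to BPMN models, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The abstract explicitly names 'Retrieval-Augmented Generation (RAG)' as a future research direction, so that keyword has some relevance (8 points). The paper does not cover other specific large-model techniques (e.g., MoE, quantization, attention mechanisms), training methods (e.g., pre-training, fine-tuning, alignment), inference techniques (e.g., chain of thought, speculative decoding), or applications in particular scientific fields (e.g., bioinformatics), so those keywords are scored 0.

!!! tip deepseek-chat TL;DR

This literature review analyzes the state of research on using large language models to automatically convert natural-language descriptions into business process models (BPMN), the evolution of methods, and the open challenges, and points to future research directions including retrieval-augmented generation.

Abstract translation

Recent advances in generative AI, particularly large language models (LLMs), have spurred growing interest in using natural language to automate or assist business process modeling tasks. Several approaches have been proposed for turning textual process descriptions into BPMN and related workflow models. However, it remains unclear how effectively these approaches support complex process modeling in organizational settings. This literature review surveys AI-driven methods for transforming natural language into BPMN process models, with a particular focus on the role LLMs play. Following a structured review strategy, we identify and analyze relevant studies to classify existing approaches, examine how LLMs are integrated into text-to-model pipelines, and investigate the practices used to evaluate generated models. The analysis reveals a clear shift from rule-based and traditional natural language processing (NLP) pipelines toward LLM-based architectures that rely on prompt engineering, intermediate representations, and iterative refinement mechanisms. Although these approaches significantly expand the capabilities of automated process model generation, the literature also reveals persistent challenges concerning semantic correctness, fragmented evaluation, reproducibility, and limited validation in real organizational settings. Based on these findings, the review identifies key research gaps and discusses promising directions for future work, including integrating contextual knowledge through retrieval-augmented generation (RAG), its integration with LLMs, the development of interactive modeling architectures, and the need for more comprehensive and standardized evaluation frameworks.

Abstract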

Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have stimulated growing interest in automating or assisting Business Process Modeling tasks using natural language. Several approaches have been proposed to transform textual process descriptions into BPMN and related workflow models. However, the extent to which these approaches effectively support complex process modeling in organizational settings remains unclear. This article presents a literature review of AI-driven methods for transforming natural language into BPMN process models, with a particular focus on the role of LLMs. Following a structured review strategy, relevant studies were identified and analyzed to classify existing approaches, examine how LLMs are integrated into text-to-model pipelines, and investigate the evaluation practices used to assess generated models. The analysis reveals a clear shift from rule-based and traditional NLP pipelines toward LLM-based architectures that rely on prompt engineering, intermediate representations, and iterative refinement mechanisms. While these approaches significantly expand the capabilities of automated process model generation, the literature also exposes persistent challenges related to semantic correctness, evaluation fragmentation, reproducibility, and limited validation in real-world organizational contexts. Based on these findings, this review identifies key research gaps and discusses promising directions for future research, including the integration of contextual knowledge through Retrieval-Augmented Generation (RAG), its integration with LLMs, the development of interactive modeling architectures, and the need for more comprehensive and standardized evaluation frameworks.

Keywords: Large Language Models, Business Process Modeling, BPMN, Text-to-Model, Literature Review, Retrieval-Augmented Generation, Generative AI, Workflow Models

28. ❌ Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation

Authors: Gitesh Malik Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14032v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

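Scoring rationale: The paper focuses on a reinforcement-learning safety framework for power-system control and does not involve large language models, the technical principles of deep learning, or any specific technique among the scoring keywords (e.g., MoE, SFT, RAG). The only relevance is to the 'AI for Science' keyword, because the paper applies AI (reinforcement learning) to power systems (a science/engineering domain), but this is not its core content, hence 5 points (some relevance). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a hierarchical control framework for power-system operation that combines reinforcement learning with runtime safety shielding to address safety constraints and generalization, and demonstrates better performance than conventional methods, along with zero-shot generalization, on benchmarks.

Abstract translation

Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures are unacceptable, and learning-based controllers must operate under hard physical constraints.
This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters out unsafe actions through fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or the training data distribution.
The proposed framework is evaluated on the Grid2Op benchmark suite under nominal operating conditions, forced line-outage stress tests, and zero-shot deployment, without retraining, on the ICAPS 2021 large-scale transmission grid. The results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical, safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids.
These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, offering a practical path toward deployable learning-based controllers for real-world energy systems.

Abstract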

Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints. This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution. The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids. These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.

Keywords: reinforcement learning, safety shielding, power grid operation, hierarchical control, runtime safety, zero-shot generalization, Grid2Op benchmark
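
The division of labor described in the abstract, where a learned policy proposes actions and a deterministic shield filters them by forward simulation, might look roughly like the sketch below. It does not use the Grid2Op API. The action names, the `PREDICTED` lookup standing in for a forward simulator, the loading limit `rho_limit`, and the `do_nothing` fallback are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

def shield(candidates: Sequence[str],
           simulate: Callable[[str], List[float]],
           rho_limit: float = 1.0,
           fallback: str = "do_nothing") -> str:
    """Return the first candidate whose simulated line loadings all stay within rho_limit.

    `candidates` is ordered by policy preference, best first. If every candidate
    is predicted to overload some line, the fallback action is returned instead.
    """
    for action in candidates:
        if max(simulate(action)) <= rho_limit:
            return action
    return fallback

# Toy forward model: predicted per-line loading ratios after each (made-up) action.
PREDICTED: Dict[str, List[float]] = {
    "reconnect_line_3": [0.70, 1.15, 0.60],  # overloads the second line
    "split_bus_7": [0.85, 0.92, 0.75],
    "do_nothing": [0.95, 0.99, 0.80],
}

chosen = shield(["reconnect_line_3", "split_bus_7"], PREDICTED.__getitem__)
blocked = shield(["reconnect_line_3"], PREDICTED.__getitem__)
print(chosen, blocked)
```

Because the check runs on every proposed action at execution time, the guarantee does not depend on how well the policy was trained, which is the sense in which the abstract calls safety a runtime invariant.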

29. ❌ Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

Authors: Weijie Wang, Qihang Cao, Sensen Gao, Donny Y. Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, Jianfei Cai, Jia-Wang Bian, Bohan Zhuang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14025v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	5.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

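Scoring rationale: The paper is a survey of feed-forward 3D scene modeling in computer vision, concerned mainly with model design, architectural patterns, and practical applications of 3D reconstruction. Its content is entirely unrelated to the vast majority of keywords (large language models, training techniques, inference optimization, alignment, agents, and so on). The only slightly relevant keyword is 'World Models AND General World Models', because the paper mentions 'world modeling' as one of its future challenges; this is not the paper's core content, only a one-sentence mention among future directions, hence 5 points (some relevance). None of the other keywords appears in the title or abstract or bears directly on the paper's topic.

!!! tip deepseek-chat TL;DR

The paper proposes a new taxonomy of feed-forward 3D scene modeling based on model design strategies, organizing research into five key problems (feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware models), and surveys related benchmarks, datasets, and practical applications.

Abstract translation

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics and a cornerstone for understanding and interacting with the physical world. Although traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders practical deployment and scalability. As a result, generalizable feed-forward 3D reconstruction has developed rapidly in recent years. By learning a model that maps images directly to 3D representations, these methods achieve efficient reconstruction in a single forward pass with strong cross-scene generalization. This survey builds on a key observation: although existing feed-forward methods differ in their geometric output representations (from implicit fields to explicit primitives), they share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. We therefore abstract away these representation differences and focus on model design, proposing a novel taxonomy centered on model design strategies that do not depend on the output format. The taxonomy groups research directions into five key problems driving recent work: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware models. To support the taxonomy with empirical grounding and standardized evaluation, we further review relevant benchmarks and datasets comprehensively, and extensively discuss and categorize real-world applications built on feed-forward 3D models. Finally, we outline future research directions to address open challenges such as scalability, evaluation standards, and world modeling.

Abstract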

Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. Hence, generalizable feed-forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real-world applications based on feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.

Keywords: feed-forward 3D reconstruction, 3D scene modeling, model design taxonomy, geometry-aware design, multi-view fusion, feature enhancement, model efficiency, world modeling

30. ❌ Towards Multi-Object-Tracking with Radar on a Fast Moving Vehicle: On the Potential of Processing Radar in the Frequency Domain

Authors: Tim Hansen, Arturo Gomez-Chavez, Ilya Shimchik, Andreas Birk Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14013v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

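Scoring rationale: The paper studies radar data processing and multi-object tracking for autonomous driving, focusing on frequency-domain processing, radar odometry, and noise robustness. All keywords concern large models, deep learning, scientific applications of AI, or related technical principles, none of which the paper addresses, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies frequency-domain processing of radar data on a fast-moving vehicle as a route to multi-object tracking, and demonstrates the robustness of radar odometry using the FS2D method on the Boreas dataset.

Abstract translation

This paper advocates processing radar data in the frequency domain to achieve higher robustness against noise and structural errors, especially compared with feature-based methods. This advantage also holds in highly dynamic scenes, where the ego-motion of the sensor-carrying vehicle coexists with an unknown number of other moving objects. Beyond high robustness, frequency-domain processing has a so far overlooked benefit: the correlation-based methods it relies on (for example, those used for registration) provide information about all moving structures in the scene. A typical automotive use case is the overtaking maneuver, which this paper illustrates with overtaking in autonomous racing. We present initial experiments and results with Fourier SOFT in 2D (FS2D), using the Boreas dataset to demonstrate radar-only odometry (that is, radar odometry without multi-sensor fusion) in support of our arguments.

Abstract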

We promote in this paper the processing of radar data in the frequency domain to achieve higher robustness against noise and structural errors, especially in comparison to feature-based methods. This holds also for high dynamics in the scene, i.e., ego-motion of the vehicle with the sensor plus the presence of an unknown number of other moving objects. In addition to the high robustness, the processing in the frequency domain has the so far neglected advantage that the underlying correlation based methods used for, e.g., registration, provide information about all moving structures in the scene. A typical automotive application case is overtaking maneuvers, which in the context of autonomous racing are used here as a motivating example. Initial experiments and results with Fourier SOFT in 2D (FS2D) are presented that use the Boreas dataset to demonstrate radar-only-odometry, i.e., radar-odometry without sensor-fusion, to support our arguments.

Keywords: radar data processing, frequency domain, multi-object tracking, autonomous racing, radar odometry, Boreas dataset, FS2D, robustness
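
To give a feel for correlation-based registration in the frequency domain, here is a minimal sketch of textbook phase correlation on a 1D signal. It is not FS2D, which works on 2D radar scans and also handles rotation. The function names and the toy signal are assumptions made for illustration, and a naive O(n²) DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def phase_correlation_shift(a, b, eps=1e-12):
    """Estimate the circular shift s such that b[t] == a[(t - s) % n]."""
    A, B = dft(a), dft(b)
    cross = []
    for fa, fb in zip(A, B):
        c = fb * fa.conjugate()            # cross-power spectrum
        cross.append(c / max(abs(c), eps))  # keep only the phase
    corr = [v.real for v in dft(cross, inverse=True)]
    # The correlation surface peaks at the shift between the two signals.
    return max(range(len(corr)), key=corr.__getitem__)

a = [0, 1, 3, 2, 0, 0, 1, 0]
b = a[-3:] + a[:-3]  # a circularly shifted right by 3 samples
shift = phase_correlation_shift(a, b)
print(shift)
```

When a scene contains several structures moving differently, the correlation surface shows several peaks rather than one, which is the property the abstract points to as a source of information about all moving structures.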

31. ❌ MAny: Merge Anything for Multimodal Continual Instruction Tuning

Authors: Zijian Gao, Wangwang Jia, Xingxing Zhang, Pengfei Qian, Tao Sun, Bo Ding, Yong Dou, Huaimin Wang, Kele Xu Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	10.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

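Scoring rationale: The paper's core topic is catastrophic forgetting in continual instruction tuning of multimodal large language models (MLLMs), and it proposes MAny, a model merging framework that needs no additional training and relies on algebraic operations. It is therefore highly relevant to 'Large Language Models', 'Instruction Tuning', 'PEFT' (it operates in low-rank parameter space), and 'Model Merging' (10 points). It has some relevance to 'Pre-training' and 'Post-training/SFT' (8 points), because the paper's setting involves initial tuning and continual task adaptation. Other keywords, such as MoE, SLMs, RAG, reasoning methods, agents, and compression, are neither mentioned in the abstract nor directly relevant, so they are scored 0.

!!! tip deepseek-chat TL;DR

Targeting the dual-forgetting problem in continual instruction tuning of multimodal large language models, the paper proposes MAny, a training-free framework that fuses task-specific knowledge through cross-modal projection merging and low-rank parameter merging, significantly improving performance and reducing forgetting on multiple benchmarks.

Abstract translation

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely constrained by catastrophic forgetting. Whereas existing work focuses mainly on the reasoning language backbone, this work reveals a critical but overlooked dual-forgetting phenomenon: perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space. To address it, we propose MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM adaptively merges cross-modal visual representations under visual-prototype guidance to restore perceptual alignment, ensuring accurate feature recovery during inference. Meanwhile, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny is a training-free paradigm that achieves knowledge merging through efficient CPU-based algebraic operations, with no additional gradient-based optimization beyond initial tuning. Our extensive evaluation confirms MAny's superior performance and robustness across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny leads state-of-the-art methods in final average accuracy by up to 8.57% and 2.85% on two different MLLMs, respectively.

Abstract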

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.

Keywords: Multimodal Large Language Models, Continual Instruction Tuning, Catastrophic Forgetting, Model Merging, Cross-modal Projection, Low-rank Parameters, Training-free Paradigm, Knowledge Fusion
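
The abstract credits recursive least squares (RLS) with giving LPM a closed-form, gradient-free update. The abstract does not say how RLS is applied to low-rank weight matrices, so the sketch below shows only the generic textbook RLS recursion on a small regression problem. The function name `rls_fit`, the regularizer `lam`, and the data are assumptions for illustration. The point it illustrates is that folding in one observation at a time approaches the same solution as a single batch least-squares solve, without any gradient steps.

```python
def rls_fit(xs, ys, lam=1e-6):
    """Textbook recursive least squares for y ~ w . x.

    Starting from w = 0 and P = I / lam, the result equals ridge regression
    with penalty lam on the same data, computed one sample at a time.
    """
    d = len(xs[0])
    w = [0.0] * d
    P = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    for x, y in zip(xs, ys):
        Px = [sum(P[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(d))
        gain = [v / denom for v in Px]                    # K = P x / (1 + x^T P x)
        err = y - sum(w[i] * x[i] for i in range(d))      # prediction error
        w = [w[i] + gain[i] * err for i in range(d)]      # w <- w + K * err
        # P <- P - K x^T P (P stays symmetric, so x^T P equals Px transposed)
        P = [[P[i][j] - gain[i] * Px[j] for j in range(d)] for i in range(d)]
    return w

# Data generated exactly by y = 2*x0 - 1*x1.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, -1.0, 1.0, 3.0]
w = rls_fit(xs, ys)
print(w)
```

Each update costs only a few small matrix-vector products, which is consistent with the abstract's remark that merging can run as CPU-based algebra.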

32. ❌ Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents

Authors: Kangsan Kim, Minki Kang, Taeil Kim, Yanlai Yang, Mengye Ren, Sung Ju Hwang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

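Scoring rationale: The paper studies memory transfer learning in coding agents, centered on using a unified memory pool across heterogeneous task domains to improve performance. Although it involves AI agents and memory mechanisms, the paper does not explicitly mention or focus on large language models (LLMs), innovations in the technical principles of deep learning, or specific techniques from the keyword list (e.g., MoE, SFT, RAG, CoT). No keyword is directly related to the paper's content, so every relevance score is 0.

!!! tip deepseek-chat TL;DR

The paper studies memory transfer learning across heterogeneous domains in coding agents and finds that transferring meta-knowledge (such as validation routines) through a unified memory pool improves average performance by 3.7%; highly abstract memories generalize better, whereas concrete traces can cause negative transfer.

Abstract translation

Memory-based self-evolution has emerged as a promising research paradigm for coding agents. However, existing approaches typically confine memory use to a single task domain and fail to exploit the infrastructure shared across diverse real-world coding problems, such as runtime environments and programming languages. To address this limitation, we study Memory Transfer Learning (MTL) by drawing on a unified memory pool from heterogeneous domains. We evaluate performance on 6 coding benchmarks using four memory representations, ranging from concrete execution traces to abstract insights. Experiments show that cross-domain memory improves average performance by 3.7%, mainly through the transfer of meta-knowledge (such as validation routines) rather than task-specific code. Importantly, we find that the level of abstraction determines transferability: high-level insights generalize well, whereas low-level traces often cause negative transfer because they are overly specific. We also show that transfer effectiveness improves as the memory pool grows, and that memories can even be transferred between different models. Our work establishes empirical design principles for extending memory use beyond a single domain. Project page: https://memorytransfer.github.io/

Abstract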

Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate Memory Transfer Learning (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. Project page: https://memorytransfer.github.io/

Keywords: Memory Transfer Learning, Coding Agents, Heterogeneous Domains, Memory Pool, Meta-knowledge Transfer, Abstraction, Cross-domain Performance, Negative Transfer

33. ❌ Diffusion Language Models for Speech Recognition

Authors: Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14001v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

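Scoring rationale: The paper studies the use of diffusion language models in speech recognition, including rescoring ASR hypotheses with masked diffusion language models and uniform-state diffusion models, and a joint-decoding method that combines CTC and USDM. The scoring keywords all concern the principles of large models and deep learning techniques, but the paper focuses on diffusion language models as one specific type of language model and does not cover any of the specific techniques listed in the keywords (such as LLMs, MoE, SFT, RLHF, RAG, CoT, or quantization), nor application areas such as AI for Science. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

The paper explores diffusion language models for speech recognition. It describes how to use masked diffusion language models and uniform-state diffusion models to rescore ASR hypotheses and designs a joint-decoding method that combines CTC and USDM, reporting significantly improved recognition accuracy.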

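Abstract (translated)

Diffusion language models, with their bidirectional attention and parallel text generation, have recently become a prominent alternative to standard language models. This work explores variants of these models for use in speech recognition. Specifically, we present a complete guide to using masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) to rescore automatic speech recognition (ASR) hypotheses. We also design a new joint-decoding method that fuses CTC and USDM: at each decoding step it combines the frame-level probability distributions from CTC with the label-level probability distributions computed by USDM, generating new candidate sequences that draw on both the strong language knowledge of USDM and the acoustic information of CTC. Experimental results show that both USDM and MDLM can significantly improve the accuracy of the recognized text. We release all our code and recipes.

Abstract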

Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.

Keywords: Diffusion Language Models, Speech Recognition, ASR Rescoring, Masked Diffusion Language Models, Uniform-State Diffusion Models, Joint Decoding, CTC, Acoustic Information
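
The abstract does not spell out the rescoring formula, so the following is only a minimal sketch of generic N-best rescoring by log-linear score interpolation, not the authors' recipe. The `Hypothesis` class, the `lm_weight` parameter, and the example scores are all hypothetical; how an MDLM or USDM would actually produce `lm_logp` is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    acoustic_logp: float  # log-score from the first-pass ASR model (e.g. CTC); made-up values below
    lm_logp: float        # log-score from an external language model; placeholder for a diffusion LM


def rescore(hyps, lm_weight=0.5):
    """Re-rank an N-best list by acoustic score plus weighted LM score (best first)."""
    def combined(h):
        return h.acoustic_logp + lm_weight * h.lm_logp
    return sorted(hyps, key=combined, reverse=True)


# Toy N-best list: the acoustically preferred hypothesis is linguistically unlikely.
nbest = [
    Hypothesis("recognize speech", acoustic_logp=-4.0, lm_logp=-6.0),
    Hypothesis("wreck a nice beach", acoustic_logp=-3.5, lm_logp=-12.0),
]
best = rescore(nbest, lm_weight=0.5)[0]
print(best.text)
```

With `lm_weight=0.0` the ranking falls back to the acoustic score alone, which makes the weight a convenient knob to tune on held-out data.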

34. ❌ Reward Design for Physical Reasoning in Vision-Language Models

Authors: Derek Lilienthal, Manisha Mukherjee, Sameera Horawalavithana Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13993v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

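Scoring rationale: The paper studies how reward design affects GRPO-based training of vision-language models (VLMs) for physical reasoning. Its core involves post-training methods such as supervised fine-tuning (SFT) and GRPO, so it is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points). It studies physical reasoning problems, which fall under "AI for Science" (10 points). It involves multi-step and in-depth reasoning, giving some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each). It uses attention weights to analyze model behavior, which relates to "Mechanistic Interpretability" (5 points). Because it focuses on VLMs rather than pure LLMs, its relevance to "Large Language Models" is moderate (5 points). Other keywords, such as MoE, quantization, and RAG, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how reward design affects GRPO-trained vision-language models on physical reasoning tasks. It finds that different reward signals induce domain-specific reasoning behaviors: accuracy-based rewards give the strongest overall gains, while an attention-based internal reward substantially improves spatial relation accuracy without requiring spatial annotations.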

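Abstract (translated)

Physical reasoning over visual inputs requires tight integration of visual perception, domain knowledge, and multi-step symbolic reasoning. Yet even state-of-the-art vision-language models (VLMs) still fall far short of humans on physics benchmarks. Although post-training algorithms such as supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) have shown significant reasoning gains in language models, how reward design affects the physical reasoning behavior of VLMs remains poorly understood. This paper presents a systematic reward ablation study of GRPO-based VLM training for physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from the model's attention weights over input image regions. We evaluate with the IBM Granite Vision 3.3 (2B) model on the PhyX benchmark, which contains 3,000 problems covering six physics domains and six reasoning types in multiple-choice and open-ended formats. Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, but the size of the gain varies substantially by reward type and domain. Reward design does not improve performance uniformly; instead, it induces domain-specific reasoning behaviors: accuracy-based rewards bring the strongest overall gains; rubric rewards improve the quality of structured reasoning without consistently raising accuracy; attention-based rewards strengthen spatial reasoning but degrade performance in symbolic reasoning domains. Our proposed internal attention-weight reward needs no spatial annotations and raises spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.

Abstract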

Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.

Keywords: Reward Design, Physical Reasoning, Vision-Language Models, GRPO, Supervised Fine-Tuning, Attention Weights, PhyX Benchmark, Spatial Reasoning
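
As a rough illustration of the group-relative idea behind GRPO and of a composite rubric reward like the one listed in the abstract, the sketch below normalizes each sampled response's reward against its own group. The equal rubric weighting, the function names, and the use of a population standard deviation are assumptions made for this example, not details taken from the paper.

```python
import statistics


def rubric_reward(answer_correct: bool, principle_identified: bool, units_consistent: bool) -> float:
    """Hypothetical composite reward: equal-weight average of three rubric checks."""
    checks = [answer_correct, principle_identified, units_consistent]
    return sum(checks) / len(checks)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one physics question, scored by the rubric.
rewards = [
    rubric_reward(True, True, True),
    rubric_reward(False, True, True),
    rubric_reward(False, False, True),
    rubric_reward(False, False, False),
]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the others a negative one; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.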

35. ❌ Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs

Authors: Hussein Abdallah, Ibrahim Abdelaziz, Panos Kalnis, Essam Mansour Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13979v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

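Scoring rationale: The paper's core is the integration of LLMs with GNNs, so it is highly relevant to "Large Language Models" (10 points). It involves the LLM's reasoning process, giving some relevance to "Chain of Thought" and "System 2 Thinking" (5 points each), though this is not its main focus. Other keywords, covering MoE, SLMs, training techniques, optimization methods, and specific application areas, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes GLOW, a hybrid system that integrates a pre-trained graph neural network with a large language model to tackle open-world question answering over knowledge graphs, and reports that it significantly outperforms existing methods on standard benchmarks and on the new GLOW-BENCH benchmark.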

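Abstract (translated)

Open-world question answering over knowledge graphs aims to answer questions over incomplete or dynamically evolving knowledge graphs. Traditional knowledge graph question answering rests on a closed-world assumption that requires answers to exist in the graph, which limits its practical applicability. In contrast, open-world question answering must infer missing knowledge from graph structure and context. Large language models excel at language understanding but lack structured reasoning ability; graph neural networks can model graph topology but struggle with semantic interpretation. Existing systems typically combine large language models with graph neural networks or graph retrievers. Some support open-world question answering but rely only on structural embeddings without semantic grounding; most assume observed paths or complete graphs, which makes them unreliable under missing links or multi-hop reasoning. This paper proposes GLOW, a hybrid system that combines a pre-trained graph neural network with a large language model for open-world knowledge graph question answering. The graph neural network first predicts the top-k candidate answers from the graph structure; these candidates and the relevant knowledge graph facts (such as triples) are then serialized into a structured prompt that guides the large language model's reasoning. The method enables joint reasoning over symbolic and semantic signals without relying on retrieval or fine-tuning. To evaluate generalization, we build the GLOW-BENCH benchmark, which contains 1,000 questions over incomplete knowledge graphs across diverse domains. Experiments show that GLOW outperforms existing LLM-GNN systems on both standard benchmarks and GLOW-BENCH, with improvements of up to 53.3% and 38% on average. The code and data are available on GitHub.

Abstract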

Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM’s reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM-GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement. GitHub code and data are available.

Keywords: Large Language Models, Graph Neural Networks, Open-world Question Answering, Knowledge Graphs, Hybrid System, Reasoning, Benchmark, GLOW
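
The step in which candidates and triples are "serialized into a structured prompt" could look roughly like the sketch below. The prompt layout, the `build_prompt` function, and the toy question, candidate scores, and triples are all hypothetical; the actual GLOW prompt format is not given in the abstract.

```python
def build_prompt(question, candidates, triples):
    """Serialize GNN candidates and KG triples into one text prompt (hypothetical layout)."""
    lines = [f"Question: {question}", "", "Candidate answers (ranked by the GNN):"]
    for rank, (entity, score) in enumerate(candidates, start=1):
        lines.append(f"{rank}. {entity} (score {score:.2f})")
    lines += ["", "Related knowledge-graph facts:"]
    for head, relation, tail in triples:
        lines.append(f"({head}, {relation}, {tail})")
    lines += ["", "Pick the best answer and explain briefly."]
    return "\n".join(lines)


# Toy inputs standing in for GNN output and retrieved KG facts.
prompt = build_prompt(
    "Where was Marie Curie born?",
    candidates=[("Warsaw", 0.81), ("Paris", 0.12)],
    triples=[("Marie Curie", "nationality", "Poland"), ("Warsaw", "capital_of", "Poland")],
)
print(prompt)
```

Keeping the candidates ranked and the facts in explicit triple form gives the language model both the structural signal (the ranking) and the semantic context (the facts) in a single input.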

36. ❌ How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Authors: Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13977v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

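Scoring rationale: The paper's core is how to synthesize high-quality pretraining data through prompt design, generator model choice, and source data selection. This directly involves large language models (LLMs) and pre-training, so those two keywords receive 10 points each. The paper also examines how data quality affects model performance, giving some relevance to "Scaling Laws AND Data Quality" (5 points). Other keywords, such as MoE, SFT, RLHF, and RAG, are not covered in the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper systematically studies how to synthesize high-quality pretraining data by optimizing prompt design, generator model, and source data selection. It finds that structured output formats (such as tables and math problems) outperform existing methods, and it releases the FinePhrase dataset, which improves performance while cutting generation cost.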

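Abstract (translated)

Synthetic data is a standard component of large language model training, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, are still lacking. We run large-scale controlled experiments, generating more than one trillion tokens, to identify the key factors in rephrasing web text into synthetic pretraining data. Our results show that structured output formats (such as tables, math problems, FAQs, and tutorials) consistently outperform curated web baselines and prior synthetic methods. Notably, scaling the generator model beyond 1 billion parameters brings no additional benefit. Our analysis also shows that the choice of original data used for mixing has a substantial effect on performance. Building on these findings, we develop FinePhrase, an open dataset of 486 billion tokens of rephrased web text. We show that FinePhrase performs best among all existing synthetic data baselines while reducing generation cost by up to 30 times. We release the dataset, all prompts, and the generation framework to the research community.

Abstract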

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.

Keywords: synthetic data, pretraining data, large language models, prompt design, generator model, source data, FinePhrase dataset, rephrasing web text

37. ❌ [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

Authors: You Rim Choi, Subeom Park, Hyung-Sin Kim Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13959v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

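Scoring rationale: The paper proposes a bio-inspired, sensor-first architecture (ATI) for physical AI, focusing on sensor control, adaptive sensing, and edge-cloud collaborative inference in embedded AI systems. It has no direct connection to most keywords (LLMs, MoE, training methods, inference acceleration, and so on) and is only weakly related to two: (1) "Small Language Models OR SLMs OR On-device AI" (5 points): the paper stresses running time-critical sensing and control on device, which matches the edge AI / on-device AI concept; (2) "System 2 Thinking OR Slow Thinking OR In-depth Reasoning" (5 points): the ATI architecture includes an L3/L4 subsystem that supports deep reasoning and involves higher-level cognitive functions. No other keyword is covered.

!!! tip deepseek-chat TL;DR

Addressing the latency, energy, and reliability constraints of physical AI, the paper proposes a bio-inspired, sensor-first architecture (ATI) that co-optimizes sensing and inference through a layered, modular design. In a mobile camera prototype, it raises end-to-end accuracy from 53.8% to 88% while reducing remote inference calls by 43.3%.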

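Abstract (translated)

As artificial intelligence moves from data centers to robots and wearables, scaling up models alone is no longer enough. Physical AI operates under strict latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We propose the bio-inspired Artificial Tripartite Intelligence (ATI) architecture, a sensor-first architectural contract for physical AI. The system is tripartite at the architectural level: the Brainstem layer (L1) handles reflexive safety and signal-integrity control, the Cerebellum layer (L2) performs continuous sensor calibration, and the Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular design lets sensor control, adaptive sensing, edge-cloud collaborative computation, and foundation model reasoning co-evolve within a single closed-loop architecture, while keeping time-sensitive sensing and control on device and invoking higher-level inference only when needed. We implement the architecture in a mobile camera prototype under dynamic lighting and motion. In the routed evaluation, compared with the default auto-exposure setting, ATI's adaptive sensing raises end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. These results demonstrate the value of co-designing sensing and inference systems for embodied AI.

Abstract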

As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the default auto-exposure setting, ATI (L1/L2 adaptive sensing) improves end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. These results show the value of co-designing sensing and inference for embodied AI.

Keywords: Physical AI, Sensor-first Architecture, Bio-inspired Design, Adaptive Sensing, Edge-Cloud Inference, Closed-loop System, Embodied AI, Tripartite Intelligence

38. ❌ Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation

Authors: Zoe De Simone, Angie Boggust, Fredo Durand, Ashia Wilson, Arvind Satyanarayan Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13956v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

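Scoring rationale: The paper studies interaction design for text-to-image (T2I) generation systems and proposes a multi-stage, progressive image generation approach (the Creo system) to strengthen user control, creativity, and output diversity. Its core contribution lies in human-computer interaction (HCI) and user experience design for generative AI systems, not in innovations in large models or deep learning techniques themselves. The scoring keywords all focus on the principles of large-model techniques (architecture, training, inference, alignment, application paradigms, and so on) or on applications in specific scientific fields, whereas this paper covers none of these techniques and does not apply large models to a scientific field. Every keyword therefore receives a relevance score of 0.

!!! tip deepseek-chat TL;DR

To address weak user control and constrained creativity in existing text-to-image (T2I) systems, the paper proposes Creo, a multi-stage progressive image generation system. By introducing sketch-like intermediate representations, decision locking, and diff-based updates, it gives users step-wise, fine-grained control, which strengthens user agency, creativity, and output diversity.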

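Abstract (translated)

Text-to-image (T2I) systems can rapidly generate high-fidelity images, but their generation process is misaligned with how visual ideas actually develop. T2I outputs make implicit visual decisions on the user's behalf and often introduce fine-grained details too early, which can anchor users' thinking prematurely and limit their ability to keep options open early in ideation; they also tend to cause unintended changes during editing that are hard to correct and weaken users' sense of control. To address these problems, we present Creo, a multi-stage T2I system that scaffolds image generation by progressing from rough sketches to high-resolution outputs and exposes intermediate abstractions where users can make incremental changes. Because sketch-like abstractions are provisional, they invite users to edit and help them keep design options open while ideas are still forming. Every stage in Creo supports manual changes and AI-assisted operations, and a locking mechanism provides fine-grained, step-wise control: it preserves earlier decisions so that later edits affect only the specified regions or attributes. Users stay in the loop, making and verifying decisions at each stage, while the system applies diffs rather than regenerating full images, which reduces drift as fidelity increases. A comparative study against a one-shot baseline shows that participants felt stronger ownership of Creo outputs because they could trace their own decisions in building up the image. In addition, embedding-based analysis indicates that Creo outputs are more heterogeneous than one-shot results. These findings suggest that multi-stage generation, combined with intermediate control and decision locking, is a key design principle for improving controllability, user agency, creativity, and output diversity in generative systems.

Abstract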

Text-to-image (T2I) systems enable rapid generation of high-fidelity imagery but are misaligned with how visual ideas develop. T2I systems generate outputs that make implicit visual decisions on behalf of the user, often introduce fine-grained details that can anchor users prematurely and limit their ability to keep options open early on, and cause unintended changes during editing that are difficult to correct and reduce users’ sense of control. To address these concerns, we present Creo, a multi-stage T2I system that scaffolds image generation by progressing from rough sketches to high-resolution outputs, exposing intermediary abstractions where users can make incremental changes. Sketch-like abstractions invite user editing and allow users to keep design options open when ideas are still forming due to their provisional nature. Each stage in Creo can be modified with manual changes and AI-assisted operations, enabling fine-grained, step-wise control through a locking mechanism that preserves prior decisions so subsequent edits affect only specified regions or attributes. Users remain in the loop, making and verifying decisions across stages, while the system applies diffs instead of regenerating full images, reducing drift as fidelity increases. A comparative study with a one-shot baseline shows that participants felt stronger ownership over Creo outputs, as they were able to trace their decisions in building up the image. Furthermore, embedding-based analysis indicates that Creo outputs are less homogeneous than one-shot results. These findings suggest that multi-stage generation, combined with intermediate control and decision locking, is a key design principle for improving controllability, user agency, creativity, and output diversity in generative systems.

Keywords: text-to-image generation, multi-stage generation, user control, progressive ideation, interactive AI, creative process, human-AI collaboration, generative systems

39. ❌ HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark

Authors: Jiacheng Wang, Jinchang Hou, Fabian Wang, Ping Jian, Chenfu Bao, Zhonghou Lv Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13954v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

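Scoring rationale: The paper mainly studies auditing the intrinsic risks of LLM agents. It is highly relevant to "LLM Agents OR Autonomous Agents OR Agentic Workflow" (10 points) because its core is evaluating the safety of agent trajectories under benign conditions. It has some relevance to "Large Language Models OR LLMs OR Foundation Models" (8 points) because the experiments use strong LLMs for evaluation. Other keywords, covering MoE, SLMs, training techniques, inference optimization, and AI-for-science applications, are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies the intrinsic risk that LLM agents may enter unsafe trajectories even under benign conditions. Using the newly built HINTBench benchmark, it finds that current LLMs do well at risk detection but show significant capability gaps in risk-step localization and fine-grained failure diagnosis.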

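Abstract (translated)

Existing agent-safety evaluation focuses mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, in which intrinsic failures stay latent, propagate across long-horizon execution, and eventually lead to severe consequences. To evaluate this setting, we propose non-attack intrinsic risk auditing and introduce the HINTBench benchmark. It contains 629 agent trajectories (523 risky, 106 safe; 33 steps on average) and supports three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but they fall below 35 Strict-F1 on risk-step localization, and fine-grained failure diagnosis is harder still. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.

Abstract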

Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of intrinsic risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce non-attack intrinsic risk auditing and present HINTBench, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. These findings establish intrinsic risk auditing as an open challenge for agent safety.

Keywords: agent safety, intrinsic risk, trajectory benchmark, risk detection, risk-step localization, failure diagnosis, LLM agents, non-attack auditing

40. ❌ AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

Authors: Joydeep Biswas, Sheila Schoepp, Gautham Vasan, Anthony Opipari, Arthur Zhang, Zichao Hu, Sebastian Joseph, Matthew Lease, Junyi Jessy Li, Peter Stone, Kiri L. Wagstaff, Matthew E. Taylor, Odest Chadwicke Jenkins Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

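Rationale: The paper's core is the large-scale application of AI-assisted peer review. It directly involves LLMs (frontier models), AI agents (a multi-stage pipeline), and tool use (the system combines models with tool use), and it is a concrete AI for Science application. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper's technical details.

!!! tip deepseek-chat TL;DR

This paper asks whether AI can generate technically sound peer reviews at real conference scale. An AI review system was deployed for all 22,977 full-review papers at AAAI-26, and surveyed authors and program committee members preferred the AI reviews to human reviews on technical accuracy and research suggestions, showing that AI can contribute to large-scale scientific review.

Abstract (Chinese translation)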

随着投稿量激增，科学同行评审体系正面临日益增长的压力，维持评审质量、一致性与时效性变得愈发困难。人工智能的最新进展促使学界开始探索其在同行评审中的应用，但一个尚未解决的关键问题是：AI能否在现实会议规模下生成技术层面可靠的评审意见？本文报告了首次大规模AI辅助同行评审的实际部署：AAAI-26所有主轨投稿均收到一份由前沿系统生成的、明确标识的AI评审报告。该系统融合尖端模型、工具调用与安全防护机制，通过多阶段流程在不足一天内为全部22,977篇完整评审论文生成了评审意见。对AAAI-26作者和程序委员会成员的大规模调查显示，参与者不仅认为AI评审具有实用价值，更在技术准确性和研究建议等关键维度上对其评价优于人工评审。我们还引入了一项新颖的基准测试，发现该系统在检测各类科学缺陷方面显著优于简单的LLM生成评审基线。这些结果表明，当前最先进的AI方法已能在会议规模下为科学同行评审作出实质性贡献，为构建下一代人机协同的科研评估体系开辟了道路。

Abstract

Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.

Keywords: AI-assisted peer review, large-scale deployment, frontier models, tool use, scientific peer review, human-AI teaming, conference scale, technical accuracy

41. ❌ ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection

Authors: Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

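Rationale: The paper studies time-series anomaly detection (TSAD) and proposes the ASTER framework, which uses a pre-trained LLM to enrich the latent-space representation of time series, so it is related to "Large Language Models" (8 points). It involves scientific application domains such as industrial monitoring and healthcare, giving it some connection to "AI for Science" (5 points). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the scarcity of labelled data and the heterogeneity of anomalies in time-series anomaly detection, this paper proposes the ASTER framework, which uses a pre-trained LLM to enrich latent-space representations and generates pseudo-anomalies to train a Transformer classifier, achieving state-of-the-art performance on three benchmark datasets.

Abstract (Chinese translation)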

时间序列异常检测（TSAD）在工业监控、医疗保健和网络安全等领域至关重要，但由于异常情况罕见且异构，以及标注数据稀缺，该任务仍具挑战性。数据稀缺使得无监督方法占据主导地位，然而现有方法通常依赖于重构或预测（这些方法在处理复杂数据时存在困难），或依赖于基于嵌入的方法（这些方法需要特定领域的异常合成和固定的距离度量）。我们提出了ASTER框架，该框架直接在潜在空间中生成伪异常，避免了手工注入异常的需求，也无需领域专业知识。一个潜在空间解码器生成定制的伪异常，用于训练基于Transformer的异常分类器，同时一个预训练的大型语言模型（LLM）丰富了该空间的时间与上下文表征。在三个基准数据集上的实验表明，ASTER实现了最先进的性能，并为基于LLM的TSAD设立了新标准。

Abstract

Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.

Keywords: Time-series anomaly detection, Unsupervised learning, Pseudo-anomaly generation, Latent space, Transformer, Large Language Models, State-of-the-art performance

42. ❌ Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Authors: Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

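Rationale: The paper's core is the use of an instruction-tuned LLM (GPT-5.2) for text annotation, so it is highly related to "Large Language Models" and "Instruction Tuning" (10 points), since an instruction-tuned LLM is used directly for labelling. Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, CoT, Agents, Quantization, and AI for Science are not mentioned or involved in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

This study compares an instruction-tuned LLM (GPT-5.2) with human annotators for labelling hostility in German political TikTok comments within active learning. It finds that LLM labels reach a macro-F1 comparable to human annotation at lower cost ($43 vs $316), but classifiers trained on LLM labels over-predict the positive class, mainly in topically ambiguous discussions, indicating that the choice of annotation strategy should consider the error distribution and not only overall performance.

Abstract (Chinese translation)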

指令微调的大型语言模型（LLM）能够以极低成本，通过简短提示为数千条实例进行标注。这为主动学习（AL）带来了两个问题：在AL循环中，LLM生成的标注能否替代人工标注？以及当整个语料库可以一次性完成标注时，AL是否仍然必要？我们在一个包含277,902条德国政治TikTok评论的新数据集上（其中25,974条由LLM标注，5,000条由人工标注）对这两个问题进行了研究，通过四种编码器比较了七种标注策略，以检测反移民敌意。使用25,974条GPT-5.2标注（成本43美元）训练的分类器，其F1-Macro值与使用3,800条人工标注（成本316美元）训练的分类器相当。在我们的预富集数据池中，主动学习相比随机采样几乎没有优势，并且在相同成本下，其F1值低于完整的LLM标注。然而，总体F1值的可比性掩盖了误差结构的系统性差异：相对于人工黄金标准，基于LLM训练的分类器会过度预测正类。这种差异主要集中在主题模糊的讨论中，即反移民敌意与政策批评之间的界限最为微妙，这表明标注策略的选择不应仅基于总体F1值，而应取决于目标应用可接受的误差特征。

Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.

Keywords: Large Language Models, Instruction Tuning, Active Learning, Annotation, Hostility Detection, GPT-5.2, Human vs LLM, Error Analysis
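
The abstract's observation that comparable aggregate F1 can mask different error structures can be illustrated with a minimal sketch. The confusion counts below are hypothetical and chosen only for illustration; they are not results from the paper, and the helper names `f1` and `macro_f1` are assumptions of this sketch.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score for one class from its confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(tp: int, fp: int, fn: int, tn: int) -> float:
    """Unweighted mean of the positive-class and negative-class F1."""
    f1_pos = f1(tp, fp, fn)
    # For the negative class, true negatives are the hits and the
    # roles of false positives and false negatives are swapped.
    f1_neg = f1(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


# Hypothetical counts on 50 positive and 50 negative examples.
over_predicts = dict(tp=45, fp=15, fn=5, tn=35)   # flags too many positives
under_predicts = dict(tp=35, fp=5, fn=15, tn=45)  # misses more positives

for name, counts in [("over", over_predicts), ("under", under_predicts)]:
    predicted_pos = counts["tp"] + counts["fp"]
    print(name, round(macro_f1(**counts), 3), "predicted positives:", predicted_pos)
```

Both hypothetical classifiers reach the same macro-F1 (about 0.798), yet one predicts 60 positives and the other 40 on the same 100 examples, which is the kind of difference an aggregate score alone would not reveal.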

43. ❌ Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning

Authors: Saeed Rahmani, Gözde Körpe, Zhenlin Xu, Bruno Brito, Simeon Craig Calvert, Bart van Arem Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13891v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
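Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0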

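Rationale: The paper studies automated-driving navigation at unsignalized intersections and proposes a framework that combines Model Predictive Control (MPC) with deep reinforcement learning (RL). All keywords concern large models, the technical principles of deep learning, or AI for Science applications, whereas the paper only applies deep RL to automated driving and does not involve large language models, MoE, scaling laws, training techniques, inference optimization, or AI agents. The only related keyword is "Multi-agent Systems", because the paper studies multi-vehicle interaction scenarios; since it does not involve LLM agents or coordination mechanisms, it receives 5 points (some relevance). All other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes a framework combining Model Predictive Control and deep reinforcement learning for automated-driving navigation at unsignalized intersections in multi-agent scenarios. Experiments show that, compared with pure MPC, the framework reduces the collision rate by 21% and improves the success rate by 6.5%, and that it transfers to an unseen highway merging scenario substantially better than end-to-end RL.

Abstract (Chinese translation)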

无信号灯交叉路口的自动驾驶因复杂的多车交互及安全与效率的平衡需求而极具挑战性。模型预测控制（Model Predictive Control, MPC）通过优化提供结构化的约束处理能力，但其依赖人工设计的规则，常导致行为过于保守。深度强化学习（Reinforcement Learning, RL）能够从经验中学习自适应行为，但往往难以保证安全性，且在未见环境中的泛化能力不足。本研究提出一种融合MPC与RL的集成框架，以提升多智能体场景下的导航性能。实验表明，在三种交通密度水平下，MPC-RL框架均优于独立的MPC与端到端RL方法。总体而言，相较于纯MPC方法，MPC-RL将碰撞率降低了21%，成功率提高了6.5%。我们进一步评估了该框架在未经重新训练的情况下，向高速公路汇入场景的零样本迁移能力。两种基于MPC的方法其迁移效果均显著优于端到端近端策略优化（PPO），这凸显了MPC主干在跨场景鲁棒性中的关键作用。此外，该框架在训练过程中比端到端RL更快达到损失稳定，表明其学习负担有所减轻。这些结果表明，集成方法能够改善多智能体交叉路口场景中安全性能与效率之间的平衡，同时MPC组件为驾驶环境的跨场景泛化提供了坚实基础。本研究的实现代码已开源提供。

Abstract

Automated driving at unsignalized intersections is challenging due to complex multi-vehicle interactions and the need to balance safety and efficiency. Model Predictive Control (MPC) offers structured constraint handling through optimization but relies on hand-crafted rules that often produce overly conservative behavior. Deep Reinforcement Learning (RL) learns adaptive behaviors from experience but often struggles with safety assurance and generalization to unseen environments. In this study, we present an integrated MPC-RL framework to improve navigation performance in multi-agent scenarios. Experiments show that MPC-RL outperforms standalone MPC and end-to-end RL across three traffic-density levels. Collectively, MPC-RL reduces the collision rate by 21% and improves the success rate by 6.5% compared to pure MPC. We further evaluate zero-shot transfer to a highway merging scenario without retraining. Both MPC-based methods transfer substantially better than end-to-end PPO, which highlights the role of the MPC backbone in cross-scenario robustness. The framework also shows faster loss stabilization than end-to-end RL during training, which indicates a reduced learning burden. These results suggest that the integrated approach can improve the balance between safety performance and efficiency in multi-agent intersection scenarios, while the MPC component provides a strong foundation for generalization across driving environments. The implementation code is available open-source.

Keywords: Automated Driving, Multi-agent Scenarios, Model Predictive Control, Deep Reinforcement Learning, Unsignalized Intersections, Safety and Efficiency, Cross-scenario Robustness, MPC-RL Framework

44. ❌ Evaluating Supervised Machine Learning Models: Principles, Pitfalls, and Metric Selection

Authors: Xuanyan Liu, Ignacio Cabrera Martin, Marcello Trovati, Xiaolong Xu, Nikolaos Polatidis Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13882v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

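Rationale: The paper focuses on evaluation methods, principles, and metric selection for supervised machine learning models, which belongs to traditional machine learning methodology. It does not involve large language models, the technical principles of deep learning, large-model applications, or any specific technique named in the scoring keywords (such as MoE, RLHF, RAG, or quantization). All keywords relate to large models, deep learning, and related techniques, whereas this paper discusses general supervised-learning evaluation, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper examines principles, common pitfalls, and metric selection in evaluating supervised machine learning models. Through controlled experiments it exposes evaluation pitfalls such as the accuracy paradox and data leakage, and it argues for evaluation that is aligned with the task's operational objective.

Abstract (Chinese translation)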

监督机器学习模型的评估是构建可靠预测系统的关键环节。尽管机器学习库和自动化工作流已广泛普及，模型评估却往往简化为报告少量聚合指标，这可能导致对模型实际性能的误导性结论。本文系统探讨了在分类与回归任务中评估监督学习算法的基本原则、挑战与实践考量。文章重点分析了数据集特性、验证设计、类别不平衡、非对称错误成本以及性能指标选择如何影响评估结果。通过使用多样化基准数据集进行一系列受控实验，本研究揭示了常见误区，例如准确率悖论、数据泄露、不当的指标选择以及对标量汇总指标的过度依赖。本文还对比了不同的验证策略，并强调了模型评估需与任务预期应用目标保持一致的重要性。通过将评估呈现为一个以决策为导向且依赖上下文的过程，本研究为选择支持统计可靠、稳健且可信赖的监督机器学习系统的评估指标与验证方案提供了结构化基础。

Abstract

The evaluation of supervised machine learning models is a critical stage in the development of reliable predictive systems. Despite the widespread availability of machine learning libraries and automated workflows, model assessment is often reduced to the reporting of a small set of aggregate metrics, which can lead to misleading conclusions about real-world performance. This paper examines the principles, challenges, and practical considerations involved in evaluating supervised learning algorithms across classification and regression tasks. In particular, it discusses how evaluation outcomes are influenced by dataset characteristics, validation design, class imbalance, asymmetric error costs, and the choice of performance metrics. Through a series of controlled experimental scenarios using diverse benchmark datasets, the study highlights common pitfalls such as the accuracy paradox, data leakage, inappropriate metric selection, and overreliance on scalar summary measures. The paper also compares alternative validation strategies and emphasizes the importance of aligning model evaluation with the intended operational objective of the task. By presenting evaluation as a decision-oriented and context-dependent process, this work provides a structured foundation for selecting metrics and validation protocols that support statistically sound, robust, and trustworthy supervised machine learning systems.

Keywords: supervised machine learning, model evaluation, performance metrics, validation design, class imbalance, accuracy paradox, data leakage, decision-oriented evaluation
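
The accuracy paradox named in the abstract can be shown with a minimal sketch. The data set below is hypothetical (990 negatives, 10 positives) and the always-negative "classifier" is a deliberately trivial baseline; neither comes from the paper.

```python
# Hypothetical imbalanced test set: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A trivial baseline that always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return tp / actual_pos if actual_pos else 0.0


print("accuracy:", accuracy(y_true, y_pred))  # 0.99
print("recall:  ", recall(y_true, y_pred))    # 0.0
```

The baseline scores 99% accuracy while finding none of the positive cases, which is why a single aggregate metric can be misleading under class imbalance.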

45. ❌ SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention

Authors: Hongtao Xu, Jianchao Tan, Yuxuan Hu, Pengju Lu, Hongyu Wang, Pingwei Sun, Yerui Sun, Yuchen Xie, Xunliang Cai, Mingzhen Li, Weile Jia Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13847v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	5.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

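Rationale: The paper's core is sparse-attention optimization for long-context LLM training, which is highly related to "Large Language Models", "Mixture of Experts OR MoE OR Sparse Models", and "Context Window Extension OR Long Context LLMs" (10 points). It concerns computational-efficiency optimization, which has some connection to "KV Cache Compression OR Linear Attention OR FlashAttention" and "Speculative Decoding OR Inference Acceleration" (5 points). Other keywords, such as small models, alignment, and AI for Science applications, are not involved (0 points).

!!! tip deepseek-chat TL;DR

To address the load imbalance caused by sparse attention in long-context LLM training, this paper proposes SparseBalance, an algorithm-system co-design framework. Using dynamic sparsity tuning and a sparsity-aware batching strategy, it achieves up to a 1.33× end-to-end speedup while improving long-context capability by 0.46% on the LongBench benchmark.

Abstract (Chinese translation)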

尽管稀疏注意力机制缓解了长上下文大语言模型训练的计算瓶颈，但其分布式训练过程在以下两方面表现出极端异构性：1) 序列长度与 2) 稀疏度敏感性，这导致了严重的负载不均衡问题及次优的模型精度。现有算法与训练框架通常仅关注单一问题，未能系统性地协同优化这两个难题。为此，我们提出了SparseBalance，一种新颖的算法-系统协同设计框架，该框架利用稀疏性与序列异构性来联合优化模型精度与系统效率。首先，我们提出了工作负载感知的动态稀疏度调优，该方法采用双向稀疏度调整以消除掉队者，并利用固有的计算气泡来无损提升精度。其次，我们提出了一种稀疏度感知的批处理策略以实现粗粒度负载均衡，该策略与动态稀疏度调优形成互补。实验结果表明，SparseBalance在LongBench基准测试上实现了最高1.33倍的端到端加速，同时仍将长上下文能力提升了0.46%。

Abstract

While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both 1) sequence length and 2) sparsity sensitivity, leading to a severe imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue, failing to systematically co-optimize these two problems. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework, which exploits the sparsity and sequence heterogeneity to optimize model accuracy and system efficiency jointly. First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance, which complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33× end-to-end speedup while still improving the long-context capability by 0.46% on the LongBench benchmark.

Keywords: Sparse Attention, Long-context LLM Training, Load Balancing, Dynamic Sparsity Tuning, Distributed Training, Sequence Heterogeneity, Sparsity-aware Batching, Computational Efficiency
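
To picture what coarse-grained, sparsity-aware load balancing might look like, here is a minimal sketch of a greedy assignment of sequences to workers. This is not the SparseBalance algorithm, which the abstract does not specify. The cost model (attention work proportional to sequence length squared times the kept fraction), the function names, and the sample values are all assumptions made for illustration.

```python
import heapq


def estimated_cost(seq_len: int, sparsity: float) -> float:
    """Assumed cost model: attention work ~ seq_len^2 * kept fraction."""
    return seq_len ** 2 * (1.0 - sparsity)


def balance(samples, num_workers):
    """Greedy assignment: costliest sample first, to the least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}
    ordered = sorted(samples, key=lambda s: estimated_cost(*s), reverse=True)
    for sample in ordered:
        load, w = heapq.heappop(heap)
        assignment[w].append(sample)
        heapq.heappush(heap, (load + estimated_cost(*sample), w))
    loads = {w: sum(estimated_cost(*s) for s in assignment[w]) for w in assignment}
    return assignment, loads


# Hypothetical (sequence length, sparsity) pairs.
samples = [(32_000, 0.9), (8_000, 0.5), (16_000, 0.8),
           (4_000, 0.0), (24_000, 0.9), (12_000, 0.7)]
assignment, loads = balance(samples, num_workers=2)
for w, load in loads.items():
    print(f"worker {w}: {len(assignment[w])} samples, estimated cost {load:.3g}")
```

Under this assumed cost model, a long but very sparse sequence can be cheaper than a shorter dense one, so balancing on sequence length alone would leave some workers idle while others lag.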

46. ❌ Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?

Authors: Reem Alfayez, Manal Binkhonain Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13826v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

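Rationale: The paper studies sentiment analysis in software engineering, focusing on zero-shot learning (ZSL) techniques, including embedding-based, NLI-based, TARS-based, and generative approaches, and compares them with fine-tuned Transformer models. It does not involve keywords such as large models, innovations in deep-learning technical principles, or AI for Science, so all keywords are unrelated to the paper.

!!! tip deepseek-chat TL;DR

This study explores the potential of zero-shot learning for sentiment analysis in software engineering. It finds that embedding-based or generative ZSL methods combined with expert-curated labels can reach macro-F1 scores comparable to fine-tuned Transformer models, offering a way to address the scarcity of annotated data.

Abstract (Chinese translation)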

软件工程中的情感分析旨在理解软件制品中表达的情绪。先前研究指出，通用现成情感分析工具在软件工程领域应用存在局限性，表明需要针对不同软件工程场景开发专用工具。此类工具的研发高度依赖监督机器学习技术，而这类技术需要标注数据集的支持。获取此类数据集面临重大挑战，因为它需要领域专业知识且耗费大量人力。本研究旨在探索零样本学习在缓解软件工程情感分析中标注数据稀缺问题方面的潜力。方法：我们通过实证实验评估了多种零样本学习技术的性能，包括基于嵌入、基于自然语言推理、基于TARS以及基于生成的零样本学习技术。我们在不同标签设置下评估这些技术的表现，以考察标签配置的影响。此外，我们将零样本学习技术与基于微调Transformer的先进模型结果进行了对比。最后，我们进行了错误分析以识别误分类的主要原因。结果：研究发现，零样本学习技术——特别是将专家构建的标签与基于嵌入或基于生成的模型相结合时——能够达到与微调Transformer模型相当的宏观F1分数。错误分析表明，标注的主观性和事实极性是导致零样本学习误分类的主要因素。结论：本研究证实了零样本学习在软件工程情感分析中的应用潜力。通过降低对标注数据的依赖，零样本学习能为标注数据集稀缺的挑战提供解决方案。

Abstract

Sentiment analysis in software engineering focuses on understanding emotions expressed in software artifacts. Previous research highlighted the limitations of applying general off-the-shelf sentiment analysis tools within the software engineering domain and indicated the need for specialized tools tailored to various software engineering contexts. The development of such tools heavily relies on supervised machine learning techniques that necessitate annotated datasets. Acquiring such datasets is a substantial challenge, as it requires domain-specific expertise and significant effort. Objective: This study explores the potential of ZSL to address the scarcity of annotated datasets in sentiment analysis within software engineering. Method: We conducted an empirical experiment to evaluate the performance of various ZSL techniques, including embedding-based, NLI-based, TARS-based, and generative-based ZSL techniques. We assessed the performance of these techniques under different label setups to examine the impact of label configurations. Additionally, we compared the results of the ZSL techniques with state-of-the-art fine-tuned transformer-based models. Finally, we performed an error analysis to identify the primary causes of misclassifications. Results: Our findings demonstrate that ZSL techniques, particularly those combining expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to fine-tuned transformer-based models. The error analysis revealed that subjectivity in annotation and polar facts are the main contributors to ZSL misclassifications. Conclusion: This study demonstrates the potential of ZSL for sentiment analysis in software engineering. ZSL can provide a solution to the challenge of annotated dataset scarcity by reducing reliance on annotated datasets.

Keywords: sentiment analysis, software engineering, zero-shot learning, ZSL, embedding-based, generative models, transformer models, annotated datasets

47. ❌ Cognitive Offloading in Agile Teams: How Artificial Intelligence Reshapes Risk Assessment and Planning Quality

Authors: Adriana Caraeni, Alexander Shick, Andrew Lan Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13814v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

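Rationale: The paper studies the application of AI in Agile project management, in particular the effect of cognitive offloading on risk assessment and planning quality, so it is AI application research. However, all keywords target specific technical details such as the principles, architectures, training methods, inference optimization, alignment, and agent systems of large models and deep learning. The paper only refers to generic "AI" or "algorithmic tools" and does not involve any specific large-model technique, architecture, training method, or optimization technique, nor AI applications in scientific fields (such as bioinformatics). All keywords are therefore entirely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Through an experiment comparing AI-only, human-only, and hybrid Agile sprint planning models, this study finds that AI-only planning is efficient but has lower risk capture rates and more rework, while human-only planning is adaptable but incurs substantial overhead. It therefore proposes a hybrid AI-human planning framework that assigns algorithmic tools to estimation and backlog formatting while requiring human involvement in risk assessment and ambiguity resolution.

Abstract (Chinese translation)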

人工智能（AI）的最新进展在自动化敏捷项目管理的关键环节方面展现出潜力，但其对团队认知的影响仍待深入探究。本研究通过一项受控三条件实验，在中等规模数字机构的实际客户交付项目中，比较纯AI、纯人工及混合规划模型，以探究敏捷冲刺规划中的认知卸载现象。我们采用定量指标——包括估算准确性、返工率与范围变更恢复时间——结合规划稳健性的定性指标，评估各模型在原始效率之外的实际效能。研究发现，纯AI规划虽能最大限度减少时间与成本，但由于未阐明的假设，其风险捕获率显著降低且返工增加；而纯人工规划在适应性方面表现优异，却需承担大量管理开销。基于这些发现，我们提出一个混合AI-人类冲刺规划的理论框架，将算法工具分配于估算与待办事项格式化任务，同时要求人类参与风险评估和模糊性解析。研究结果挑战了“效率等同于效能”的假设，为寻求增强而非削弱团队认知的组织提供了可行的治理策略。

Abstract

Recent advances in artificial intelligence (AI) have shown promise in automating key aspects of Agile project management, yet their impact on team cognition remains underexplored. In this work, we investigate cognitive offloading in Agile sprint planning by conducting a controlled, three-condition experiment comparing AI-only, human-only, and hybrid planning models on a live client deliverable at a mid-sized digital agency. Using quantitative metrics – including estimation accuracy, rework rates, and scope change recovery time – alongside qualitative indicators of planning robustness, we evaluate each model’s effectiveness beyond raw efficiency. We find that while AI-only planning minimizes time and cost, it significantly degrades risk capture rates and increases rework due to unstated assumptions, whereas human-only planning excels at adaptability but incurs substantial overhead. Drawing on these findings, we propose a theoretical framework for hybrid AI-human sprint planning that assigns algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution. Our results challenge the assumption that efficiency equates to effectiveness, offering actionable governance strategies for organizations seeking to augment rather than erode team cognition.

Keywords: Cognitive Offloading, Agile Teams, Artificial Intelligence, Risk Assessment, Planning Quality, Hybrid AI-human Planning, Sprint Planning, Team Cognition

48. ❌ Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

Authors: Arya Shah, Vaibhav Tripathi, Mayank Singh, Chaklam Silpasuwanchai Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13803v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

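Scoring rationale: The paper studies the adversarial robustness of vision-language models (VLMs) and is related to AI safety. The main relevant keywords are: 1) "Hallucination Mitigation OR Factuality OR Truthfulness" (10 points): the core of the work is model resistance to sycophantic manipulation, which bears directly on factuality and truthfulness; 2) "Instruction Tuning OR Alignment OR Value Alignment" (8 points): it studies alignment, specifically the alignment of visual representations with human neural processing; 3) "Mechanistic Interpretability OR Explainable AI" (8 points): it analyzes internal model representations through brain alignment, which falls under explainable AI. The remaining keywords target language-only models or specific techniques and are not directly related to the paper's vision-language focus.

!!! tip deepseek-chat TL;DR

The study finds that, in vision-language models, brain alignment with early visual cortex (V1-V3) is negatively correlated with sycophancy: models with higher early-visual alignment are more resistant to sycophantic manipulation. This suggests that faithful low-level visual encoding helps models withstand adversarial linguistic override.

Abstract translation (Chinese)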

视觉语言模型正日益部署于高风险场景中，然而其受奉承性操纵的易感性仍未得到充分理解，尤其是在这些模型内部如何表征视觉信息方面。那些视觉表征更接近人类神经处理的模型是否也更能抵抗对抗性压力，这是一个悬而未决的问题，对神经科学和人工智能安全均具有重要意义。我们通过评估12个开源视觉语言模型来研究此问题，这些模型涵盖6种架构家族和40倍参数范围（2.56亿至100亿），评估沿两个维度展开：大脑对齐度（通过从8名人类受试者和6个视觉皮层感兴趣区域的自然场景数据集中预测fMRI响应来测量）和奉承性（通过涵盖5个类别和10个难度级别的76,800个两轮“煤气灯”式诱导提示进行测量）。感兴趣区域分析表明，早期视觉皮层（V1–V3）的对齐度是奉承性的可靠负向预测指标（r = -0.441，BCa 95%置信区间[-0.740, -0.031]），所有12次留一法相关性均为负值，且在存在性否认攻击中效应最强（r = -0.597，p = 0.040）。这种解剖学特异性关系在高级别类别选择区域中并不存在，表明忠实的低层级视觉编码为视觉语言模型提供了可测量的锚点，以抵抗对抗性语言覆盖。我们在GitHub上发布了代码，并在Hugging Face上发布了数据集。

Abstract

Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a $40\times$ parameter range (256M–10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1–V3) is a reliable negative predictor of sycophancy ($r = -0.441$, BCa 95% CI $[-0.740, -0.031]$), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks ($r = -0.597$, $p = 0.040$). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on [GitHub](https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation) and dataset on [Hugging Face](https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3).

Keywords: vision-language models, sycophantic manipulation, brain alignment, early visual cortex, adversarial robustness, fMRI responses, neural processing, AI safety
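
The leave-one-out check mentioned in the abstract can be illustrated with a minimal sketch. It assumes a plain Pearson correlation over per-model scores; the function names and the six toy data points are hypothetical and are not taken from the paper, and the paper's BCa bootstrap interval is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def leave_one_out_r(xs, ys):
    """Recompute r once per model, each time with that model removed."""
    return [
        pearson_r(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Hypothetical toy scores for six models (NOT data from the paper).
alignment = [0.10, 0.15, 0.22, 0.30, 0.34, 0.41]   # early-visual brain alignment
sycophancy = [0.80, 0.72, 0.75, 0.55, 0.60, 0.42]  # sycophancy rate

r_all = pearson_r(alignment, sycophancy)
r_loo = leave_one_out_r(alignment, sycophancy)
all_negative = all(r < 0 for r in r_loo)
print(f"r = {r_all:.3f}, all leave-one-out r negative: {all_negative}")
```

A sign that stays the same across every leave-one-out subset indicates that no single model drives the direction of the correlation, which is the kind of robustness the abstract reports for its 12 models.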

49. ❌ AlphaCNOT: Learning CNOT Minimization with Model-Based Planning

Authors: Jacopo Cossio, Daniele Lizzio Bosco, Riccardo Romanello, Giuseppe Serra, Carla Piazza Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	5.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

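Scoring rationale: The paper studies quantum circuit optimization, specifically CNOT gate minimization, using AlphaCNOT, a reinforcement learning framework based on Monte Carlo Tree Search (MCTS). It is unrelated to most keywords, which concern large language models (LLMs) and associated techniques (fine-tuning, alignment, inference optimization, and so on), whereas this paper focuses on quantum computing and reinforcement learning applied to a specific scientific problem. Only two keywords are relevant: 1. "Monte Carlo Tree Search OR MCTS AND LLM": the paper explicitly uses MCTS as the core of its RL framework but does not involve LLMs, hence 5 points (moderate relevance). 2. "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI (reinforcement learning in particular) to quantum computing, a scientific domain, which fits "AI for Science", hence 5 points (moderate relevance). No other keyword is mentioned in or related to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes AlphaCNOT, a reinforcement learning framework based on Monte Carlo Tree Search, for CNOT gate minimization in quantum circuits. On linear reversible synthesis it reduces the CNOT count by up to 32% compared with the PMH baseline.

Abstract translation (Chinese)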

量子电路优化是量子计算中的核心任务，因为当前的中等规模含噪声量子设备受限于误差传播，其影响通常随操作数量增加而加剧。在量子操作中，CNOT门具有基础性意义，它是通用Clifford+T门集中唯一的双量子比特门。CNOT门最小化问题已通过启发式算法得到研究，例如在线性可逆综合（即无拓扑约束的CNOT最小化）中著名的Patel-Markov-Hayes（PMH）方法；近年来，在更复杂的拓扑感知综合场景中（其中每个CNOT门仅能作用于部分量子比特对），也开始出现基于强化学习（Reinforcement Learning, RL）的策略。本文提出AlphaCNOT，一种基于蒙特卡洛树搜索（Monte Carlo Tree Search, MCTS）的强化学习框架，通过将CNOT最小化问题建模为规划问题来有效求解。与其他基于强化学习的方法不同，我们的方法是基于模型的，即能够利用前瞻搜索来评估未来轨迹，从而找到更高效的CNOT门序列。在线性可逆综合任务中，我们的方法相比PMH基准实现了CNOT门数量最高达32%的减少；在拓扑约束版本中，针对最多8量子比特的多种拓扑结构，我们相较于当前最先进的基于强化学习的解决方案也实现了持续的门数量降低。我们的结果表明，强化学习与基于搜索的策略相结合可应用于不同的电路优化任务（如Clifford门最小化），从而推动向“量子效用”时代的过渡。

Abstract

Quantum circuit optimization is a central task in Quantum Computing, as current Noisy Intermediate Scale Quantum devices suffer from error propagation that often scales with the number of operations. Among quantum operations, the CNOT gate is of fundamental importance, being the only 2-qubit gate in the universal Clifford+T set. The problem of CNOT gate minimization has been addressed by heuristic algorithms such as the well-known Patel-Markov-Hayes (PMH) for linear reversible synthesis (i.e., CNOT minimization with no topological constraints), and more recently by Reinforcement Learning (RL) based strategies in the more complex case of topology-aware synthesis, where each CNOT can act on a subset of all qubit pairs. In this work we introduce AlphaCNOT, an RL framework based on Monte Carlo Tree Search (MCTS) that effectively addresses the CNOT minimization problem by modeling it as a planning problem. In contrast to other RL-based solutions, our method is model-based, i.e. it can leverage lookahead search to evaluate future trajectories, thus finding more efficient sequences of CNOTs. Our method achieves a reduction of up to 32% in CNOT gate count compared to the PMH baseline on linear reversible synthesis, while in the constrained version we report a consistent gate count reduction on a variety of topologies with up to 8 qubits, with respect to state-of-the-art RL-based solutions. Our results suggest the combination of RL with search-based strategies can be applied to different circuit optimization tasks, such as Clifford minimization, thus fostering the transition toward the “quantum utility” era.

Keywords: Quantum circuit optimization, CNOT minimization, Reinforcement Learning, Monte Carlo Tree Search, AlphaCNOT, Model-based planning, Quantum Computing, Linear reversible synthesis
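
To make the linear reversible synthesis setting concrete, here is a minimal sketch of how a CNOT circuit corresponds to an invertible matrix over GF(2). It uses plain Gauss-Jordan elimination as a naive synthesizer; it is neither the PMH algorithm nor the paper's MCTS planner, and the function names and the 3-qubit example matrix are hypothetical. The sketch assumes the convention that CNOT(control, target) XORs the control row into the target row.

```python
def synthesize_cnots(matrix):
    """Return (control, target) CNOTs that implement an invertible GF(2) matrix.

    Naive Gauss-Jordan elimination using only row additions, since each
    row addition over GF(2) is exactly one CNOT under this convention.
    """
    a = [row[:] for row in matrix]
    n = len(a)
    ops = []

    def add_row(control, target):
        # XOR the control row into the target row and record the gate.
        a[target] = [x ^ y for x, y in zip(a[target], a[control])]
        ops.append((control, target))

    for j in range(n):
        if a[j][j] == 0:
            # Create a pivot by adding a lower row that has a 1 in column j.
            pivot = next((i for i in range(j + 1, n) if a[i][j] == 1), None)
            if pivot is None:
                raise ValueError("matrix is not invertible over GF(2)")
            add_row(pivot, j)
        for i in range(n):
            if i != j and a[i][j] == 1:
                add_row(j, i)

    # The recorded operations reduce the matrix to the identity; since each
    # one is its own inverse, the circuit is the same list in reverse order.
    return ops[::-1]


def apply_cnots(n, gates):
    """Simulate a CNOT circuit on n qubits and return its GF(2) matrix."""
    m = [[int(i == j) for j in range(n)] for i in range(n)]
    for control, target in gates:
        m[target] = [x ^ y for x, y in zip(m[target], m[control])]
    return m


# Hypothetical 3-qubit target transformation.
target = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]
gates = synthesize_cnots(target)
print(gates, apply_cnots(3, gates) == target)
```

The number of gates produced this way is what methods such as PMH or a learned planner aim to reduce; in the topology-constrained setting, only some (control, target) pairs would be allowed.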

50. ❌ Soft $Q(λ)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces

Authors: Pranav Mahajan, Ben Seymour Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13780v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

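Scoring rationale: The paper is devoted to reinforcement learning algorithm design, proposing Soft Q(λ), a multi-step off-policy method for entropy-regularised reinforcement learning. Its content centers entirely on reinforcement learning theory, algorithm derivation, and mathematical framework, covering soft Q-learning, eligibility traces, and off-policy learning. All scoring keywords concern large language models, deep learning techniques, or AI applications in science, whereas this is purely theoretical reinforcement learning research with no direct connection to those areas.

!!! tip deepseek-chat TL;DR

The paper proposes Soft Q(λ), a multi-step off-policy reinforcement learning framework that introduces a Soft Tree Backup operator and eligibility traces to address efficient credit assignment in entropy-regularised reinforcement learning.

Abstract translation (Chinese)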

软Q学习已成为一种通用的无模型熵正则化强化学习方法，其优化目标为在回报中增加对参考策略偏离度的惩罚项。尽管该方法已取得成功，但软Q学习的多步扩展研究仍相对不足，且目前仅限于玻尔兹曼策略下的同策略动作采样。在本研究简报中，我们首先提出软Q学习的正式$n$步形式化框架，随后通过引入新型软树回溯算子将该框架扩展至完全离策略场景。最终，我们将这些进展统一为软$Q(λ)$算法——一个优雅的在线离策略资格迹框架，能够在任意行为策略下实现高效的信用分配。本推导提出了一种可用于未来实证实验的无模型熵正则化价值函数学习方法。

Abstract

Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(λ)$, an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. Our derivations propose a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.

Keywords: Soft Q-learning, entropy-regularised reinforcement learning, multi-step off-policy, eligibility traces, Soft Tree Backup operator, credit assignment, model-free method, value functions
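
As background for the abstract, the following is a minimal sketch of the textbook one-step soft Q-learning backup with a reference policy, where the soft state value is $V(s) = \tau \log \sum_a \pi_{\text{ref}}(a \mid s)\, \exp(Q(s,a)/\tau)$. It does not implement the paper's $n$-step formulation, Soft Tree Backup operator, or eligibility traces; the function names, the tabular representation, and the toy numbers are assumptions made for illustration.

```python
import math


def soft_value(q_values, ref_probs, tau):
    """Soft state value: tau * log sum_a ref(a) * exp(Q(a) / tau).

    The maximum Q-value is subtracted before exponentiating for
    numerical stability (log-sum-exp trick).
    """
    q_max = max(q_values)
    total = sum(p * math.exp((q - q_max) / tau)
                for q, p in zip(q_values, ref_probs))
    return q_max + tau * math.log(total)


def soft_q_update(q, s, a, r, s_next, ref_probs, alpha, gamma, tau):
    """One-step tabular soft Q-learning update; returns the bootstrap target."""
    target = r + gamma * soft_value(q[s_next], ref_probs, tau)
    q[s][a] += alpha * (target - q[s][a])
    return target


# Hypothetical two-state, two-action table with a uniform reference policy.
q_table = [[0.0, 0.0], [1.0, 1.0]]
uniform = [0.5, 0.5]
target = soft_q_update(q_table, s=0, a=1, r=0.5, s_next=1,
                       ref_probs=uniform, alpha=0.1, gamma=0.9, tau=0.7)
print(target, q_table[0])
```

As the temperature `tau` approaches zero, the soft value approaches the maximum Q-value among actions the reference policy supports, which recovers the ordinary Q-learning target.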

51. ❌ A Dynamic-Growing Fuzzy-Neuro Controller, Application to a 3PSP Parallel Robot

Authors: Mohsen Jalaeian-Farimani, Mohammad-R Akbarzadeh-T, Alireza Akbarzadeh, Mostafa Ghaemi Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13763v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

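Scoring rationale: The paper studies a Dynamic Growing Fuzzy Neural Controller (DGFNC) for robot control, an application of traditional soft computing (fuzzy systems combined with neural networks). All scoring keywords revolve around large models (LLMs), deep learning techniques, and their applications in science, whereas this paper involves no large language models, deep model training, alignment, inference optimization, agents, or related techniques. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper combines a Dynamic Growing Fuzzy Neural Controller (DGFNC) with an adaptive strategy and applies it to position control of a 3PSP parallel robot. Simulations indicate faster response and lower computation while maintaining system stability.

Abstract translation (Chinese)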

迄今为止，多种软计算范式已被用于解决诸多现代问题。其中，模糊系统与神经网络的自组织组合能够构建强大的决策系统。本文将动态增长模糊神经控制器（Dynamic Growing Fuzzy Neural Controller，DGFNC）与自适应策略相结合，应用于3PSP并联机器人的位置控制问题。具体而言，研究对动态增长机制进行了更细致的探讨。与其他自组织方法相比，DGFNC以更为保守的方式添加新规则，因此省略了剪枝机制。取而代之的是，自适应策略使控制系统能够“适应”参数变化。此外，基于滑模的非线性控制器确保了系统稳定性。所提出的通用控制策略旨在以更少的计算量实现更快的响应，同时保持整体稳定性。最终，选择3PSP并联机器人作为研究对象，源于其复杂的动力学特性以及此类方法在现代工业系统中的实用性。多项仿真结果验证了所提出的DGFNC策略在3PSP机器人控制中的优越性。

Abstract

To date, various paradigms of soft computing have been used to solve many modern problems. Among them, a self-organizing combination of fuzzy systems and neural networks can make a powerful decision-making system. Here, a Dynamic Growing Fuzzy Neural Controller (DGFNC) is combined with an adaptive strategy and applied to a 3PSP parallel robot position control problem. Specifically, the dynamic growing mechanism is considered in more detail. In contrast to other self-organizing methods, DGFNC adds new rules more conservatively; hence the pruning mechanism is omitted. Instead, the adaptive strategy ‘adapts’ the control system to parameter variation. Furthermore, a sliding mode-based nonlinear controller ensures system stability. The resulting general control strategy aims to achieve faster response with less computation while maintaining overall stability. Finally, the 3PSP is chosen due to its complex dynamics and the utility of such approaches in modern industrial systems. Several simulations support the merits of the proposed DGFNC strategy as applied to the 3PSP robot.

Keywords: Dynamic Growing Fuzzy Neural Controller, DGFNC, 3PSP parallel robot, position control, adaptive strategy, sliding mode control, soft computing, self-organizing

52. ❌ From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

Authors: Wenxuan Li, Zhenfei Zhang, Mi Zhang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13777v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

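Scoring rationale: The paper focuses on machine unlearning for large language models (LLMs) and proposes MAGE, a framework that guides unlearning from lightweight user-provided anchors without access to the original training corpus. The work bears directly on LLM safety and privacy protection and is a novel method in applied large-model technology. It is therefore highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (10 points), since LLMs are the core object of study. The other keywords, such as MoE, SLMs, scaling laws, training methods (pre-training, SFT, RLHF, and so on), reasoning techniques (CoT, MCTS), agent systems, model compression, hallucination mitigation, interpretability, and AI for science, are not covered or discussed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the problem that large language models may memorize sensitive or copyrighted content, the paper proposes MAGE, a memory-graph-guided, corpus-free unlearning framework. It needs only a lightweight user-provided anchor to achieve effective targeted unlearning while preserving overall model utility.

Abstract translation (Chinese)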

大型语言模型（LLM）可能记忆敏感或受版权保护的内容，引发严重的隐私和法律问题。虽然机器遗忘已成为一种潜在的解决方案，但现有范式依赖于用户提供的遗忘集，这使得遗忘请求难以审计，并使系统面临二次泄露和恶意滥用的风险。我们提出MAGE，一种基于记忆图引导擦除的框架，用于实现用户最小化、无需原始语料的遗忘。仅需用户提供一个用于识别目标实体的轻量级锚点，MAGE即可探测目标LLM以恢复与目标相关的记忆内容，将其组织成加权的局部记忆图，并合成范围明确的监督信号以驱动遗忘过程。MAGE与模型无关，可嵌入标准遗忘方法中使用，且无需访问原始训练语料。在TOFU和RWKU两个基准测试上的实验表明，MAGE通过自生成的监督信号实现了有效的遗忘性能，其效果与依赖外部参考生成的监督信号相当，同时保持了模型的整体效用。这些结果支持了一种实用且可审计的遗忘工作流程，该流程由最小化锚点驱动，而非依赖用户提供的遗忘语料。

Abstract

Large language models (LLMs) may memorize sensitive or copyrighted content, raising significant privacy and legal concerns. While machine unlearning has emerged as a potential remedy, prevailing paradigms rely on user-provided forget sets, making unlearning requests difficult to audit and exposing systems to secondary leakage and malicious abuse. We propose MAGE, a Memory-grAph Guided Erasure framework for user-minimized, corpus-free unlearning. Given only a lightweight user anchor that identifies a target entity, MAGE probes the target LLM to recover target-related memorization, organizes it into a weighted local memory graph, and synthesizes scoped supervision for unlearning. MAGE is model-agnostic, can be plugged into standard unlearning methods, and requires no access to the original training corpus. Experiments on two benchmarks, TOFU and RWKU, demonstrate that MAGE’s self-generated supervision achieves effective unlearning performance comparable to supervision generated with external reference, while preserving overall utility. These results support a practical and auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora.

Keywords: Large language models, Machine unlearning, Memory graph, Corpus-free, Privacy, MAGE framework, TOFU benchmark, RWKU benchmark

53. ❌ The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents

Authors: Rafflesia Khan, Nafiul Islam Khan Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	5.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

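Scoring rationale: The paper's core topic is reasoning degradation of LLM agents on multi-step tasks. It proposes a parallel monitoring architecture, the Cognitive Companion, with two implementations that detect and recover from that degradation. It is highly relevant to the following keywords: LLM agents (the core object of study), self-correction (the monitoring and recovery mechanism), and reasoning (CoT and System 2 Thinking). It has some relevance to small language models (the experiments include an analysis of 1B-1.5B models). Other keywords, such as MoE, training methods, RAG, and quantization, are not covered in the paper.

!!! tip deepseek-chat TL;DR

The paper studies reasoning degradation in LLM agents on multi-step tasks and proposes the Cognitive Companion, a parallel monitoring architecture with two implementations. In its experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% at roughly 11% overhead, and the Probe-based Companion showed a mean effect size of +0.471 at zero measured inference overhead. The benefit depends on task type and model scale.

Abstract translation (Chinese)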

在多步骤任务中，大语言模型（LLM）智能体会出现推理退化、循环、漂移、卡顿等问题，在困难任务中发生率高达30%。现有解决方案包括硬性步骤限制（强制中断）或采用LLM作为评判器进行监控（每步产生10-15%的开销）。本文提出认知伴侣（Cognitive Companion），这是一种并行监控架构，包含两种实现方式：基于LLM的伴侣和一种新型零开销的基于探针（Probe-based）的伴侣。我们报告了一项以Gemma 4 E4B为核心的三批次可行性研究，并额外对Qwen 2.5 1.5B和Llama 3.2 1B进行了探索性小模型分析。在我们的实验中，基于LLM的伴侣在易循环任务上将重复率降低了52-62%，同时产生约11%的开销。基于探针的伴侣在隐藏状态（来自第28层）上进行训练，在实测推理零开销的情况下，平均效应量达到+0.471；其最佳探针结果在一个小型代理标记数据集上实现了交叉验证AUROC 0.840。一个关键的实证发现是，伴侣的效益似乎依赖于任务类型：伴侣对易循环和开放式任务最有帮助，而在结构性更强的任务上效果中性或负面。我们的小模型实验也暗示了可能存在规模边界：即使在干预触发时，伴侣也未能提升1B-1.5B模型的测量质量代理指标。总体而言，本文应被视为一项可行性研究，而非确定性验证。研究结果为亚词元（sub-token）监控可能具有实用性提供了鼓舞性的证据，指出了任务类型敏感性是一个实际的设计约束，并启发了选择性伴侣激活作为未来工作的一个前景方向。

Abstract

Large language model (LLM) agents on multi-step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.

Keywords: LLM agents, reasoning degradation, parallel monitoring, Cognitive Companion, self-correction, multi-step tasks, hidden states, task-type sensitivity
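
The idea of a monitor that watches an agent's steps and decides when an intervention should fire can be sketched as follows. This is a deliberately simple, hypothetical heuristic (counting repeated actions in a sliding window); it is not the paper's LLM-based or Probe-based Companion, and the class name, thresholds, and example actions are all assumptions.

```python
from collections import Counter, deque


class LoopMonitor:
    """Flags likely looping when one action dominates the recent steps."""

    def __init__(self, window=6, max_repeats=3):
        # Only the most recent `window` actions are kept.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action):
        """Record one agent step; return True if an intervention should fire."""
        self.recent.append(action)
        _, count = Counter(self.recent).most_common(1)[0]
        return count >= self.max_repeats


# Hypothetical agent trace: the agent keeps re-issuing the same search.
monitor = LoopMonitor(window=6, max_repeats=3)
steps = ["search(docs)", "open(page1)", "search(docs)",
         "search(docs)", "open(page2)"]
flags = [monitor.observe(step) for step in steps]
print(flags)
```

A monitor like this runs alongside the agent rather than inside its reasoning loop, so the agent's own steps are unchanged until the monitor fires. The abstract's finding that benefits are task-type dependent suggests that thresholds of this kind would need to be tuned, or the monitor disabled, for more structured tasks.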

54. ❌ TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds

Authors: Yifeng Zhou, Yuehong Hu, Zhixiang Feng, Junwei Pan, Kaihui Wu, Hanyong Li, Shangyu Zhang, Shudong Huang, Zhangbin Zhu, Chengguo Yin, Haijie Gu, Jie Jiang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13737v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

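Scoring rationale: The paper focuses on recommender system architecture, proposing TokenFormer to unify multi-field categorical feature modeling with sequential behavior modeling, using a BFTS attention scheme and an NLIR representation. All scoring keywords relate directly to large models, deep learning techniques, or AI applications in science, whereas this paper studies the fusion of two conventional recommender paradigms (feature interaction models and sequential models) and involves no large models, LLM techniques, AI for Science, or any specific technique named in the keywords. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

To address Sequential Collapse Propagation, a failure mode that arises when multi-field feature interaction models and sequential behavior models are unified in recommender systems, the paper proposes the TokenFormer architecture. Its BFTS attention scheme and Non-Linear Interaction Representation improve dimensional robustness and representation discriminability, and it reports state-of-the-art performance on public benchmarks and an industrial platform.

Abstract translation (Chinese)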

推荐系统在历史上沿着两大相对独立的范式发展：用于建模多域分类特征间相关性的特征交互模型，以及用于从历史交互序列中捕捉用户行为动态的序列模型。尽管近期趋势试图在共享主干网络中融合这两种范式，但我们通过实证研究发现，简单地将这两个分支统一可能导致序列坍缩传播（Sequential Collapse Propagation，SCP）的失效模式。即，与那些维度不良的非序列字段的交互会导致序列特征的维度坍缩。为克服这一挑战，我们提出了TokenFormer——一种统一的推荐架构，其创新点如下：首先，我们引入了底部-全连接-顶部-滑动（Bottom-Full-Top-Sliding，BFTS）注意力机制，该机制在底层应用全自注意力，而在上层应用收缩窗口的滑动注意力。其次，我们提出了非线性交互表示（Non-Linear Interaction Representation，NLIR），对隐藏状态施加单侧非线性乘法变换。在公共基准数据集和腾讯广告平台上的大量实验表明，该方法取得了最先进的性能；同时详细分析证实，TokenFormer在统一建模框架下显著提升了维度鲁棒性和表示可区分性。

Abstract

Recommender systems have historically developed along two largely independent paradigms: feature interaction models for modeling correlations among multi-field categorical features, and sequential models for capturing user behavior dynamics from historical interaction sequences. Although recent trends attempt to bridge these paradigms within shared backbones, we empirically reveal that naively unifying these two branches may lead to a failure mode of Sequential Collapse Propagation (SCP). That is, the interaction with those dimensionally ill non-sequence fields leads to the dimensional collapse of the sequence features. To overcome this challenge, we propose TokenFormer, a unified recommendation architecture with the following innovations. First, we introduce a Bottom-Full-Top-Sliding (BFTS) attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers. Second, we introduce a Non-Linear Interaction Representation (NLIR) that applies one-sided non-linear multiplicative transformations to the hidden states. Extensive experiments on public benchmarks and Tencent’s advertising platform demonstrate state-of-the-art performance, while detailed analyses confirm that TokenFormer significantly improves dimensional robustness and representation discriminability under unified modeling.

Keywords: Recommender Systems, Multi-field Categorical Features, Sequential Models, Unified Recommendation Architecture, BFTS Attention, Non-Linear Interaction Representation, Sequential Collapse Propagation, Dimensional Robustness
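
The Bottom-Full-Top-Sliding pattern described in the abstract (full self-attention in lower layers, shrinking-window sliding attention in upper layers) can be pictured as a stack of per-layer attention masks. The sketch below is only an illustration: the abstract does not state the window sizes, the shrinking schedule, or whether the window is causal, so the symmetric window, the halving schedule, and all names here are assumptions.

```python
def bfts_masks(seq_len, num_layers, num_full_layers, first_window):
    """Build one boolean attention mask per layer.

    mask[i][j] is True when token i may attend to token j.
    Lower layers use full attention; upper layers use a symmetric
    sliding window that halves at each successive layer (assumed schedule).
    """
    masks = []
    window = first_window
    for layer in range(num_layers):
        if layer < num_full_layers:
            mask = [[True] * seq_len for _ in range(seq_len)]
        else:
            mask = [[abs(i - j) < window for j in range(seq_len)]
                    for i in range(seq_len)]
            window = max(1, window // 2)
        masks.append(mask)
    return masks


def allowed_pairs(mask):
    """Count the (query, key) pairs a mask permits."""
    return sum(sum(row) for row in mask)


# Hypothetical configuration: 4 layers, the bottom 2 with full attention.
masks = bfts_masks(seq_len=6, num_layers=4, num_full_layers=2, first_window=4)
print([allowed_pairs(m) for m in masks])
```

In this toy configuration the number of permitted token pairs stays at its maximum in the bottom layers and then decreases layer by layer, which is the qualitative shape the BFTS name suggests.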

55. ❌ Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents

Authors: Li Chen Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
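
The arithmetic behind these tables can be sketched as follows. The passing-score formula (5.0 + 0.8 × total weight) and the per-keyword maximum of 15 come from the report's configuration section; the per-row formula in `keyword_score` is an assumption chosen so that weight 1.0 and relevance 10/10 give that maximum, and both function names are hypothetical. Under any multiplicative formula of this kind, a weight of 0.0 yields a score of 0.0 regardless of relevance, which is what every row above shows.

```python
def passing_score(weights, base=5.0, factor=0.8):
    """Passing score as stated in the report configuration."""
    return base + factor * sum(weights)


def keyword_score(weight, relevance, max_relevance=10.0, max_score=15.0):
    """Assumed per-keyword score: weight times scaled relevance.

    This formula is an illustration only; the report does not state it.
    """
    return weight * (relevance / max_relevance) * max_score


threshold = round(passing_score([1.0] * 27), 1)   # 27 keywords, weight 1.0 each
zero_weight_row = keyword_score(0.0, 10.0)        # a row with weight 0.0
```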

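Scoring rationale: The paper proposes a three-layer cognitive architecture for autonomous agents (the Tri-Spirit Architecture) that decomposes intelligence into planning, reasoning, and execution layers and maps each to different compute hardware. It is highly relevant to the following keywords: 1) 'LLM Agents/Autonomous Agents/Agentic Workflow' (10): the architecture design of autonomous agents is the paper's core; 2) 'Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning' (8 each): the paper explicitly makes reasoning a key layer of the architecture (the Agent Layer); 3) 'Large Language Models/LLMs/Foundation Models' and 'Small Language Models/SLMs/On-device AI' (8 each): the paper covers reducing LLM invocations and completing tasks offline, both tied to hardware deployment. It is moderately relevant to 'Tool Use/Function Calling/API Tool Use', 'Multi-agent Systems/Agent Coordination', and 'Speculative Decoding/Inference Acceleration' (5 each), through execution-layer functionality, system coordination, and latency optimization respectively. Other keywords, such as MoE, training methods, RAG, and quantization, are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

Current autonomous AI systems treat intelligence as a single monolithic process across heterogeneous hardware, which leads to high latency and energy consumption. The paper proposes a three-layer cognitive architecture (Tri-Spirit) that decomposes intelligence into planning, reasoning, and execution layers mapped to different compute substrates. In simulation it reduces mean task latency by 75.6% and energy consumption by 71.1%, cuts LLM invocations by 30%, and reaches a 77.6% offline task completion rate.

Abstract (Chinese translation)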

下一代自主人工智能系统不仅将受限于模型能力，更将受制于智能在异构硬件间的组织方式。当前的主流范式——以云端为中心的人工智能、设备端推理以及边缘-云端流水线——将规划、推理与执行视为单一过程，导致不必要的延迟、能耗以及行为连续性的割裂。本文提出三灵架构，这是一种三层认知框架，将智能分解为规划层、推理层与执行层，分别映射至不同的计算基底，并通过异步消息总线进行协调。我们通过参数化路由策略、将重复推理路径提升为零推理执行策略的习惯编译机制、收敛式记忆模型以及明确的安全约束，对该系统进行了形式化描述。我们在2000项合成任务的可复现仿真中，对比了以云端为中心和纯边缘的基线方案，对该架构进行了评估。三灵架构将平均任务延迟降低了75.6%，能耗降低了71.1%，同时将大语言模型调用次数减少了30%，并实现了77.6%的离线任务完成率。这些结果表明，认知分解而非单纯的模型扩展，是提升人工智能硬件系统级效率的主要驱动力。

Abstract

The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms – cloud-centric AI, on-device inference, and edge-cloud pipelines – treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.

Keywords: Autonomous Agents, Cognitive Architecture, AI Hardware, Heterogeneous Computing, Task Latency Reduction, Energy Efficiency, LLM Invocation Optimization, Offline Task Completion
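
The habit-compilation mechanism mentioned in the abstract (promoting repeated reasoning paths into zero-inference execution policies) can be illustrated with a toy cache. This is a minimal sketch, not the paper's formalization: the class `HabitCompiler`, the `promote_after` threshold, and the use of an exact task string as the lookup key are all assumptions made for illustration.

```python
from collections import Counter


class HabitCompiler:
    """Toy habit compilation: cache reasoning results that keep recurring."""

    def __init__(self, reason_fn, promote_after=3):
        self.reason_fn = reason_fn          # stands in for an LLM-backed reasoning layer
        self.promote_after = promote_after  # hypothetical promotion threshold
        self.seen = Counter()               # (task, action) -> times reasoning produced it
        self.habits = {}                    # task -> compiled action (reflex policy)
        self.reasoning_calls = 0

    def act(self, task):
        if task in self.habits:
            # Reflex path: answered from the compiled habit, no inference.
            return self.habits[task], "reflex"
        # Reasoning path: call the (expensive) reasoning function.
        self.reasoning_calls += 1
        action = self.reason_fn(task)
        self.seen[(task, action)] += 1
        if self.seen[(task, action)] >= self.promote_after:
            self.habits[task] = action
        return action, "reasoning"


# A stand-in reasoning function; a real system would invoke a model here.
compiler = HabitCompiler(reason_fn=lambda task: task.upper(), promote_after=3)
routes = [compiler.act("open door")[1] for _ in range(5)]
```

After the same task has produced the same action `promote_after` times, later requests skip the reasoning call entirely, which is the kind of effect that would reduce LLM invocations and allow offline completion.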

56. ❌ Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

Authors: Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13733v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

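Scoring rationale: The paper studies combining reinforcement learning (RL) with vision-language-action (VLA) models for robotic manipulation tasks. Although it applies a large model (the VLA model), its core contribution is an improvement to the RL algorithm (the VLAJS method), not an innovation in large-model technology itself. All scoring keywords target specific technical directions in large models and deep learning (such as MoE, quantization, alignment, and reasoning), and the paper does not explore any of them in depth; it only uses the VLA model as an external source of guidance. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes VLAJS, a method that combines sparse guidance from a vision-language-action model with on-policy reinforcement learning to address inefficient exploration and poor credit assignment in robotic manipulation. It significantly improves sample efficiency in simulation, and a subset of tasks is validated on a real robot.

Abstract (Chinese translation)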

强化学习（RL）能够为机器人操作提供高频、闭环控制，但由于探索效率低下和信用分配不佳，将其扩展至具有稀疏或不完美奖励的长周期任务仍然困难。视觉-语言-动作（Vision-Language-Action, VLA）模型利用大规模多模态预训练提供通用型任务级推理，但现有局限性阻碍了其在快速精确操作中的直接应用。本文提出视觉-语言-动作跳跃启动（VLAJS）方法，通过将稀疏的VLA引导与在线策略RL相结合，以改善探索和学习效率。VLAJS将VLA模型视为高层动作建议的瞬时来源，用于引导早期探索并优化信用分配，同时保留RL基于状态的高频控制特性。我们的方法通过方向性动作一致性正则化增强近端策略优化（Proximal Policy Optimization, PPO），在训练初期将智能体的动作与VLA引导进行软对齐，无需强制严格模仿、依赖演示数据或持续查询教师策略。VLA引导以稀疏方式应用并随时间衰减，使智能体能够在线适应并最终超越引导策略。我们在六项具有挑战性的操作任务上评估VLAJS：模拟环境中的抓举、抓放、钉孔重定向、钉孔插入、戳动和推动，并在真实Franka Panda机器人上验证了部分任务。VLAJS在样本效率上持续优于PPO及蒸馏式基线方法，在多项任务中减少超过50%的环境交互需求。真实世界实验证明了零样本仿真到现实的迁移能力，以及在杂乱环境、物体变化和外部干扰下的鲁棒执行性能。

Abstract

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent’s actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.

Keywords: Reinforcement Learning, Vision-Language-Action Models, Robot Manipulation, Exploration Efficiency, Proximal Policy Optimization, Sample Efficiency, Sim-to-real Transfer, Action Regularization
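
The idea of adding a directional action-consistency term to the PPO loss, applied sparsely and annealed over time, can be sketched as below. The abstract does not give the exact form of the regularizer, so the cosine-based penalty, the linear annealing schedule, and the names `guidance_weight` and `regularized_loss` are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two action vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def guidance_weight(step, anneal_steps, initial_weight=1.0):
    """Linearly anneal the guidance weight to zero (assumed schedule)."""
    return initial_weight * max(0.0, 1.0 - step / anneal_steps)


def regularized_loss(ppo_loss, agent_action, vla_action, step, anneal_steps):
    """PPO loss plus a directional penalty toward the VLA suggestion.

    `vla_action` is None on steps where the VLA was not queried, which
    models the sparse guidance described in the abstract.
    """
    if vla_action is None:
        return ppo_loss
    penalty = 1.0 - cosine_similarity(agent_action, vla_action)
    return ppo_loss + guidance_weight(step, anneal_steps) * penalty


# Early in training an orthogonal action is penalized; late in training it is not.
early = regularized_loss(0.5, [1.0, 0.0], [0.0, 1.0], step=0, anneal_steps=100)
late = regularized_loss(0.5, [1.0, 0.0], [0.0, 1.0], step=100, anneal_steps=100)
```

Because the penalty depends only on direction and its weight decays to zero, the agent is nudged toward the suggested motion early on but is free to depart from it later, consistent with the abstract's claim that the agent can ultimately surpass the guiding policy.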

57. ❌ FRAGATA: Semantic Retrieval of HPC Support Tickets via Hybrid RAG over 20 Years of Request Tracker History

Authors: Santiago Paramés-Estévez, Nicolás Filloy-Montesino, Jorge Fernández-Fabeiro, José Carlos Mouriño-Gallego Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13721v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

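Scoring rationale: The paper studies a semantic-retrieval search system for HPC support tickets; its core is information retrieval rather than large-model technology. It is highly relevant only to 'Retrieval-Augmented Generation (RAG)' (10), because the system is essentially retrieval-augmented semantic search. It is moderately relevant to 'AI for Science' (5), because it is deployed for support at a supercomputing centre, an application of AI in a scientific setting. All other keywords are unrelated (0), because the paper does not cover large models, training, inference optimization, alignment, agents, or similar topics.

!!! tip deepseek-chat TL;DR

Addressing the inadequate search function of the traditional ticketing system at a supercomputing centre, the paper develops a hybrid RAG system based on semantic retrieval that finds historical tickets across languages and despite typos. Preliminary results show a substantial qualitative improvement over RT's native search.

Abstract (Chinese translation)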

超级计算中心的技术支持团队在数十年的运营中积累了大量的已解决事件记录，这些构成了关键的操作知识。在加利西亚超级计算中心（CESGA），此类历史记录已通过请求跟踪器（Request Tracker，RT）管理超过二十年，但其内置搜索引擎存在显著局限性，阻碍了支持人员对知识的有效复用。本文提出Fragata——一个语义工单检索系统，该系统将现代信息检索技术与完整的RT历史记录相结合。该系统能够跨越语言差异、拼写错误或查询措辞的具体差异，准确检索到相关的历史事件。该架构部署于CESGA的基础设施之上，支持无需服务中断的增量更新，并将计算最密集的环节卸载至FinisTerrae III超级计算机处理。初步实验结果表明，该系统相比RT原生搜索功能实现了质的显著提升。

Abstract

The technical support team of a supercomputing centre accumulates, over the course of decades, a large volume of resolved incidents that constitute critical operational knowledge. At the Galician Supercomputing Center (CESGA) this history has been managed for over twenty years with Request Tracker (RT), whose built-in search engine has significant limitations that hinder knowledge reuse by the support staff. This paper presents Fragata, a semantic ticket search system that combines modern information retrieval techniques with the full RT history. The system can find relevant past incidents regardless of language, the presence of typos, or the specific wording of the query. The architecture is deployed on CESGA’s infrastructure, supports incremental updates without service interruption, and offloads the most expensive stages to the FinisTerrae III supercomputer. Preliminary results show a substantial qualitative improvement over RT’s native search.

Keywords: semantic retrieval, HPC support tickets, hybrid RAG, Request Tracker, information retrieval, supercomputing center, knowledge reuse, incremental updates

58. ❌ Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Authors: Yanfeng Shi, Pengfei Cai, Jun Liu, Qing Gu, Nan Jiang, Lirong Dai, Ian McLoughlin, Yan Song Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

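Scoring rationale: The paper focuses on fine-grained temporal perception in Large Audio-Language Models (LALMs); its core contributions are the Audio-Side Time Prompt and the TimePro-RL framework. Highly relevant keywords: 1) 'Large Language Models OR LLMs OR Foundation Models' (10): the paper explicitly studies large audio-language models, which fall under large models; 2) 'Post-training OR Supervised Fine-tuning OR SFT' (10): the paper uses supervised fine-tuning (SFT) as the base stage and then applies reinforcement learning on top of it. Other keywords, such as MoE, SLMs, Scaling Laws, RAG, and Quantization, are not covered or mentioned in the paper and score 0.

!!! tip deepseek-chat TL;DR

Large audio-language models are weak at fine-grained temporal perception (e.g., inferring event onset and offset). The paper proposes the Audio-Side Time Prompt and the TimePro-RL framework, which encode timestamps as embeddings and optimize temporal alignment with reinforcement learning, achieving significant gains on audio grounding, sound event detection, and dense audio captioning.

Abstract (Chinese translation)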

大型音频语言模型（LALMs）能够实现通用音频理解，并在多种音频任务中展现出卓越性能。然而，这些模型在时间感知（例如推断事件起始与结束点）方面仍面临挑战，导致其在细粒度场景中的应用受限。为解决这一问题，我们提出音频侧时间提示，并利用强化学习（RL）开发了用于细粒度时间感知的TimePro-RL框架。具体而言，我们将时间戳编码为嵌入向量，并将其作为时间坐标交错插入音频特征序列中以提示模型。此外，我们在监督微调（SFT）后引入强化学习，以直接优化时间对齐性能。实验表明，TimePro-RL在一系列音频时间任务（如音频定位、声音事件检测和密集音频描述）中均取得显著性能提升，验证了其强大的有效性。

Abstract

Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.

Keywords: Large Audio-Language Models, temporal perception, Audio-Side Time Prompt, Reinforcement Learning, Supervised Fine-Tuning, audio grounding, sound event detection, dense audio captioning
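
The step of encoding timestamps as embeddings and interleaving them within the audio feature sequence can be sketched as follows. The abstract does not specify the embedding or the spacing, so the sinusoidal encoding, the fixed `stride`, and the function names are assumptions for illustration; the sketch also assumes an even feature dimension.

```python
import math


def time_embedding(seconds, dim):
    """Sinusoidal embedding of a timestamp (an assumed encoding; dim must be even)."""
    emb = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        emb.extend([math.sin(seconds * freq), math.cos(seconds * freq)])
    return emb


def interleave_time_prompts(audio_frames, frame_seconds, stride):
    """Insert a time embedding before every `stride`-th audio frame."""
    dim = len(audio_frames[0])
    sequence = []
    for i, frame in enumerate(audio_frames):
        if i % stride == 0:
            # Temporal coordinate marking where this frame sits in the clip.
            sequence.append(time_embedding(i * frame_seconds, dim))
        sequence.append(frame)
    return sequence


# Six 4-dimensional frames, 0.5 s apart, with a time prompt every 3 frames.
frames = [[0.0] * 4 for _ in range(6)]
sequence = interleave_time_prompts(frames, frame_seconds=0.5, stride=3)
```

The resulting sequence carries explicit time markers alongside the audio features, so a model reading it can refer to absolute positions when predicting onsets and offsets.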

59. ❌ Beyond Arrow’s Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration

Authors: Sayan Kumar Chaki, Antoine Gourru, Julien Velcin Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13705v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

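Scoring rationale: The paper studies the emergence of fairness in multi-agent collaboration and centrally involves LLM agents, multi-agent systems, ethical alignment, and RAG. It is highly relevant to LLMs, agents, multi-agent systems, alignment, and RAG (10 each), moderately relevant to AI for Science (5), and does not touch the other keywords (0).

!!! tip deepseek-chat TL;DR

The paper studies how fairness emerges from interaction and negotiation between agents in a multi-agent collaboration framework. It finds that even when an individual agent is biased, the agents' joint decision can satisfy fairness criteria, which repositions fairness as an emergent property of decentralized agent interaction.

Abstract (Chinese translation)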

语言模型的公平性通常被视为单一、集中优化模型的属性进行研究。随着大语言模型日益具备自主行动能力，我们提出公平性通过交互与协商而涌现的观点。本研究通过受控的医院分诊框架进行验证，在该框架中两个智能体经过三轮结构化辩论进行协商。其中一个智能体通过检索增强生成技术（RAG）与特定伦理框架对齐，而另一个智能体则保持非对齐状态或被对抗性提示设定为更重视人口统计特征而非临床需求。研究发现：对齐机制系统性地塑造了协商策略与资源分配模式；任一智能体的独立分配方案均未达到伦理充分性，但二者联合达成的最终分配方案却能满足任何单一方都无法独自实现的公平标准。对齐智能体通过辩论而非强制覆盖的方式部分调节偏见，其作用类似于修正补丁——能够恢复边缘群体的就医机会，但无法完全扭转带有偏见的对立方的立场。我们进一步观察到，即使经过显式对齐的智能体仍对某些伦理框架表现出内在偏好，这与大语言模型已知的左倾倾向相一致。我们将这些局限性联系至阿罗不可能性定理：任何聚合机制都无法同时满足集体理性的所有要求，而多智能体审议是在这一约束条件下进行导航而非解决约束。本研究将公平性重新定位为去中心化智能体交互中涌现的、具有程序性的属性，并将系统而非单个智能体确立为更合适的评估单元。

Abstract

Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent’s allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow’s Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.

Keywords: multi-agent collaboration, fairness, large language models, retrieval-augmented generation, ethical alignment, agent negotiation, emergent property, hospital triage

60. ❌ Med-CAM: Minimal Evidence for Explaining Medical Decision Making

Authors: Pirzada Suhail, Aditya Anand, Amit Sethi Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13695v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

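Scoring rationale: The paper focuses on explainable AI for medical imaging diagnosis and proposes the Med-CAM framework for generating evidence-based explanations. It is unrelated to most keywords, which mainly concern large-model technology, training methods, and inference optimization. It is highly relevant only to 'Mechanistic Interpretability OR Explainable AI' (10; the paper's core content) and is related to 'AI for Science OR Bioinformatics OR Cheminformatics' (8; an application of AI in biomedicine, though not the core innovation).

!!! tip deepseek-chat TL;DR

Addressing the lack of interpretability in medical imaging AI systems, the paper proposes the Med-CAM framework, which trains a segmentation network to produce minimal evidence maps. These yield explanations that are faithful to the model's decision and understandable to clinicians, improving transparency and trust.

Abstract (Chinese translation)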

在医学影像领域，可靠且可解释的决策至关重要，因为诊断结果直接影响患者诊疗。尽管深度学习已取得进展，但大多数医学人工智能系统仍作为不透明的“黑箱”运行，几乎无法解释其得出特定诊断的原因。本文提出Med-CAM框架，该框架通过分类器激活匹配生成最小化且清晰的证据图，为医学决策提供基于证据的解释。Med-CAM从头训练分割网络以生成掩码，该掩码能突出显示对模型决策至关重要的最小证据区域，适用于任何已见或未见图像。这确保解释既忠实于网络行为，又对临床医生具有可解释性。实验表明，与先前仅能提供模糊相对重要性区域的空间解释方法（如Grad-CAM和注意力图）不同，Med-CAM凭借其对形状、纹理和边界的卓越空间感知能力，可提供结论性的、基于证据的解释，忠实地复现模型对任意给定图像的预测。通过明确约束解释的紧凑性、与模型激活的一致性及诊断对齐性，Med-CAM推动了透明人工智能的发展，在病理学和放射学等高风险医学应用中促进临床医生的理解与信任。

Abstract

Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to the model’s decision for any seen or unseen image. This ensures that the explanation is both faithful to the network’s behaviour and interpretable to clinicians. Experiments show that, unlike prior spatial explanation methods such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM, with its superior spatial awareness of shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model’s prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostically aligned, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.

Keywords: medical imaging, interpretable AI, explainable decision making, classifier activation matching, evidence-based explanations, transparent AI, clinical trust, Med-CAM
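
The trade-off the abstract describes (a mask that keeps only the minimal evidence while the classifier's activations on the masked image still match those on the full image) can be written as a two-term objective. The abstract does not state the actual loss, so the squared-error matching term, the mask-area penalty, the `area_weight` value, and the function names below are all assumptions for illustration.

```python
def apply_mask(image, mask):
    """Element-wise product of a 2D image and a 2D mask with values in [0, 1]."""
    return [[p * m for p, m in zip(prow, mrow)] for prow, mrow in zip(image, mask)]


def explanation_loss(classifier, image, mask, area_weight=0.1):
    """Activation-matching error plus a compactness penalty (assumed objective).

    `classifier` maps an image to a list of activations; a real system would
    use a trained network here.
    """
    full = classifier(image)
    masked = classifier(apply_mask(image, mask))
    # Faithfulness: activations on the masked image should match the original.
    matching = sum((f - g) ** 2 for f, g in zip(full, masked)) / len(full)
    # Compactness: fraction of the image the mask keeps.
    area = sum(sum(row) for row in mask) / (len(mask) * len(mask[0]))
    return matching + area_weight * area


# Toy classifier whose activations depend only on the two diagonal pixels.
toy_classifier = lambda img: [img[0][0], img[1][1]]
image = [[1.0, 5.0], [5.0, 2.0]]
minimal = explanation_loss(toy_classifier, image, [[1, 0], [0, 1]])
everything = explanation_loss(toy_classifier, image, [[1, 1], [1, 1]])
nothing = explanation_loss(toy_classifier, image, [[0, 0], [0, 0]])
```

In this toy setup the mask that keeps only the pixels the classifier actually uses scores lower than both the full mask and the empty mask, which is the behaviour a minimal-evidence explanation is meant to have.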

61. ❌ Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

Authors: Chenghao Sun, Chengsheng Zhang, Guanzheng Qin, Rui Dai, Xinmei Tian Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13694v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	8.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

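Scoring rationale: The paper's core is mechanistic interpretability of large language models (LLMs). It proposes Weight Patching, which performs intervention analysis in parameter space and is an innovation in the technical principles of LLMs. It is highly relevant to 'Mechanistic Interpretability OR Explainable AI' (10) and to 'Large Language Models OR LLMs OR Foundation Models' (10). The paper instantiates the method on instruction following, giving moderate relevance to 'Instruction Tuning OR Alignment OR Value Alignment' (5). It notes that the method can guide mechanism-aware model merging, giving strong relevance to 'Model Merging OR Model Soups OR Weight Averaging' (8). Other keywords, such as MoE, quantization, inference acceleration, and AI-for-science applications, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes Weight Patching, which uses parameter-space intervention to achieve source-level mechanistic localization in large language models. It applies the method to analyze instruction following and finds that the recovered component scores can guide mechanism-aware model merging, improving selective fusion.

Abstract (Chinese translation)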

机制可解释性旨在将模型行为定位到因果实现该行为的内部组件。先前研究已推进了激活空间定位与因果追踪方法，但激活空间中表现重要的模块可能仅聚合或放大上游信号，而非在其自身参数中编码目标能力。为弥补这一空白，我们提出权重修补法——一种面向源分析的参数空间干预方法，适用于在特定输入下目标能力表达强度存在差异的配对同架构模型。给定基础模型及其行为特化对应模型，权重修补法在固定输入下将特化模型中选定模块的权重替换到基础模型中。我们在指令跟随任务上实例化该方法，并引入以向量锚点行为接口为核心的框架，该接口为开放式生成中任务相关控制状态是否形成或恢复提供了共享的内部判据。在此框架下，分析揭示了从浅层候选源侧载体到聚合路由模块，再至下游执行电路的层级结构。复原的组件评分还可指导机制感知的模型融合，在评估的专家组合中提升选择性融合效果，并提供额外的外部验证。

Abstract

Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters. To address this gap, we propose Weight Patching, a parameter-space intervention method for source-oriented analysis in paired same-architecture models that differ in how strongly they express a target capability under the inputs of interest. Given a base model and a behavior-specialized counterpart, Weight Patching replaces selected module weights from the specialized model into the base model under a fixed input. We instantiate the method on instruction following and introduce a framework centered on a vector-anchor behavioral interface that provides a shared internal criterion for whether a task-relevant control state has been formed or recovered in open-ended generation. Under this framework, the analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation and routing modules, and further to downstream execution circuits. The recovered component scores can also guide mechanism-aware model merging, improving selective fusion across the evaluated expert combinations and providing additional external validation.

Keywords: Weight Patching, Mechanistic Interpretability, Large Language Models, Parameter-space Intervention, Instruction Following, Model Merging, Source-level Localization, Causal Analysis
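
The core operation described in the abstract (copying selected module weights from a specialized model into a same-architecture base model and re-running a fixed input) can be sketched with weights stored in plain dictionaries. The per-module "recovery fraction" used as the score here is an assumption for illustration; the paper's own criterion is its vector-anchor behavioral interface, which is not reproduced. All function and module names are hypothetical.

```python
def weight_patch(base_weights, specialized_weights, modules):
    """Return a copy of the base model with the named modules replaced."""
    patched = dict(base_weights)
    for name in modules:
        patched[name] = specialized_weights[name]
    return patched


def patch_scores(base_weights, specialized_weights, run_model, fixed_input):
    """Score each module by how much of the behaviour gap it recovers alone.

    `run_model(weights, x)` returns a scalar behaviour measure; the recovery
    fraction is an assumed metric, not the paper's.
    """
    base_out = run_model(base_weights, fixed_input)
    spec_out = run_model(specialized_weights, fixed_input)
    gap = spec_out - base_out
    scores = {}
    for name in base_weights:
        patched = weight_patch(base_weights, specialized_weights, [name])
        patched_out = run_model(patched, fixed_input)
        scores[name] = (patched_out - base_out) / gap if gap else 0.0
    return scores


# Toy two-module "model": output = layer2 * (layer1 * x).
run_toy = lambda w, x: w["layer2"] * (w["layer1"] * x)
base = {"layer1": 1.0, "layer2": 1.0}
specialized = {"layer1": 3.0, "layer2": 1.0}
scores = patch_scores(base, specialized, run_toy, fixed_input=2.0)
```

In the toy example only `layer1` differs between the two models, so patching it alone recovers the whole gap while patching `layer2` recovers none, which illustrates how a parameter-space intervention can point to where a capability is stored.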

62. ❌ Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

Authors: Yizhao Xu, Hongyuan Zhu, Caiyun Liu, Tianfu Wang, Keyu Chen, Sicheng Xu, Jiaolong Yang, Nicholas Jing Yuan, Qi Zhang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13688v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

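Scoring rationale: The paper focuses on 3D editing. It proposes the BVE framework and a self-constructed dataset, augments an image-to-3D generative architecture with lightweight modules, and introduces an annotation-free 3D masking strategy. The scoring keywords all concern large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics, whereas this paper addresses a computer-vision task in 3D generation and editing. It contributes no innovation in large-model or deep-learning principles and is not applied to a scientific domain such as bioinformatics. All keywords therefore score 0.

!!! tip deepseek-chat TL;DR

To address the limitations of existing 3D editing methods in semantic consistency, local invariance, and dataset availability, the paper proposes the Beyond Voxel 3D Editing (BVE) framework. Using a self-constructed large-scale dataset, lightweight add-on modules, and an annotation-free 3D masking strategy, it generates high-quality, text-aligned 3D assets while preserving the visual characteristics of the original input.

Abstract (Chinese translation)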

三维编辑指对三维资产进行局部或全局修改的能力。有效的三维编辑需通过提示实现局部语义一致性修改，同时保持局部不变性以确保未编辑区域与原始状态一致。然而现有方法存在显著局限：多视图编辑方法在投影回三维空间时会产生信息损失，而基于体素的编辑方法在可修改区域和修改尺度方面均受限制。此外，缺乏足够大规模的训练与评估专用编辑数据集仍是当前挑战。为应对这些问题，我们提出了超越体素的三维编辑框架，并构建了专门针对三维编辑任务的大规模数据集。基于该数据集，我们的模型通过在基础图像到三维生成架构中嵌入轻量化可训练模块，实现了无需昂贵全模型重训练的高效文本语义注入。此外，我们提出无需标注的三维掩码策略以保持局部不变性，在编辑过程中完整保留未修改区域的原始特征。大量实验表明，该框架在生成高质量文本对齐三维资产方面表现优异，同时能忠实保持原始输入的视觉特性。

Abstract

3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.

Keywords: 3D editing, Beyond Voxel Editing, self-constructed dataset, text-to-3D generation, local invariance, 3D masking, lightweight modules, semantic consistency

63. ❌ IndicDB – Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Authors: Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13686v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

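Scoring rationale: The paper centers on evaluating LLMs (DeepSeek, MiniMax, LLaMA, Qwen3) on multilingual Text-to-SQL, so it is highly relevant to “Large Language Models” (10 points). It builds its dataset with a three-agent framework (Architect, Auditor, Refiner), which relates to “LLM Agents” and “Multi-agent Systems” (8 points each), although the paper does not examine agent coordination or workflows in depth. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper presents IndicDB, a multilingual Text-to-SQL benchmark for Indian languages. Evaluating several state-of-the-art LLMs, it finds a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, structural ambiguity, and limited external knowledge.

Abstract (Chinese translation)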

尽管大语言模型（LLM）显著提升了文本到SQL的性能，现有基准测试却主要集中于西方语境和简化模式，在真实世界的非西方应用场景中存在空白。我们提出了IndicDB，一个用于评估跨印度语系多种语言间跨语言语义解析的多语言文本到SQL基准。其关系模式来源于开放数据平台，包括国家数据与分析平台（NDAP）和印度数据门户（IDP），确保了现实行政数据的复杂性。IndicDB涵盖20个数据库，涉及237张表。为将非规范化的政府数据转化为丰富的关系结构，我们采用了一个迭代的三智能体框架（架构师、审计员、优化器）以确保结构严谨性和高关系密度（平均每个数据库11.85张表；连接深度可达六层）。我们的流程具备数值感知、难度校准和连接强制特性，生成了涵盖英语、印地语及五种印度语言的15,617项任务。我们评估了前沿模型（DeepSeek v3.2、MiniMax 2.7、LLaMA 3.3、Qwen3）在七种语言变体上的跨语言语义解析性能。结果显示，从英语到印度语言的性能下降了9.00%，揭示出一个由更困难的模式链接、更高的结构歧义性以及有限的外部知识所驱动的“印度语言差距”。IndicDB为多语言文本到SQL领域提供了一个严谨的基准。代码与数据：https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/

Abstract

While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/

Keywords: Multilingual Text-to-SQL, Large Language Models, Indic Languages, Benchmark, Cross-lingual Semantic Parsing, Schema Linking, Agent Framework, Indic Gap

64. ❌ Automatically Inferring Teachers’ Geometric Content Knowledge: A Skills Based Approach

Authors: Ziv Fenigstein, Kobi Gal, Avi Segal, Osama Swidan, Inbal Israel, Hassan Ayoob Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13666v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

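Scoring rationale: The paper explicitly uses large language models (LLMs) and retrieval-augmented generation (RAG) to automatically assess teachers’ geometric reasoning levels, so both keywords are highly relevant (10 points each). As an AI application in education, it has some relevance to “AI for Science” (5 points). Other keywords such as MoE, SFT, and quantization are not mentioned in the abstract and are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study develops an automated method, based on large language models and retrieval-augmented generation, for diagnosing teachers’ Van Hiele geometric reasoning levels from open-ended responses. Experiments show that classification that incorporates skills information significantly outperforms the baselines.

Abstract (Chinese translation)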

评估教师的几何内容知识对于几何教学质量和学生学习至关重要，但难以大规模实施。范希尔模型通过五个层次化的水平来描述几何推理能力。传统的范希尔评估依赖于专家对开放式回答的人工分析，这一过程耗时、成本高昂，且无法实现大规模评估。本研究基于教育理论，开发了一种利用大语言模型自动诊断教师范希尔推理水平的方法。我们的核心假设是：整合显性的技能信息能显著提升范希尔水平分类的准确性。通过与数学教育研究者合作，我们构建了一个结构化的技能词典，将范希尔各水平分解为33项细粒度的推理技能。通过一个定制的网络平台，31名职前教师解答了几何问题，共产生了226份回答。随后，专家研究者为每份回答标注了其范希尔水平，并标明了回答中所展现的技能词典中的技能。利用这个已标注的数据集，我们实施了两种分类方法：（1）检索增强生成（RAG）和（2）多任务学习（MTL）。每种方法都将融合了技能词典的“技能感知”变体与不含技能信息的基线模型进行了比较。结果表明，对于两种方法，“技能感知”变体在多项评估指标上均显著优于基线模型。这项工作首次提供了从开放式回答中进行范希尔水平分类的自动化方法。它提供了一种可扩展的、基于理论的方法来评估教师的几何推理能力，从而能够实现大规模评估，并支持自适应的、个性化的教师学习系统。

Abstract

Assessing teachers’ geometric content knowledge is essential for geometry instructional quality and student learning, but difficult to scale. The Van Hiele model characterizes geometric reasoning through five hierarchical levels. Traditional Van Hiele assessment relies on manual expert analysis of open-ended responses. This process is time-consuming, costly, and prevents large-scale evaluation. This study develops an automated approach for diagnosing teachers’ Van Hiele reasoning levels using large language models grounded in educational theory. Our central hypothesis is that integrating explicit skills information significantly improves Van Hiele classification. In collaboration with mathematics education researchers, we built a structured skills dictionary decomposing the Van Hiele levels into 33 fine-grained reasoning skills. Through a custom web platform, 31 pre-service teachers solved geometry problems, yielding 226 responses. Expert researchers then annotated each response with its Van Hiele level and demonstrated skills from the dictionary. Using this annotated dataset, we implemented two classification approaches: (1) retrieval-augmented generation (RAG) and (2) multi-task learning (MTL). Each approach compared a skills-aware variant incorporating the skills dictionary against a baseline without skills information. Results showed that for both methods, skills-aware variants significantly outperformed baselines across multiple evaluation metrics. This work provides the first automated approach for Van Hiele level classification from open-ended responses. It offers a scalable, theory-grounded method for assessing teachers’ geometric reasoning that can enable large-scale evaluation and support adaptive, personalized teacher learning systems.

Keywords: large language models, retrieval-augmented generation, Van Hiele model, geometric reasoning, teacher assessment, automated classification, educational technology, multi-task learning

65. ❌ Ordinary Least Squares is a Special Case of Transformer

Authors: Xiaojun Tan, Yuchen Zhao Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13656v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

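Scoring rationale: Through a rigorous algebraic proof, the paper characterizes the statistical nature of the Transformer architecture, showing that Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. This relates directly to “Large Language Models OR LLMs OR Foundation Models” because the Transformer is the core architecture of LLMs, and it is highly relevant to “Mechanistic Interpretability OR Explainable AI” because the work aims to explain how Transformers operate internally. The remaining keywords concern specific LLM techniques, applications, or optimizations, whereas this paper is a theoretical analysis of the base architecture, so their relevance is low or zero.

!!! tip deepseek-chat TL;DR

Through a mathematical proof, the paper characterizes the statistical nature of the Transformer architecture, showing that Ordinary Least Squares is a special case of the single-layer Linear Transformer. Building on this, it examines the decoupled slow and fast memory mechanisms in Transformers and the evolution from the linear prototype to standard Transformers.

Abstract (Chinese translation)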

Transformer架构的统计本质长期以来难以捉摸：它是一个通用逼近器，还是已知计算算法的神经网络版本？通过严格的代数证明，我们表明后者更能描述Transformer的基本性质：普通最小二乘法（Ordinary Least Squares, OLS）是单层线性Transformer的一个特例。利用经验协方差矩阵的谱分解，我们构建了一种特定的参数设置，使得注意力机制的前向传播在数学上等价于OLS的闭式投影。这意味着注意力机制可以通过一次前向传播而非迭代来解决问题。基于这一原型案例，我们进一步揭示了Transformer内部解耦的慢速与快速记忆机制。最后，我们讨论了从已建立的线性原型到标准Transformer的演变过程。这一进展促进了Hopfield能量函数从线性到指数记忆容量的转变，从而在现代深度架构与经典统计推断之间建立了清晰的连续性。

Abstract

The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer’s basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism’s forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.

Keywords: Transformer, Ordinary Least Squares, Attention Mechanism, Statistical Inference, Linear Transformer, Memory Mechanism, Hopfield Energy Function, Architecture Analysis
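The claim in the abstract above, that one forward pass of linear (softmax-free) attention can reproduce the OLS closed form, can be checked numerically on a tiny problem. The sketch below is a simplification and not the authors' construction: it folds the inverse Gram matrix $(X^\top X)^{-1}$ directly into the query projection rather than using the spectral decomposition described in the paper, and every function name and data value is made up for illustration. With keys $x_i$, values $y_i$, and query $q = (X^\top X)^{-1} x_q$, the attention output $\sum_i (q \cdot x_i)\, y_i$ equals $x_q^\top (X^\top X)^{-1} X^\top y$, the OLS prediction.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def gram(X: Sequence[Vector]) -> List[List[float]]:
    """X^T X for two-feature inputs."""
    return [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]


def inverse_2x2(m: Sequence[Vector]) -> List[List[float]]:
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("Gram matrix is singular; the OLS solution is not unique")
    return [[d / det, -b / det], [-c / det, a / det]]


def ols_predict(X: Sequence[Vector], y: Vector, x_query: Vector) -> float:
    """Closed-form OLS: beta = (X^T X)^{-1} X^T y, prediction = x_query . beta."""
    g_inv = inverse_2x2(gram(X))
    xty = [sum(row[j] * target for row, target in zip(X, y)) for j in range(2)]
    beta = [dot(g_inv[0], xty), dot(g_inv[1], xty)]
    return dot(beta, x_query)


def linear_attention_predict(X: Sequence[Vector], y: Vector, x_query: Vector) -> float:
    """One linear-attention pass: keys are x_i, values are y_i, no softmax.

    Assumption for this sketch: the query projection W_Q is set to (X^T X)^{-1}.
    """
    g_inv = inverse_2x2(gram(X))
    q = [dot(g_inv[0], x_query), dot(g_inv[1], x_query)]
    scores = [dot(q, key) for key in X]  # unnormalized attention scores
    return sum(score * value for score, value in zip(scores, y))


# Made-up training set and query point.
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_train = [1.0, 2.0, 3.0, 5.0]
x_new = [3.0, 1.0]

print(ols_predict(X_train, y_train, x_new))
print(linear_attention_predict(X_train, y_train, x_new))
```

Both calls return the same value up to floating-point rounding, because the two functions evaluate the same matrix product in a different order.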

66. ❌ A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies

Authors: Yu Lei, Minghuan Liu, Abhiram Maddukuri, Zhenyu Jiang, Yuke Zhu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13645v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

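Scoring rationale: The paper studies the mechanisms of co-training for robot policies. It belongs to robot learning and is unrelated to most of the large-model keywords. It has some relevance to “Mechanistic Interpretability OR Explainable AI” (8 points) because it analyzes the internal mechanisms of co-training, and weak relevance to “Pre-training OR Continual Pre-training OR Domain Adaptation” (5 points) because it involves training on cross-domain data. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

Through theoretical analysis and experiments, the paper uncovers the mechanisms behind sim-and-real co-training of generative robot policies, identifies two key effects (structured representation alignment and importance reweighting), and develops an improved method.

Abstract (Chinese translation)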

协同训练通过将有限的领域内真实数据与丰富的替代数据（如仿真或跨具身机器人数据）相结合，被广泛用于训练生成式机器人策略。尽管其在实践中取得了成功，但决定协同训练何时及为何有效的机制仍不甚明晰。我们通过理论分析和实证研究探讨了仿真与真实数据协同训练的机制，并识别出两个主导性能的内在效应。其一为“结构化表征对齐”，它反映了跨领域表征对齐与领域可区分性之间的平衡，并对下游性能起主要作用。其二为“重要性重加权效应”，它源于对动作权重的领域依赖性调节，在次要层面发挥作用。我们通过在玩具模型上的对照实验以及大量的仿真-仿真、仿真-真实机器人操作实验验证了这些效应。我们的分析为近期协同训练技术提供了统一解释，并启发了一种能持续改进现有方法的简单方案。更广泛而言，我们的目标是剖析协同训练的内部机制，并推动该方向的研究。

Abstract

Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment robot data, is widely used for training generative robot policies. Despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood. We investigate the mechanism of sim-and-real co-training through theoretical analysis and empirical study, and identify two intrinsic effects governing performance. The first, “structured representation alignment”, reflects a balance between cross-domain representation alignment and domain discernibility, and plays a primary role in downstream performance. The second, the “importance reweighting effect”, arises from domain-dependent modulation of action weighting and operates at a secondary level. We validate these effects with controlled experiments on a toy model and extensive sim-and-sim and sim-and-real robot manipulation experiments. Our analysis offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches. More broadly, our aim is to examine the inner workings of co-training and to facilitate research in this direction.

Keywords: co-training, generative robot policies, sim-and-real, structured representation alignment, importance reweighting, robot manipulation, mechanistic analysis, cross-domain

67. ❌ SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment

Authors: Xixun Lin, Yang Liu, Yancheng Chen, Yongxuan Wu, Yucheng Ning, Yilong Liu, Nan Sun, Shun Zhang, Bin Chong, Chuan Zhou, Yanan Cao, Li Guo Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13630v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

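Scoring rationale: The paper focuses on security architecture design for LLM-based agents and is highly relevant to “Large Language Models”, “LLM Agents”, and “Tool Use” (10 points each), since these are its core subjects. Other keywords, such as MoE, SLMs, training methods, reasoning techniques, and scientific applications, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

To address security vulnerabilities in the execution harness of deployed LLM agents, the paper proposes the SafeHarness security architecture, whose four defense layers reduce the unsafe behavior rate and the attack success rate by approximately 38% and 42%, respectively.

Abstract (Chinese translation)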

大型语言模型（LLM）智能体的性能关键取决于其执行框架——这一系统层负责协调工具使用、上下文管理与状态持久化。然而，正是这种架构上的核心地位使得执行框架成为高价值的攻击面：框架层面的单点被攻破可能在整个执行流水线中引发连锁反应。我们观察到，现有的安全方法存在结构性错配问题，导致其无法感知框架内部状态，且难以在智能体运行的不同阶段间进行协调。本文提出 SafeHarness 安全架构，通过将四层防御机制直接嵌入智能体生命周期，以应对上述重大局限：在输入处理阶段进行对抗性上下文过滤，在决策阶段实施分层因果验证，在动作执行阶段采用权限分离的工具控制，以及在状态更新阶段实现带自适应降级的安全回滚。所提出的跨层机制将这些防御层紧密联结，能够在检测到持续异常时逐步提升验证严格度、触发回滚操作并收紧工具权限。我们在多种框架配置下基于基准数据集评估 SafeHarness，在涵盖六类威胁的五种攻击场景中与四种安全基线方案进行比较。相较于无保护基线，SafeHarness 平均将不安全行为率（UBR）降低约38%，攻击成功率（ASR）降低约42%，在保持核心任务效用的同时显著降低了不安全行为与攻击成功的发生率。

Abstract

The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from structural mismatch, leaving them blind to harness-internal state and unable to coordinate across the different phases of agent operation. In this paper, we introduce SafeHarness, a security architecture in which four proposed defense layers are woven directly into the agent lifecycle to address these significant limitations: adversarial context filtering at input processing, tiered causal verification at decision making, privilege-separated tool control at action execution, and safe rollback with adaptive degradation at state update. The proposed cross-layer mechanisms tie these layers together, escalating verification rigor, triggering rollbacks, and tightening tool privileges whenever sustained anomalies are detected. We evaluate SafeHarness on benchmark datasets across diverse harness configurations, comparing against four security baselines under five attack scenarios spanning six threat categories. Compared to the unprotected baseline, SafeHarness achieves an average reduction of approximately 38% in UBR and 42% in ASR, substantially lowering both the unsafe behavior rate and the attack success rate while preserving core task utility.

Keywords: LLM agents, security architecture, tool use, execution harness, adversarial attacks, defense layers, unsafe behavior rate, attack success rate

68. ❌ Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

Authors: Ahmet Tuğrul Bayrak, Mustafa Sertaç Türkel, Fatma Nur Korkmaz Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13620v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

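Scoring rationale: The paper is highly relevant only to “Large Language Models OR LLMs OR Foundation Models” (10 points), because the abstract explicitly states that Qwen LLMs were used to generate the synthetic dataset. No other keyword is mentioned in the title or abstract or relates to the paper’s topic of dialogue dataset creation and evaluation, so each scores 0.

!!! tip deepseek-chat TL;DR

To address the lack of high-quality turn-taking prediction datasets for Turkish dialogue, the paper generates the synthetic dataset Syn-TurnTurk with Qwen large language models and evaluates several models on it, finding that BI-LSTM and ensemble methods improve prediction accuracy and support more natural human-machine interaction.

Abstract (Chinese translation)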

管理自然对话节奏是基于语音的聊天机器人面临的一项重大挑战。当前大多数系统通常依赖简单的静默检测，但由于人类语音模式包含不规则的停顿，这种方法常常失效，导致机器人打断用户发言，破坏对话流畅性。对于土耳其语等缺乏高质量话轮转换预测数据集的语言，这一问题尤为严重。本文提出Syn-TurnTurk——一个使用多种Qwen大语言模型生成的合成土耳其语对话数据集，该数据集模拟包含重叠发言与策略性停顿的真实口语交流。我们采用多种传统机器学习架构和深度学习架构对数据集进行了评估。结果表明，先进模型（特别是BI-LSTM和集成学习[LR+RF]方法）取得了较高的准确率（0.839）和AUC分数（0.910）。这些发现证明，我们的合成数据集能够有效提升模型对语言线索的识别能力，从而为土耳其语人机交互实现更自然的对话体验。

Abstract

Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interrupt users, breaking the conversational flow. This problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). These findings demonstrate that our synthetic dataset can help models understand linguistic cues, allowing for more natural human-machine interaction in Turkish.

Keywords: turn-taking prediction, Turkish dialogues, synthetic dataset, large language models, Qwen LLMs, BI-LSTM, ensemble methods, human-machine interaction

69. ❌ Golden Handcuffs make safer AI agents

Authors: Aram Ebtekar, Michael K. Cohen Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13609v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

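Scoring rationale: The paper studies safety mechanisms for reinforcement learning agents (Bayesian risk aversion and a mentor override mechanism). It belongs to general AI safety and does not involve large models, deep-learning principles, or specific scientific applications. The keywords all focus on large-model techniques, training methods, inference optimization, application frameworks, or AI for science, and have no direct connection to the paper’s theory of safe reinforcement learning.

!!! tip deepseek-chat TL;DR

The paper studies the safety problem of reinforcement learning agents attaining high reward through unintended strategies. It proposes a Bayesian risk-aversion approach with a mentor override mechanism and proves that the resulting agent remains capable (sublinear regret) and safe (it triggers no decidable low-complexity predicate before a mentor does).

Abstract (Chinese translation)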

强化学习智能体可能通过新颖的非预期策略获得高额奖励。本研究提出一种适用于通用环境的贝叶斯缓解方法：我们将智能体的主观奖励范围扩展至包含一个较大的负值$-L$，而真实环境的奖励值域为$[0,1]$。在持续观察到高奖励后，贝叶斯策略会对可能导向$-L$的新颖方案产生风险规避倾向。我们设计了一种简单的覆盖机制：当预测价值低于固定阈值时，系统将控制权移交至安全指导者。我们证明了该智能体具备两个特性：（一）能力：通过以渐近消失的频率进行指导者引导的探索，该智能体相对于其最优指导者能够实现次线性遗憾。（二）安全性：任何可判定的低复杂度谓词在优化策略触发之前，均会先被指导者触发。

Abstract

Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent’s subjective reward range to include a large negative value $-L$, while the true environment’s rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. (ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.

Keywords: Reinforcement Learning, AI Safety, Bayesian Policy, Risk Aversion, Mentor Override, Regret Analysis, Safe Agents, Novel Strategies
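
The override rule described in the abstract above can be illustrated with a toy sketch. This is not the paper's algorithm: it collapses the Bayesian posterior into a single hypothetical probability `p_catastrophe` that a novel action leads to the penalty $-L$, and the names `pessimistic_value` and `choose_controller`, as well as all numeric values, are assumptions made purely for illustration.

```python
def pessimistic_value(p_catastrophe: float, reward_if_safe: float, penalty_l: float) -> float:
    """Expected one-step value when a novel action may yield -L.

    Toy stand-in for a Bayesian value estimate: with probability
    p_catastrophe the outcome is -penalty_l, otherwise reward_if_safe in [0, 1].
    """
    return (1.0 - p_catastrophe) * reward_if_safe - p_catastrophe * penalty_l


def choose_controller(predicted_value: float, threshold: float) -> str:
    """Yield control to the mentor when the predicted value is below the threshold."""
    return "mentor" if predicted_value < threshold else "agent"


# Hypothetical numbers: a large L makes even a 1% catastrophe belief decisive.
familiar = pessimistic_value(0.0, 0.9, penalty_l=100.0)
novel = pessimistic_value(0.01, 1.0, penalty_l=100.0)
print(choose_controller(familiar, threshold=0.5))
print(choose_controller(novel, threshold=0.5))
```

In this toy setting the familiar action keeps the agent in control, while the novel action, despite a higher nominal reward, drops below the threshold and hands control to the mentor.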

70. ❌ Design Space Exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease

Authors: Muhammad Kashif, Hanzalah Mohamed Siraj, Nouhaila Innan, Alberto Marchisio, Muhammad Shafique Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13608v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

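Scoring rationale: The paper studies the application of hybrid quantum neural networks (HQNNs) to chronic kidney disease diagnosis, which falls under AI for Science, so it has some relevance to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, it does not involve large language models (LLMs), the technical principles of deep learning, or any of the other specific techniques in the keyword list (such as MoE, quantization, or inference acceleration), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

Through a systematic design space exploration, the paper studies hybrid quantum neural networks for chronic kidney disease diagnosis and finds that compact architectures combined with appropriate encodings achieve the best trade-off between accuracy, robustness, and efficiency.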

Abstract (Chinese translation)

混合量子神经网络（Hybrid Quantum Neural Networks，HQNNs）作为近期量子机器学习领域的一种新兴范式，展现出广阔的应用前景。然而，其实际性能在很大程度上取决于设计选择，如经典数据到量子数据的编码方式、量子电路架构、测量策略以及测量次数。本文针对慢性肾脏病（Chronic Kidney Disease，CKD）诊断任务，对HQNNs进行了全面的设计空间探索。我们使用一个经过精心整理和预处理的临床数据集，对通过组合五种编码方案、五种纠缠架构、五种测量策略以及五种不同测量次数设置所得到的625种不同HQNN模型进行了基准测试。为确保评估的公平性与稳健性，所有模型均采用10折分层交叉验证进行训练，并在测试集上通过一套综合指标进行评估，包括准确率、曲线下面积（AUC）、F1分数以及综合性能得分。我们的研究结果表明，编码选择与电路架构之间存在显著且复杂的相互作用，高性能并不一定需要大量参数或复杂电路。具体而言，我们发现紧凑的架构结合适当的编码方式（例如，采用IQP编码与环形纠缠结构）能够在准确性、鲁棒性和效率之间实现最佳平衡。除绝对性能分析外，我们还就不同设计维度如何影响HQNNs的学习行为提供了具有实践指导意义的见解。

Abstract

Hybrid Quantum Neural Networks (HQNNs) have recently emerged as a promising paradigm for near-term quantum machine learning. However, their practical performance strongly depends on design choices such as classical-to-quantum data encoding, quantum circuit architecture, measurement strategy and shots. In this paper, we present a comprehensive design space exploration of HQNNs for Chronic Kidney Disease (CKD) diagnosis. Using a carefully curated and preprocessed clinical dataset, we benchmark 625 different HQNN models obtained by combining five encoding schemes, five entanglement architectures, five measurement strategies, and five different shot settings. To ensure fair and robust evaluation, all models are trained using 10-fold stratified cross-validation and assessed on a test set using a comprehensive set of metrics, including accuracy, area under the curve (AUC), F1-score, and a composite performance score. Our results reveal strong and non-trivial interactions between encoding choices and circuit architectures, showing that high performance does not necessarily require large parameter counts or complex circuits. In particular, we find that compact architectures combined with appropriate encodings (e.g., IQP with Ring entanglement) can achieve the best trade-off between accuracy, robustness, and efficiency. Beyond absolute performance analysis, we also provide actionable insights into how different design dimensions influence learning behavior in HQNNs.

Keywords: Hybrid Quantum Neural Networks, Chronic Kidney Disease, Design Space Exploration, Quantum Machine Learning, Clinical Dataset, Encoding Schemes, Quantum Circuit Architecture, Performance Evaluation
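
The figure of 625 models in the abstract above follows from a full grid over four design dimensions with five options each (5 × 5 × 5 × 5). A minimal sketch of enumerating such a grid is shown below. Only "IQP" and "Ring" are named in the abstract; every other option label is a placeholder, and `build_design_space` is a hypothetical helper, not the authors' code.

```python
from itertools import product

# Only "IQP" and "Ring" appear in the abstract; all other labels are placeholders.
ENCODINGS = ["IQP", "encoding_2", "encoding_3", "encoding_4", "encoding_5"]
ENTANGLEMENTS = ["Ring", "entanglement_2", "entanglement_3", "entanglement_4", "entanglement_5"]
MEASUREMENTS = ["measurement_1", "measurement_2", "measurement_3", "measurement_4", "measurement_5"]
SHOT_SETTINGS = ["shots_1", "shots_2", "shots_3", "shots_4", "shots_5"]


def build_design_space():
    """Enumerate every combination of the four design dimensions."""
    keys = ("encoding", "entanglement", "measurement", "shots")
    grid = product(ENCODINGS, ENTANGLEMENTS, MEASUREMENTS, SHOT_SETTINGS)
    return [dict(zip(keys, combo)) for combo in grid]


design_space = build_design_space()
print(len(design_space))  # 5 * 5 * 5 * 5 configurations
```

Each dictionary in `design_space` would then correspond to one model configuration to train and evaluate under cross-validation.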

71. ❌ BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

Authors: Sebastian Nagl, Matthias Grabmair Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

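Scoring rationale: The paper focuses on building an evaluation framework for LLMs in the legal domain and is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points), because it explicitly mentions "Evaluating large language models (LLMs) for legal reasoning" and "configurable LLM runs". The paper does not address the other keywords, such as MoE, SLMs, Scaling Laws, the various training techniques (Pre-training, SFT, RLHF, etc.), inference optimization (RAG, Attention, CoT), agent systems, model compression, hallucination mitigation, interpretability, model merging, or in-context learning, so they score 0. The paper is an application study of large models in a specific domain (law), which meets the research background's requirement of "research applications of large models in different domains", but it involves neither innovation in technical principles nor scientific applications, so the other keywords are not relevant.

!!! tip deepseek-chat TL;DR

The paper presents the BenGER framework, a collaborative web platform for end-to-end benchmarking of German legal tasks. It addresses the fragmentation, limited transparency, and limited reproducibility of existing LLM legal evaluation workflows, and will be demonstrated through a live deployment.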

Abstract (Chinese translation)

评估大型语言模型在法律推理中的表现，需要涵盖任务设计、专家标注、模型执行和基于指标的评估工作流程。实践中，这些步骤分散在不同平台和脚本中，限制了透明度、可复现性以及非技术背景法律专家的参与。我们提出BenGER（德国法律基准测试）框架，这是一个开源网络平台，集成了任务创建、协作标注、可配置的大型语言模型运行，以及基于词汇、语义、事实和法官评判指标的评估功能。BenGER通过租户隔离和基于角色的访问控制支持多机构合作项目，并可选择性地为标注者提供基于参考依据的形成性反馈。我们将通过现场部署演示端到端的基准创建与分析过程。

Abstract

Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

Keywords: large language models, legal reasoning, benchmarking, German law, evaluation framework, collaborative annotation, web platform, end-to-end workflow

72. ❌ Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals

Authors: Mahmoud Fakhry, Abeer FathAllah Brery Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13567v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

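Scoring rationale: The paper studies how to optimize window shape and length for short-time feature extraction in heart sound signal classification, using a bidirectional LSTM network. All keywords concern large models, the technical principles of deep learning, or related application areas, whereas the paper only applies conventional deep learning (biLSTM) to biomedical signal processing and does not involve large models, MoE, quantization, inference acceleration, alignment, RAG, or other modern large-model techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper is a bioinformatics/medical AI application, though not an application of large models in that field, hence 5 points (some relevance). The other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper experimentally evaluates how different window shapes and lengths affect heart sound signal classification performance and finds that the Gaussian window performs best, with the 75 ms Gaussian window outperforming a baseline method.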

Abstract (Chinese translation)

心音信号，即心音图（PCG）信号，可用于心血管潜在病理的自动诊断。此类分类任务可通过双向长短期记忆（biLSTM）网络处理，该网络基于从已标注PCG信号中提取的特征进行训练。鉴于PCG信号的非平稳性，建议使用特定形状和长度的滑动窗口从信号的多个短时段中提取特征。然而，部分窗口存在不良的频谱旁瓣，会导致特征失真。因此，根据分类性能调整窗口形状和长度更为可取。本研究对三种窗口形状（每种形状对应三种窗口长度）进行了实验评估。基于提取的统计特征对biLSTM网络进行训练和测试，并根据窗口形状和长度报告性能表现。结果表明：使用高斯窗分割信号时获得最佳性能，而长度为75毫秒时三角窗与高斯窗性能相当。尽管矩形窗是常见的备选方案，但其分割信号的效果最差。此外，采用75毫秒高斯窗获得的分类性能优于基准方法。

Abstract

Heart sound signals, phonocardiography (PCG) signals, allow for the automatic diagnosis of potential cardiovascular pathology. Such classification task can be tackled using the bidirectional long short-term memory (biLSTM) network, trained on features extracted from labeled PCG signals. Regarding the non-stationarity of PCG signals, it is recommended to extract the features from multiple short-length segments of the signals using a sliding window of certain shape and length. However, some window contains unfavorable spectral side lobes, which distort the features. Accordingly, it is preferable to adapt the window shape and length in terms of classification performance. We propose an experimental evaluation for three window shapes, each with three window lengths. The biLSTM network is trained and tested on statistical features extracted, and the performance is reported in terms of the window shapes and lengths. Results show that the best performance is obtained when the Gaussian window is used for splitting the signals, and the triangular window competes with the Gaussian window for a length of 75 ms. Although the rectangular window is a commonly offered option, it is the worst choice for splitting the signals. Moreover, the classification performance obtained with a 75 ms Gaussian window outperforms that of a baseline method.

Keywords: heart sound signals, phonocardiography, biLSTM, window shapes, feature extraction, classification, Gaussian window, cardiovascular pathology
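
The short-time feature extraction described in the abstract above (a sliding window of a given shape and length, followed by statistical features per segment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 2 kHz sampling rate, the 50% hop, the Gaussian width `sigma_ratio`, and the choice of mean and standard deviation as the statistical features are all hypothetical.

```python
import math


def make_window(shape, n, sigma_ratio=0.4):
    """Return an n-sample window; sigma_ratio is a hypothetical Gaussian width."""
    if n < 2:
        raise ValueError("window needs at least 2 samples")
    centre = (n - 1) / 2
    if shape == "rectangular":
        return [1.0] * n
    if shape == "triangular":
        return [1.0 - abs(i - centre) / (n / 2) for i in range(n)]
    if shape == "gaussian":
        sigma = sigma_ratio * centre
        return [math.exp(-0.5 * ((i - centre) / sigma) ** 2) for i in range(n)]
    raise ValueError(f"unknown window shape: {shape}")


def short_time_features(signal, fs_hz, win_ms, shape, hop_ratio=0.5):
    """Slide a window over the signal and return (mean, std) per segment."""
    n = int(round(fs_hz * win_ms / 1000))
    hop = max(1, int(n * hop_ratio))
    window = make_window(shape, n)
    features = []
    for start in range(0, len(signal) - n + 1, hop):
        seg = [s * w for s, w in zip(signal[start:start + n], window)]
        mean = sum(seg) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in seg) / n)
        features.append((mean, std))
    return features


# Synthetic 1 s test tone at a hypothetical 2 kHz sampling rate.
fs = 2000
tone = [math.sin(2 * math.pi * 50 * t / fs) for t in range(fs)]
for shape in ("rectangular", "triangular", "gaussian"):
    feats = short_time_features(tone, fs, win_ms=75, shape=shape)
    print(shape, len(feats))
```

The per-segment feature sequence is what a sequence classifier such as a biLSTM would consume. The rectangular window applies no tapering, which is why its spectral side lobes are the largest of the three shapes.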

73. ❌ UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

Authors: Yunkai Dang, Minxin Dai, Yuekun Yang, Zhangnan Li, Wenbin Li, Feng Miao, Yang Gao Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

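Scoring rationale: UHR-BAT focuses on visual token compression for ultra-high-resolution remote sensing imagery and proposes a query-guided, region-faithful token compression framework. All keywords relate to large language models (LLMs), whereas this paper studies a vision-language model applied to remote sensing, at the intersection of computer vision and natural language processing, and does not address core LLM techniques such as pre-training, fine-tuning, inference optimization, alignment, or agents. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because remote sensing is a scientific application (earth science, environmental monitoring), though it is neither bioinformatics nor cheminformatics, hence 5 points (some relevance). Other keywords such as model compression and inference acceleration relate to compression in a broad sense, but the paper targets visual tokens rather than model weights or decoding, so they are judged not relevant.

!!! tip deepseek-chat TL;DR

To address the explosion in visual token count in ultra-high-resolution remote sensing imagery, which makes it hard to extract information from small objects, the paper proposes UHR-BAT, a query-guided, region-faithful token compression framework that efficiently selects visual tokens under a strict context budget and achieves state-of-the-art performance.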

Abstract (Chinese translation)

超高分辨率（UHR）遥感影像将公里级的空间上下文与可能仅占据数个像素的关键查询证据相结合。这种巨大的空间尺度导致视觉标记数量呈二次方爆炸式增长，阻碍了从小目标中提取信息。先前的研究采用直接下采样、密集分块或全局Top-K剪枝等方法，这些方法要么牺牲了关键查询的图像细节，要么引入了不可预测的计算开销。本文提出UHR-BAT，一种查询引导且区域保真的标记压缩框架，以在严格的计算预算下高效选择视觉标记。具体而言，我们利用文本引导的多尺度重要性估计方法对视觉标记进行评估，有效解决了实现精准且低成本特征提取的挑战。此外，通过引入区域级保留与合并策略，我们减少了视觉标记的冗余，进一步降低了计算成本。实验结果表明，UHR-BAT在多个基准测试中均达到了最先进的性能。代码将在https://github.com/Yunkaidang/UHR公开。

Abstract

Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.

Keywords: Ultra-high-resolution remote sensing, Vision-language model, Token compression, Query-guided, Region-faithful, Computational budget, Multi-scale importance estimation, State-of-the-art performance
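
The idea of selecting visual tokens under a strict budget while staying faithful to regions can be illustrated with a small sketch. The scheme below is an assumption made for illustration and is not the UHR-BAT method: it takes per-token importance scores and region labels as given, first keeps the highest-scoring token of each region, and then fills the remaining budget by global score. The function name `select_tokens` and its inputs are hypothetical.

```python
def select_tokens(scores, regions, budget):
    """Return sorted indices of at most `budget` tokens.

    Pass 1 keeps the best token of each region (a simple region guarantee);
    pass 2 fills any remaining budget with the highest-scoring leftovers.
    """
    if budget <= 0:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, seen_regions = [], set()
    for i in order:
        if len(kept) >= budget:
            break
        if regions[i] not in seen_regions:
            seen_regions.add(regions[i])
            kept.append(i)
    kept_set = set(kept)
    for i in order:
        if len(kept) >= budget:
            break
        if i not in kept_set:
            kept.append(i)
            kept_set.add(i)
    return sorted(kept)


# Hypothetical scores: region 0 dominates, yet region 1 keeps one token.
print(select_tokens([0.9, 0.8, 0.1, 0.2], [0, 0, 1, 1], budget=2))
```

A plain global top-k with the same budget would keep only the two tokens from region 0. The per-region pass is what prevents a low-scoring but distinct region from being dropped entirely.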

74. ❌ CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

Authors: Shivika, Kartik Bose, Pankaj Gupta Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13561v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

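Scoring rationale: The paper studies a vision-language model for medical imaging (contrastive alignment of 3D abdominal CT with radiology reports), which falls under AI for Science (bioinformatics/medical image analysis), so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). The paper also examines the effect of data scaling on performance, which has some relevance to "Scaling Laws AND Data Quality" (5 points), although the focus is not scaling laws for large models but an experimental analysis of specific dataset sizes. The remaining keywords concern specific techniques, training methods, inference optimization, or agent systems for large language models (LLMs), whereas this paper studies the application of a vision-language model (VLM) in medicine and does not address LLMs, MoE, SFT, RLHF, RAG, attention optimization, chain of thought, agents, quantization, or similar topics, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The study examines how the normal-to-abnormal sample ratio within training batches and the data scale affect the zero-shot diagnostic performance of Merlin, a contrastive-learning model that aligns 3D abdominal CT volumes with radiology reports. It finds that the stochastic diversity of random sampling is more effective than engineered class balancing, and that performance improves sub-linearly with data scale.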

Abstract (Chinese translation)

采用对比学习方法在配对的医学影像与报告上训练的视觉-语言模型展现出强大的零样本诊断能力，然而对于三维医学成像，训练批次构成对所学表征的影响尚未得到探索。我们复现了Merlin模型——一种通过对称InfoNCE损失将三维腹部CT影像与放射学报告对齐的双编码器模型，在30种影像表现上实现了74.45%的零样本宏观F1分数（原论文：73.00%）。随后我们研究了两个变动维度。首先，我们在完整数据集上采用章节级平衡采样，将训练批次内的正常与异常样本比例控制在25:75、50:50和75:25。三种配置的表现均低于非平衡基线2.4至2.8个百分点，其中75:25比例在平衡变体中取得最佳结果（72.02%）。其次，我们在包含4,362例研究的子集上进行数据规模消融实验，分别使用20%、40%和100%的数据进行训练。性能从65.26%至71.88%呈次线性增长，不同影像表现的数据敏感性差异显著。在该子集上强制采用50:50平衡采样会进一步将性能降低至68.01%，这证实了无论数据集或平衡粒度如何，显式的类别平衡都会损害性能。我们的研究结果表明，在三维医学影像所需的小批次规模下，随机采样的随机多样性结合Merlin模型在解剖子章节上的交替批处理策略，比人为设计的类别比例能提供更有效的正则化效果。

Abstract

Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin’s alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.

Keywords: Vision-language models, Contrastive learning, 3D abdominal CT, Zero-shot learning, Batch composition, Data scaling, Medical imaging, InfoNCE loss
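
The symmetric InfoNCE loss mentioned in the abstract above can be written out in a few lines. The sketch below is a generic, dependency-free version of the standard formulation (cross-entropy over cosine-similarity logits in both the image-to-text and text-to-image directions, averaged). It is not taken from the Merlin code base; the temperature value of 0.07 is an assumption, and embeddings are assumed to be non-zero vectors.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows i (matching pairs lie on the diagonal)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    logits = [[sum(a * b for a, b in zip(u, v)) / temperature for v in txt] for u in img]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because every other item in the batch acts as a negative, the composition of a batch directly shapes the loss, which is the mechanism behind the batch-composition experiments summarized above.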

75. ❌ Training-Free Test-Time Contrastive Learning for Large Language Models

Authors: Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13552v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

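Scoring rationale: The paper proposes TF-TTCL, a training-free test-time adaptation method based on contrastive learning, to improve the reasoning performance of frozen large language models under distribution shift. The core relevant keywords are: 1) "Large Language Models" (the object of study); 2) "Chain of Thought" (different reasoning trajectories are generated through multi-agent role-playing); 3) "System 2 Thinking" (in-depth reasoning patterns are involved); 4) "Self-Correction" (the model learns from its own inference experiences through contrastive experience distillation); 5) "LLM Agents" and "Multi-agent Systems" (multi-agent role-playing is used for semantic query augmentation). Other keywords, such as MoE, quantization, and AI for science, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

To address the performance degradation of large language models under distribution shift, the paper proposes TF-TTCL, a training-free test-time contrastive learning framework. Through multi-agent role-playing that generates different reasoning trajectories, contrastive experience distillation, and contextual rule retrieval, a frozen model can dynamically adjust its reasoning patterns. On both closed-ended and open-ended tasks, it consistently outperforms strong zero-shot baselines and representative test-time adaptation methods.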

Abstract (Chinese translation)

大语言模型（LLM）展现出强大的推理能力，但其性能在分布偏移下往往会出现下降。现有的测试时适应（TTA）方法依赖于基于梯度的更新，这需要白盒访问权限并带来可观的开销，而免训练替代方案要么是静态的，要么依赖于外部指导。本文提出免训练测试时对比学习（TF-TTCL），这是一个免训练的适应框架，它通过从模型自身的推理经验中提炼监督信号，使一个冻结的LLM能够在线上过程中持续改进。具体而言，TF-TTCL通过三个核心模块实现了一个动态的“探索-反思-引导”循环：1）语义查询增强首先通过多智能体角色扮演来多样化问题视角，以生成不同的推理轨迹；2）对比经验蒸馏随后捕捉优质与劣质轨迹之间的语义差距，将其提炼为明确的文本规则；3）上下文规则检索最终在推理过程中激活这些存储的规则，动态地引导冻结的LLM走向稳健的推理模式，同时避免已观察到的错误。在封闭式推理任务和开放式评估任务上进行的大量实验表明，在线上评估中，TF-TTCL始终优于强零样本基线方法和代表性的TTA方法。代码发布于 https://github.com/KevinSCUTer/TF-TTCL。

Abstract

Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning (TF-TTCL), a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic “Explore-Reflect-Steer” loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.

Keywords: Training-Free Adaptation, Test-Time Contrastive Learning, Large Language Models, Distribution Shift, Multi-agent Role-playing, Reasoning Trajectories, Contrastive Experience Distillation, Dynamic Inference Steering
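
The third module in the abstract above, retrieving stored textual rules and injecting them into the prompt of a frozen model, can be sketched in miniature. The code below is not the TF-TTCL implementation: the `RuleStore` class, the word-overlap (Jaccard) similarity used as a stand-in for whatever retrieval the authors use, and the prompt layout in `build_prompt` are all hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings (stand-in for a real retriever)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


class RuleStore:
    """Keeps (query, rule) pairs distilled from earlier inference experiences."""

    def __init__(self):
        self._items = []

    def add(self, query, rule):
        self._items.append((query, rule))

    def retrieve(self, query, k=2):
        """Return up to k rules whose source query overlaps with the new query."""
        scored = [(jaccard(query, q), rule) for q, rule in self._items]
        scored = [item for item in scored if item[0] > 0.0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [rule for _, rule in scored[:k]]


def build_prompt(query, rules):
    """Prepend retrieved rules to the query; the model weights stay untouched."""
    header = "\n".join(f"- {rule}" for rule in rules)
    return f"Rules learned from earlier attempts:\n{header}\n\nQuestion: {query}"


store = RuleStore()
store.add("solve the linear equation", "Check the sign when moving terms across the equals sign.")
store.add("translate the sentence to French", "Keep named entities unchanged.")
print(build_prompt("solve this equation", store.retrieve("solve this equation", k=1)))
```

Since adaptation happens only through the prompt, this style of steering needs no gradient access, which matches the training-free setting the abstract describes.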

76. ❌ Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

Authors: Yibo Jiang, Tao Wu, Rui Jiang, Yehao Lu, Chaoxiang Cai, Zequn Qin, Xi Li Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13540v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

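Scoring rationale: The paper focuses on improving the generation capability of unified multimodal models (UMMs) and proposes the UniRect-CoT framework, which uses the model's inherent understanding, through reflective chain-of-thought and self-correction mechanisms, to activate knowledge and rectify generated results. The core relevant keywords are "Chain of Thought" (highly relevant, 10 points) and "Self-Correction" (highly relevant, 10 points); "System 2 Thinking" has some relevance (8 points), because the paper draws on the human "Thinking-While-Drawing" paradigm of in-depth reasoning. The other keywords mainly target pure language models, specific techniques (such as MoE, quantization, and RAG), or AI-for-science applications and are unrelated to the paper's focus on multimodal visual generation, so they all score 0.

!!! tip deepseek-chat TL;DR

To address the mismatch between understanding and generation capability in unified multimodal models, the paper proposes UniRect-CoT, a training-free framework that activates the model's internal knowledge through reflective chain-of-thought and self-correction mechanisms, significantly improving generation quality on multimodal generation tasks.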

Abstract (Chinese translation)

统一多模态模型旨在将视觉理解与生成能力整合于单一架构之中。然而，这些模型存在显著的能力不匹配问题：其理解能力明显优于生成能力。这种不匹配表明，模型丰富的内部知识虽然在理解任务中表现有效，但在生成过程中未能被充分激活。为解决这一问题，我们受人类“边绘边思”范式的启发——人类通过持续反思来激活知识并修正中间结果。本文提出UniRect-CoT，一种无需训练的统一修正思维链框架。该方法通过挖掘统一多模态模型强大内在理解能力中隐藏的“免费增益”，在生成过程中持续反思，激活其内部知识并修正中间生成结果。我们将统一多模态模型中的扩散去噪过程视为内在的视觉推理过程，并将中间结果与模型理解的目标指令对齐，以此作为修正模型生成的自监督信号。大量实验表明，UniRect-CoT能够轻松集成到现有统一多模态模型中，显著提升各类复杂任务的生成质量。

Abstract

Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model’s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human “Thinking-While-Drawing” paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the “free lunch” hidden in the UMM’s powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.

Keywords: Unified Multimodal Models, Generation Enhancement, Reflective Rectification, Chain-of-Thought, Inherent Understanding, Training-free Framework, Visual Reasoning, Self-supervisory Signal

77. ❌ RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

Authors: Renqi Chen, Zeyin Tao, Jianming Guo, Jing Wang, Zezhou Xu, Jingzhe Zhu, Qingqing Sun, Tianyi Zhang, Shuai Chen Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13531v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

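Scoring rationale: The paper studies an evaluation benchmark for GUI agents in e-commerce risk management. Its core concerns the application and evaluation of LLM agents, so that keyword is highly relevant (10 points). The paper mentions evaluating foundation models, so it is moderately related to the first keyword (8 points). Other keywords, such as MoE, SFT, RAG, and reasoning techniques, are either not mentioned in the abstract or unrelated to the paper’s topic, and are therefore scored 0.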
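
The totals in these score tables can be reproduced with a few lines of arithmetic. The sketch below takes the pass mark from the report configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). It also assumes that each keyword contributes weight × relevance, capped at the configured per-keyword maximum of 15. That per-keyword rule and the function names are illustrative assumptions, not the report generator’s actual code.

```python
def pass_mark(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows, max_per_keyword: float = 15.0) -> float:
    """Sum per-keyword scores over (weight, relevance) rows.

    ASSUMPTION: each keyword contributes weight * relevance, capped at
    max_per_keyword. The report does not spell this rule out.
    """
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)


# Rows for paper 77 as listed in its table: two keywords with non-zero
# relevance, 25 keywords with relevance 0.0, and every weight shown as 0.0.
rows_77 = [(0.0, 8.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25

threshold = pass_mark(total_weight=27.0)
score = paper_score(rows_77)
print(f"score={score:.1f} threshold={threshold:.1f} passed={score >= threshold}")
```

Because every weight in the table is listed as 0.0, the total is 0.0 whatever the relevance values are, which is why the paper falls below the 26.6 pass mark.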

!!! tip deepseek-chat TL;DR

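The paper introduces RiskWebWorld, an interactive benchmark for evaluating GUI agents in e-commerce risk management. It finds that top-tier generalist models reach only 49.1% success, while agentic reinforcement learning improves open-source model performance by 16.2%.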

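Abstract translation

Graphical user interface (GUI) agents have shown strong capabilities in automating web tasks, but existing interactive benchmarks mainly target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as real-world e-commerce risk management remains underexplored. To close this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld contains 1,513 tasks drawn from production risk-control pipelines across 8 core domains, and captures the real challenges of risk operations on uncooperative websites and under partial environment hijacking. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Evaluating a range of models reveals a pronounced capability gap: top-tier generalist models achieve a 49.1% success rate, whereas specialized open-weight GUI models fail almost completely. This indicates that, for long-horizon professional tasks, foundation model scale currently matters more than zero-shot interface grounding. We also validate the infrastructure through agentic RL, which improves open-source model performance by 16.2%. These results make RiskWebWorld a practical testbed for developing robust digital workers.

Abstract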

Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.

Keywords: GUI agents, e-commerce risk management, interactive benchmark, foundation models, agentic reinforcement learning, RiskWebWorld, long-horizon professional tasks, digital workers
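
The Gymnasium-compliant design mentioned in the abstract separates the policy from the environment behind a `reset`/`step` interface. The following is a minimal sketch of that separation in plain Python, mirroring the Gymnasium convention that `reset` returns `(observation, info)` and `step` returns `(observation, reward, terminated, truncated, info)`. The `ToyRiskEnv` class, its page names, and the scripted policies are hypothetical illustrations and are not part of RiskWebWorld.

```python
from typing import Callable, Dict, Optional, Tuple

Observation = Dict[str, str]
StepResult = Tuple[Observation, float, bool, bool, dict]


class ToyRiskEnv:
    """Hypothetical stand-in environment with a Gymnasium-style interface."""

    PAGES = ["search", "merchant_profile", "report"]  # illustrative page flow

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self._page = 0
        self._steps = 0

    def reset(self, seed: Optional[int] = None) -> Tuple[Observation, dict]:
        # seed is accepted for interface parity; this toy env is deterministic.
        self._page = 0
        self._steps = 0
        return {"page": self.PAGES[self._page]}, {}

    def step(self, action: str) -> StepResult:
        self._steps += 1
        # Only the "next" action makes progress; anything else is a no-op.
        if action == "next" and self._page < len(self.PAGES) - 1:
            self._page += 1
        terminated = self._page == len(self.PAGES) - 1  # task finished
        truncated = not terminated and self._steps >= self.max_steps  # budget hit
        reward = 1.0 if terminated else 0.0
        return {"page": self.PAGES[self._page]}, reward, terminated, truncated, {}


def run_episode(env: ToyRiskEnv, policy: Callable[[Observation], str]) -> float:
    """Roll out one episode; the policy only ever sees observations."""
    obs, _ = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total


def always_next(obs: Observation) -> str:
    """Hypothetical scripted policy that always advances."""
    return "next"


def never_moves(obs: Observation) -> str:
    """Hypothetical policy that exhausts the step budget."""
    return "wait"


print(run_episode(ToyRiskEnv(), always_next))
print(run_episode(ToyRiskEnv(), never_moves))
```

Because the rollout loop depends only on `reset` and `step`, the same loop can drive a scripted policy, a model-backed planner, or an RL learner without changes to the environment.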

78. ❌ C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions

Authors: Kenji Kubo, Shunsuke Kamiya, Masanori Koyama, Kohei Hayashi, Yusuke Iwasawa, Yutaka Matsuo Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13521v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

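Scoring rationale: The paper studies test-time scaling strategies for neural network models with latent recurrent processing (such as HRM and AKOrN) and proposes a confidence-based voting method (C-voting). Its core is reasoning models and test-time performance enhancement, which is highly relevant to ‘Chain of Thought’ and ‘System 2 Thinking’ (8 points each), because the models reason more deeply by increasing the number of recurrent steps to solve complex tasks such as Sudoku and Maze. However, the paper does not involve large language models (LLMs), pre-training, fine-tuning, alignment, RAG, agents, quantization, or similar techniques, so all other keywords score 0.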

!!! tip deepseek-chat TL;DR

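The paper proposes confidence-based voting (C-voting), a strategy for recurrent models with multiple latent candidate trajectories. By selecting the candidate trajectory that maximizes prediction confidence, it outperforms the energy-based voting strategy on reasoning tasks such as Sudoku and Maze.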

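Abstract translation

Neural network models with latent recurrent processing, that is, models that recursively apply the same layers to a latent state, have attracted wide attention as promising models for reasoning tasks. One advantage of such models is that they support test-time scaling, allowing performance to improve at test time without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can reason more deeply by increasing the number of recurrent steps, and can thereby complete challenging tasks including Sudoku, maze solving, and AGI benchmarks. This paper proposes confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. The method initializes multiple latent state candidates from random variables and selects the candidate that maximizes the average top-1 probability of the predictions, which reflects the model’s confidence. Compared with the energy-based voting strategy, which is specific to models with explicit energy functions, C-voting achieves 4.9% higher accuracy on Sudoku-hard. A key advantage of C-voting is its generality: it can be applied to recurrent models that lack an explicit energy function. Finally, we introduce ItrSA++, a simple attention-based recurrent model with randomized initial values, and show that, combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%).

Abstract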

Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model’s confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.

Keywords: recurrent models, test-time scaling, confidence-based voting, reasoning tasks, latent trajectories, Sudoku, Maze solving, attention-based model
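
The selection rule described in the abstract, choosing the candidate whose predictions have the highest average top-1 probability, can be sketched as follows. This is a minimal illustration in plain Python: the nested-list input format, the function names, and the toy numbers are assumptions for the example, and the recurrent model that would produce these probabilities is not shown.

```python
from typing import List, Sequence

# One candidate = per-position probability distributions over classes,
# for example one distribution per Sudoku cell (hypothetical format).
Candidate = Sequence[Sequence[float]]


def confidence(candidate: Candidate) -> float:
    """Average of the top-1 (maximum) probability over all positions."""
    return sum(max(dist) for dist in candidate) / len(candidate)


def c_vote(candidates: List[Candidate]) -> int:
    """Return the index of the candidate with the highest confidence."""
    scores = [confidence(c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


def decode(candidate: Candidate) -> List[int]:
    """Arg-max class for each position of the chosen candidate."""
    return [max(range(len(dist)), key=lambda k: dist[k]) for dist in candidate]


# Toy example: three candidate trajectories, two positions, three classes.
candidates = [
    [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]],    # mean top-1 = 0.45
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],  # mean top-1 = 0.85
    [[0.6, 0.2, 0.2], [0.3, 0.3, 0.4]],    # mean top-1 = 0.50
]

best = c_vote(candidates)
print(best, decode(candidates[best]))
```

The rule needs only the model’s output probabilities, which is consistent with the abstract’s point that no explicit energy function is required.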

79. ❌ From Alignment to Prediction: A Study of Self-Supervised Learning and Predictive Representation Learning

Authors: Mintu Dutta, Ritesh Vyas, Mohendra Roy Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13518v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

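Scoring rationale: The paper focuses on Predictive Representation Learning (PRL) within self-supervised learning (SSL) and studies methods such as BYOL, MAE, and I-JEPA; it is a contribution to deep learning principles. However, all scoring keywords explicitly target techniques related to large language models (LLMs), such as fine-tuning, alignment, reasoning, and deployment optimization, or AI applications in specific scientific fields. The paper does not involve LLM-specific topics such as LLMs, MoE, quantization, inference acceleration, or hallucination mitigation, nor scientific applications such as bioinformatics, so the relevance of every keyword is 0.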

!!! tip deepseek-chat TL;DR

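The paper studies the predictive representation learning paradigm in self-supervised learning and proposes a PRL taxonomy with I-JEPA as an example. By comparing BYOL, MAE, and I-JEPA, it finds that I-JEPA performs comparatively well at balancing accuracy and robustness.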

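Abstract translation

Self-supervised learning has become a major technique for learning from unlabeled data, and current methods mostly revolve around representation alignment and input reconstruction. Although such methods perform well in practice, their scope remains largely confined to learning from observed data and offers little support for learning structures that are predictive of the data distribution. This paper studies recent developments in self-supervised learning and defines a new category called Predictive Representation Learning, whose core is the latent prediction of unobserved components of the data from the observed data. We propose a unified taxonomy that classifies predictive representation learning alongside alignment-based and reconstruction-based approaches. We further argue that the Joint-Embedding Predictive Architecture can be regarded as a representative example of this new paradigm. The paper also discusses theoretical perspectives and open challenges, highlighting predictive representation learning as an important direction for future self-supervised learning research. For comparative analysis, we implement Bootstrap Your Own Latent, Masked Autoencoders, and the Image Joint-Embedding Predictive Architecture. The results show that Masked Autoencoders achieve a perfect similarity of 1.00 but comparatively weak robustness of only 0.55, whereas Bootstrap Your Own Latent and the Image Joint-Embedding Predictive Architecture reach accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.

Abstract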

Self-supervised learning has emerged as a major technique for the task of learning from unlabeled data, where the current methods mostly revolve around alignment of representations and input reconstruction. Although such approaches have demonstrated excellent performance in practice, their scope remains mostly confined to learning from observed data and does not provide much help in terms of a learning structure that is predictive of the data distribution. In this paper, we study some of the recent developments in the realm of self-supervised learning. We define a new category called Predictive Representation Learning (PRL), which revolves around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL along with alignment and reconstruction-based learning approaches. Furthermore, we argue that Joint-Embedding Predictive Architecture (JEPA) can be considered as an exemplary member of this new paradigm. We further discuss theoretical perspectives and open challenges, highlighting predictive representation learning as a promising direction for future self-supervised learning research. In this study, we implemented Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for comparative analysis. The results indicate that MAE achieves perfect similarity of 1.00, but exhibits relatively weak robustness of 0.55. In contrast, BYOL and I-JEPA attain accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.

Keywords: Self-Supervised Learning, Predictive Representation Learning, Joint-Embedding Predictive Architecture, BYOL, Masked Autoencoders, I-JEPA, Representation Alignment, Data Reconstruction

80. ❌ Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

Authors: Jing Sun Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13517v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

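Scoring rationale: The paper focuses on temporal credit assignment in reinforcement learning and proposes a Target Decoupling architecture for PPO to address the algorithmic pathologies caused by fusing multi-timescale signals. Although the paper involves deep learning (PPO is a deep reinforcement learning algorithm), all keywords explicitly point to specific areas such as large language model (LLM) techniques, training methods, inference optimization, alignment, and agent systems. The paper does not touch on LLM-related topics such as language models, pre-training, fine-tuning, alignment, inference acceleration, hallucination mitigation, or model compression, nor on AI-for-science applications. All keywords therefore score 0.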

!!! tip deepseek-chat TL;DR

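Targeting surrogate objective hacking and myopic degeneration in multi-timescale PPO for reinforcement learning, the paper proposes a Target Decoupling architecture: the Critic retains multi-timescale predictions to strengthen representation learning, while the Actor strictly isolates short-term signals and updates the policy using only long-term advantages. The architecture achieves statistically significant performance improvements in the LunarLander-v2 environment.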

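Abstract translation

Temporal credit assignment has long been a central challenge in reinforcement learning. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent work has tried to introduce multiple discount factors into Actor-Critic architectures such as Proximal Policy Optimization (PPO) to balance short-term responses with long-term planning. However, this paper shows that blindly fusing multi-timescale signals in complex delayed-reward tasks leads to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients causes the surrogate objective to be exploited, while gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we call the Paradox of Temporal Uncertainty. To address these problems, we propose a **Target Decoupling** architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning; on the Actor side, we strictly isolate short-term signals and update the policy based only on long-term advantages. Rigorous empirical evaluation across multiple independent random seeds in the LunarLander-v2 environment shows that the proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter tuning tricks, it consistently exceeds the “Environment Solved” threshold with very small variance, avoids policy collapse entirely, and escapes the local-optimum stagnation that traps single-timescale baselines.

Abstract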

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the “Environment Solved” threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.

Keywords: reinforcement learning, temporal credit assignment, multi-timescale PPO, surrogate objective hacking, Target Decoupling, representation learning, policy gradient, LunarLander-v2
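
To make the decoupling concrete, the sketch below computes discounted returns for several discount factors on a toy delayed-reward episode. All timescales are kept as critic regression targets, while the actor’s advantage uses only the longest timescale. The discount values, the zero value baseline, and the function names are illustrative assumptions; the abstract does not describe the paper’s actual networks, advantage estimator, or losses.

```python
from typing import Dict, List


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def critic_targets(rewards: List[float], gammas: List[float]) -> Dict[float, List[float]]:
    """Critic side: one return target per timescale (auxiliary prediction heads)."""
    return {g: discounted_returns(rewards, g) for g in gammas}


def actor_advantages(
    targets: Dict[float, List[float]], values_long: List[float]
) -> List[float]:
    """Actor side: advantage from the longest timescale only; shorter ones are ignored."""
    long_gamma = max(targets)  # largest discount factor = longest horizon
    return [g_t - v_t for g_t, v_t in zip(targets[long_gamma], values_long)]


# Toy delayed-reward episode: the only reward arrives at the final step.
rewards = [0.0, 0.0, 0.0, 1.0]
gammas = [0.5, 0.9, 0.99]  # hypothetical short, medium, and long timescales

targets = critic_targets(rewards, gammas)
advantages = actor_advantages(targets, values_long=[0.0] * len(rewards))

for g in gammas:
    print(g, [round(x, 4) for x in targets[g]])
print("actor advantages:", [round(a, 4) for a in advantages])
```

In this toy episode the short-horizon target at the first step is 0.125, against roughly 0.97 for the long horizon, which illustrates how a short timescale barely registers a delayed reward.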

81. ❌ SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

Authors: Xiaole Su, Kasey Zhang, Andy Lyu Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13515v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	15.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

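Scoring rationale: The paper’s core is the study of SFT-GRPO data overlap as a post-training hyperparameter, which is highly relevant to ‘Post-training OR Supervised Fine-tuning OR SFT’ (15 points) because SFT is a core method. It uses the Qwen3-8B model, which relates to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). It is applied to Lean 4 autoformalization, an AI application in a scientific domain, which relates to ‘AI for Science OR Bioinformatics OR Cheminformatics’ (10 points). Other keywords such as MoE, Scaling Laws, and RLHF are not involved and score 0.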

!!! tip deepseek-chat TL;DR

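The paper studies the effect of data overlap as a hyperparameter in SFT-GRPO post-training. It finds that keeping SFT and GRPO data disjoint (0% overlap) consistently outperforms full overlap at zero additional compute cost, and that on Gaokao-Formal GRPO yields a 10.4 percentage point semantic gain over SFT alone.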

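Abstract translation

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We run a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ only in training recipe: the base model, SFT only, GRPO only, and three SFT+GRPO configurations in which 0%, 30%, or 100% of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data fully disjoint consistently outperforms full overlap at no additional compute cost. Evaluating on Gaokao-Formal and PutnamBench with compile pass at k and LLM-judged semantic pass at k, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0% overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, whereas at 100% overlap both metrics show no notable change, making the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals a compile-semantic gap of more than 30 percentage points for the models with the highest compile rates, a disparity that compile-only benchmarking cannot expose. To our knowledge, this is the first controlled study of SFT-GRPO data overlap as a post-training hyperparameter, and it shows how model behavior varies with the degree of data sharing between training stages.

Abstract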

Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass at k and semantic pass at k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.

Keywords: Supervised Fine-tuning, SFT, Group Relative Policy Optimization, GRPO, data overlap, post-training, autoformalization, Lean 4
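
One way to realize the 0%, 30%, and 100% overlap conditions is to draw a fixed-size GRPO prompt set partly from the SFT corpus and partly from a disjoint pool. The sketch below illustrates that construction. The function names, pool sizes, and sampling scheme are assumptions for illustration, since the abstract does not describe how the paper built its splits.

```python
import random
from typing import List


def build_grpo_prompts(
    sft_prompts: List[str],
    fresh_prompts: List[str],
    n_prompts: int,
    overlap: float,
    seed: int = 0,
) -> List[str]:
    """Sample n_prompts GRPO prompts, a fraction `overlap` of which come from the SFT corpus."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    n_shared = round(n_prompts * overlap)
    n_fresh = n_prompts - n_shared
    if n_shared > len(sft_prompts) or n_fresh > len(fresh_prompts):
        raise ValueError("not enough prompts in one of the pools")
    rng = random.Random(seed)  # seeded so the split is reproducible
    chosen = rng.sample(sft_prompts, n_shared) + rng.sample(fresh_prompts, n_fresh)
    rng.shuffle(chosen)
    return chosen


def measured_overlap(grpo_prompts: List[str], sft_prompts: List[str]) -> float:
    """Fraction of GRPO prompts that also appear in the SFT corpus."""
    sft_set = set(sft_prompts)
    return sum(p in sft_set for p in grpo_prompts) / len(grpo_prompts)


# Hypothetical placeholder prompts; the two pools are disjoint by construction.
sft_pool = [f"sft-{i}" for i in range(100)]
fresh_pool = [f"fresh-{i}" for i in range(100)]

for overlap in (0.0, 0.3, 1.0):
    prompts = build_grpo_prompts(sft_pool, fresh_pool, n_prompts=50, overlap=overlap)
    print(overlap, measured_overlap(prompts, sft_pool))
```

Holding `n_prompts` fixed across conditions keeps the GRPO budget identical, so only the degree of overlap changes between runs.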

82. ❌ Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning

Authors: Shentong Mo Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13504v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

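Scoring rationale: The paper proposes the CoUR framework, which integrates large language models (LLMs) into reinforcement learning for reward function design and evaluation, so it is highly relevant to ‘Large Language Models OR LLMs OR Foundation Models’ (10 points). The paper does not involve the specific techniques or applications of the other keywords, such as MoE, SLMs, Scaling Laws, the various training methods, reasoning techniques, agent systems, compression and acceleration, interpretability, or AI for science, so those keywords score 0.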

!!! tip deepseek-chat TL;DR

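The paper proposes a new framework called CoUR that uses large language models to optimize reward function design and evaluation in reinforcement learning, achieving better performance on multiple benchmarks while significantly reducing evaluation cost.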

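Abstract translation

Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a time-consuming and labor-intensive process because of the inefficiencies and inconsistencies inherent in traditional methods. Existing approaches typically rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to simplify and optimize reward function design and evaluation in RL environments. Specifically, CoUR introduces a code uncertainty quantification mechanism combined with a similarity selection method that uses textual and semantic analysis to identify and reuse the most relevant reward function components. By reducing redundant evaluations and applying Bayesian optimization to decoupled reward terms, CoUR searches for optimal reward feedback more efficiently and robustly. We comprehensively evaluate CoUR on nine original environments from IsaacGym and on all 20 tasks of the Bidexterous Manipulation benchmark. The experimental results show that CoUR not only achieves better performance but also significantly reduces the cost of reward evaluation.

Abstract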

Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, our CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We comprehensively evaluate CoUR across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.

Keywords: Large Language Models, Reinforcement Learning, Reward Function Design, Uncertainty Quantification, Bayesian Optimization, Code Similarity, Robotics Environments, Evaluation Efficiency

83. ❌ Towards Scalable Lightweight GUI Agents via Multi-role Orchestration

Authors: Ziwei Wang, Junjie Zheng, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Zhouhua Fang, Zhiwei Liu, Dajun Chen, Yong Li, Jiajun Bu Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13488v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

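Scoring rationale: The paper focuses on developing GUI agents driven by lightweight multimodal large language models (MLLMs); its core contributions are the LAMO framework and the LAMO-3B agent. Highly relevant keywords are: 1) ‘Large Language Models’ (MLLMs are the foundation), 2) ‘Small Language Models’ (a lightweight 3B-parameter model is developed for on-device deployment), 3) ‘Supervised Fine-tuning’ (SFT is used for knowledge distillation and visual perception enhancement), 4) ‘LLM Agents’ (an autonomous GUI agent is built), and 5) ‘Multi-agent Systems’ (MAS-style orchestration and coordination are supported). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not mentioned in the abstract or are unrelated to the paper’s core content.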

!!! tip deepseek-chat TL;DR

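The paper addresses the cost-scalability dilemma that lightweight multimodal large language models face when deploying GUI agents on resource-constrained devices. It proposes the LAMO framework, which uses role-oriented data synthesis and two-stage training (supervised fine-tuning and reinforcement learning) to develop LAMO-3B, an agent that supports monolithic execution and multi-agent system orchestration for efficient GUI automation.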

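Abstract translation

Autonomous graphical user interface agents driven by multimodal large language models enable digital automation on end-user devices. Although scaling parameters and data has brought significant performance gains, advanced methods still face high deployment costs on resource-constrained devices. In complex real-world scenarios, lightweight GUI agents under the end-to-end episodic learning paradigm are limited by insufficient model capacity and poor task scalability, which makes it hard for them to adapt to multi-agent system architectures, while training multiple skill-specific expert models remains costly. Can we strike an effective balance in this cost-scalability dilemma so that lightweight multimodal large language models can take part in practical GUI workflows? To address these challenges, we propose the LAMO framework, which gives a lightweight multimodal large language model GUI-specific knowledge and task scalability, and expands its capability boundary in GUI automation through multi-role orchestration. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement; and (ii) reinforcement learning for role-oriented cooperative exploration. Building on LAMO, we develop LAMO-3B, a task-scalable native GUI agent that supports monolithic execution and multi-agent-system-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B continues to benefit from planner advances and reaches a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.

Abstract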

Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role-oriented cooperative exploration. With LAMO, we develop a task-scalable native GUI agent, LAMO-3B, supporting monolithic execution and MAS-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.

Keywords: GUI agents, Multimodal Large Language Models, lightweight models, multi-agent systems, supervised fine-tuning, reinforcement learning, task scalability, end-user devices

84. ❌ Monthly Diffusion v0.9: A Latent Diffusion Model for the First AI-MIP

Authors: Kyle J. C. Hall, Maria J. Molina Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13481v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

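Scoring rationale: The paper studies the climate emulator MDv0.9, which uses a spherical Fourier neural operator (SFNO)-based conditional variational auto-encoder (CVAE) architecture and a latent diffusion model to simulate low-frequency internal atmospheric variability. All keywords concern large language models (LLMs) or the technical principles of deep learning, whereas this paper focuses on a specific AI application in climate science and does not touch LLM-related techniques such as LLMs, MoE, SLMs, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context windows, attention optimization, reasoning methods, agent systems, tool use, multi-agent systems, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (climate science) but is not bioinformatics or cheminformatics; it is therefore given 5 points (some connection).

!!! tip deepseek-chat TL;DR

The paper develops Monthly Diffusion v0.9 (MDv0.9), a latent diffusion model built on a spherical Fourier neural operator-based conditional variational auto-encoder architecture, to simulate low-frequency internal atmospheric variability at monthly timescales when data are sparse and compute is limited. It describes the architecture design, the training procedure, and initial results.

Abstract translation (Chinese)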

本文介绍了一种基于球面傅里叶神经算子（SFNO）启发的条件变分自编码器（CVAE）架构的气候模拟器——月尺度1.5度网格扩散模型（MD-1.5 version 0.9），该模型利用潜在扩散方法模拟低频内部大气变率的演变过程。MDv0.9的设计目标是在数据稀疏条件下，以月平均时间步长进行前向模拟，同时保持适中的计算资源需求。本文阐述了该架构的设计动机、MDv0.9的训练流程以及初步实验结果。

Abstract

Here, we describe Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9), a climate emulator that leverages a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture to model the evolution of low-frequency internal atmospheric variability using latent diffusion. MDv0.9 was designed to forward-step at monthly mean timesteps in a data-sparse regime, using modest computational requirements. This work describes the motivation behind the architecture design, the MDv0.9 training procedure, and initial results.

Keywords: climate emulator, latent diffusion model, spherical Fourier neural operator, conditional variational auto-encoder, atmospheric variability, monthly timesteps, data-sparse regime, AI-MIP

85. ❌ Secure and Privacy-Preserving Vertical Federated Learning

Authors: Shan Jin, Sai Rahul Rachuri, Yizhen Wang, Anderson C. A. Nascimento, Yiwei Cai Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13474v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

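Scoring rationale: The paper focuses on privacy-preserving techniques in federated learning (FL), specifically the vertical federated learning setting, and proposes a framework and protocols based on secure multiparty computation (MPC) and differential privacy (DP). All scoring keywords relate directly to large models, the technical principles of deep learning, or AI-for-science applications, whereas this paper studies privacy and security mechanisms for federated learning, which belongs to the security of distributed machine learning and has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

The paper proposes a secure, privacy-preserving framework for vertical federated learning that uses distributed aggregation servers and optimized protocols to protect input and output privacy while substantially reducing computation and communication overhead.

Abstract translation (Chinese)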

我们提出了一种新颖的端到端隐私保护框架，通过三种适用于不同部署场景的高效协议实例化，涵盖联邦学习（FL）中垂直数据划分场景下的输入与输出隐私保护。在该场景中，特征数据分散于各客户端且标签并非所有参与方共享。为实现这一目标，我们将联邦学习中聚合器的角色分配给多个服务器，使其运行安全多方计算（MPC）协议以执行模型与特征聚合，并对最终发布的模型施加差分隐私（DP）保护。若采用简单方案，客户端需将全部训练过程委托给服务器间通过MPC执行；而我们的优化方案在支持纯全局更新及兼顾隐私保护的全局-局部模型更新的同时，大幅降低了多方计算所需的计算与通信开销。实验结果也验证了我们协议的有效性。

Abstract

We propose a novel end-to-end privacy-preserving framework, instantiated by three efficient protocols for different deployment scenarios, covering both input and output privacy, for the vertically split scenario in federated learning (FL), where features are split across clients and labels are not shared by all parties. We do so by distributing the role of the aggregator in FL into multiple servers and having them run secure multiparty computation (MPC) protocols to perform model and feature aggregation and apply differential privacy (DP) to the final released model. While a naive solution would have the clients delegating the entirety of training to run in MPC between the servers, our optimized solution, which supports purely global and also global-local models updates with privacy-preserving, drastically reduces the amount of computation and communication performed using multiparty computation. The experimental results also show the effectiveness of our protocols.
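The abstract does not spell out the three protocols, so the following is only a minimal sketch of the general idea it names: clients split their values into additive shares, several servers aggregate the shares without any single server seeing an input, and noise is added to the released result. The modulus, fixed-point scale, function names, and the use of Gaussian noise are illustrative assumptions, not details taken from the paper.

```python
import random
import secrets

MODULUS = 2**61 - 1   # prime modulus for share arithmetic (illustrative choice)
SCALE = 10**6         # fixed-point scale for encoding real values (illustrative choice)


def encode(x):
    """Map a real value to a fixed-point residue modulo MODULUS."""
    return round(x * SCALE) % MODULUS


def decode(v):
    """Invert encode(), treating residues above MODULUS // 2 as negative values."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def share(value, n_servers):
    """Split an encoded value into n_servers additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(client_values, n_servers=3):
    """Each client sends one share per server; servers add locally, then totals are combined."""
    server_totals = [0] * n_servers
    for x in client_values:
        for i, s in enumerate(share(encode(x), n_servers)):
            server_totals[i] = (server_totals[i] + s) % MODULUS
    return decode(sum(server_totals) % MODULUS)


def dp_release(value, sigma, rng):
    """Add Gaussian noise before release; sigma must be calibrated to the
    sensitivity and privacy budget, which this sketch does not do."""
    return value + rng.gauss(0.0, sigma)
```

For example, `secure_sum([0.5, -1.25, 2.0])` reconstructs the plain sum of the inputs even though each server only ever holds random-looking shares. A real deployment would also need a cryptographically sound noise source and authenticated channels, which are out of scope here.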

Keywords: Vertical Federated Learning, Privacy-Preserving, Secure Multiparty Computation, Differential Privacy, Model Aggregation, Feature Aggregation, End-to-End Framework, Optimized Protocols

86. ❌ Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment

Authors: Eileen Kapel, Jan Lennartz, Luis Cruz, Diomidis Spinellis, Arie van Deursen Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13462v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

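Scoring rationale: The paper studies incident risk prediction in IT change management, using traditional machine learning models (HGBC, LightGBM, XGBoost) and SHAP for explainability analysis. It does not involve the core techniques behind keywords such as large models, deep learning, or AI for Science. It has some connection only to 'Mechanistic Interpretability OR Explainable AI' (SHAP is used to provide feature-level explanations); all other keywords are unrelated.

!!! tip deepseek-chat TL;DR

The study proposes a method for predicting change-induced incident risk in a regulated IT environment. Comparing machine learning models with a rule-based approach, it finds that LightGBM performs best once aggregated team metrics are added, yielding data-driven, explainable risk scores that meet compliance needs and improve the reliability of IT operations.

Abstract translation (Chinese)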

有效的IT变更管理对于依赖软件与服务的企业至关重要，尤其在金融等受严格监管的行业，其运营可靠性、可审计性与可解释性不可或缺。相当比例的IT事件由变更引发，因此在部署前识别高风险变更尤为重要。本研究提出了一种应用于一家大型国际银行的事件风险预测性评分方法。该方法通过在变更部署的评估与规划阶段预测可能引发事件的潜在风险，为工程师提供支持。为满足监管要求，我们在构建模型时充分考虑了可审计性与可解释性，应用SHAP值提供特征层面的洞察，确保决策可追溯且透明。基于一年的真实数据集，我们将现有的基于规则的流程与三种机器学习模型（HGBC、LightGBM和XGBoost）进行了比较。LightGBM表现出最佳性能，尤其是在加入能够捕捉组织背景的聚合团队指标后。我们的结果表明，数据驱动且可解释的模型在满足合规需求的同时，能够超越基于规则的方法，从而实现主动的风险缓解与更可靠的IT运营。

Abstract

Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.

Keywords: IT change management, incident risk prediction, machine learning, interpretability, SHAP, LightGBM, regulatory compliance, risk mitigation

87. ❌ Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus

Authors: Zijian Zhao, Jing Gao, Sen Li Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13472v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

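Scoring rationale: The paper focuses on multi-agent reinforcement learning (MARL) and proposes CMAT, a centralized Transformer framework for coordination in multi-agent cooperation. Its core is MARL algorithm design: a Transformer architecture processes the joint observations, and a latent consensus mechanism enables order-independent joint decision making. The content is entirely unrelated to the vast majority of keywords (large-model technical principles, training methods, inference optimization, alignment techniques, and so on) and is highly relevant only to "Multi-agent Systems OR Agent Coordination" (10 points), since that is the paper's core research topic. The paper involves no innovation in large language models or deep learning technical principles and does not belong to biomedical AI applications, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Consensus Multi-Agent Transformer (CMAT), a Transformer-based multi-agent reinforcement learning framework that uses a latent consensus mechanism to remove order dependence in multi-agent cooperation. It outperforms existing methods on several benchmark tasks.

Abstract translation (Chinese)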

协作多智能体强化学习（MARL）通过将集中式控制问题分解为多个交互智能体，被广泛用于处理庞大的联合观测与动作空间。然而，这种分解常引入额外挑战，包括非平稳性、训练不稳定、协调性弱以及理论保证有限。本文提出共识多智能体Transformer（CMAT），这是一个将协作MARL与分层单智能体强化学习（SARL）框架相衔接的集中式架构。CMAT将所有智能体视为统一整体，并采用Transformer编码器处理庞大的联合观测空间。为应对广阔的联合动作空间，我们引入了一种分层决策机制：其中Transformer解码器以自回归方式生成高层共识向量，模拟智能体在潜在空间中对策略达成一致的过程。基于此共识，所有智能体同步生成动作，实现了与顺序无关的联合决策，避免了传统多智能体Transformer（MAT）中对动作生成顺序的敏感性。这种分解方式使得联合策略能够使用单智能体PPO进行优化，同时通过潜在共识保持丰富的协调能力。为评估所提方法，我们在星际争霸II、多智能体MuJoCo以及谷歌研究足球等基准任务上进行了实验。结果表明，CMAT在性能上优于近期的集中式解决方案、顺序MARL方法以及传统MARL基线。本文代码发布于：https://github.com/RS2002/CMAT。

Abstract

Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at: https://github.com/RS2002/CMAT.

Keywords: Multi-Agent Reinforcement Learning, Transformer, Consensus Mechanism, Centralized Training, Order-Independent Decision Making, Latent Space Coordination, Hierarchical Decision Making, Cooperative Agents

88. ❌ From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning

Authors: Zonghuan Xu, Xingjun Ma Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13460v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

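Scoring rationale: The paper studies forgetting in continual learning through theoretical analysis, focusing on the properties of the task distribution in linear regression models. All scoring keywords relate to large models, the technical principles of deep learning, or specific applications, whereas this paper belongs to machine learning theory and involves no large-model techniques, deep learning innovations, or scientific applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

Taking the perspective of the task distribution rather than task order, the paper theoretically analyzes the spectral structure of forgetting in continual learning, establishes an exact operator identity for the forgetting quantity, and shows how geometric properties of the task distribution determine the forgetting rate.

Abstract translation (Chinese)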

持续学习中的一个核心挑战是遗忘，即因顺序适应新任务而导致先前已习得任务性能下降的现象。尽管遗忘已在实证研究中得到广泛探讨，但严格的理论刻画仍较为有限。该领域的一个重要进展是 Evron 等人（2022）的工作，其分析了在过参数化线性回归中，固定任务集合在随机排序下的遗忘行为。我们将视角从顺序转向分布：不同于探究固定任务集合在随机排序下的表现，我们研究一个精确拟合的线性机制，其中任务从任务分布 $\Pi$ 中独立同分布采样，并探究生成分布本身如何支配遗忘过程。在此设定下，我们推导出遗忘量的精确算子恒等式，揭示了一种递归的谱结构。基于此恒等式，我们建立了一个无条件上界，确定了主导渐近项，并在一般非退化情形下，以常数精度刻画了收敛速率。我们进一步将此速率与任务分布的几何特性相关联，阐明了在该模型中驱动遗忘快慢的关键因素。

Abstract

A central challenge in continual learning is forgetting, the loss of performance on previously learned tasks induced by sequential adaptation to new ones. While forgetting has been extensively studied empirically, rigorous theoretical characterizations remain limited. A notable step in this direction is Evron et al. (2022), which analyzes forgetting under random orderings of a fixed task collection in overparameterized linear regression. We shift the perspective from order to distribution. Rather than asking how a fixed task collection behaves under random orderings, we study an exact-fit linear regime in which tasks are sampled i.i.d. from a task distribution $\Pi$, and ask how the generating distribution itself governs forgetting. In this setting, we derive an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Building on this identity, we establish an unconditional upper bound, identify the leading asymptotic term, and, in generic nondegenerate cases, characterize the convergence rate up to constants. We further relate this rate to geometric properties of the task distribution, clarifying what drives slow or fast forgetting in this model.
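To make the exact-fit setting concrete, here is a small simulation sketch. It assumes single-sample tasks with Gaussian inputs and a shared ground-truth vector, and it measures forgetting as the mean squared error over all tasks seen so far; these choices, and every name in the code, are illustrative assumptions rather than the paper's definitions.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_task(w, x, y):
    """Minimum-change update that makes w fit the single-sample task (x, y) exactly."""
    step = (y - dot(x, w)) / dot(x, x)
    return [wi + step * xi for wi, xi in zip(w, x)]


def simulate_forgetting(dim=20, n_tasks=15, seed=0):
    """Draw tasks i.i.d., fit each one exactly in turn, and track the mean
    squared error on all tasks seen so far after every update."""
    rng = random.Random(seed)
    w_star = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # shared ground truth
    w = [0.0] * dim
    seen, history = [], []
    for _ in range(n_tasks):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # one draw from the task distribution
        y = dot(x, w_star)                              # realizable label
        w = project_onto_task(w, x, y)
        seen.append((x, y))
        history.append(sum((dot(xs, w) - ys) ** 2 for xs, ys in seen) / len(seen))
    return history
```

Calling `simulate_forgetting()` returns one value per training step. The first entry is zero up to rounding because only one task has been seen and it is fit exactly; later entries reflect how much fitting a new task disturbs the earlier ones.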

Keywords: continual learning, forgetting, task distribution, spectral characterization, linear regression, theoretical analysis, operator identity, convergence rate

89. ❌ Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps

Authors: Mohammed Ezzaldin Babiker Abdullah Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13459v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

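Scoring rationale: The paper focuses on remaining useful life (RUL) prediction for industrial predictive maintenance using a hybrid CNN-BiLSTM-Attention architecture, and is entirely unrelated to most large-model keywords (such as LLMs, MoE, and RLHF). It has some connection only to 'Mechanistic Interpretability OR Explainable AI' (5 points), because the paper uses attention-weight heatmaps to provide interpretability, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because it applies AI to industrial science (aero engines) but is not core bioinformatics or cheminformatics. All other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study proposes a hybrid CNN-BiLSTM-Attention model for predicting the remaining useful life of aero engines. An asymmetric loss function and attention heatmaps make the industrial prognostics safety-aware and interpretable, and the model achieves competitive performance on the NASA dataset.

Abstract translation (Chinese)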

在持续运行应力作用下，涡轮风扇发动机的性能退化需要能够准确估计关键部件剩余使用寿命（Remaining Useful Life, RUL）的鲁棒性预测系统。现有的深度学习方法往往难以同时捕捉多传感器空间相关性与长程时间依赖性，而标准的对称损失函数未能充分惩罚高估剩余寿命这一对安全至关重要的误差。本研究提出了一种混合架构，该架构集成了双阶段一维卷积神经网络（1D-CNN）、双向长短期记忆（BiLSTM）网络以及定制的Bahdanau加性注意力机制。模型在NASA商用模块化航空推进系统仿真（C-MAPSS）FD001子数据集上进行了训练与评估，采用了零泄漏预处理流程、上限为130个循环的分段线性RUL标签，以及NASA指定的非对称指数损失函数——该函数对高估误差施加不成比例的惩罚，以符合工业安全约束。在100台测试发动机上的实验实现了17.52个循环的均方根误差（RMSE）和922.06的NASA S分数。此外，提取的注意力权重热图为每台发动机的退化时间进程提供了可解释的洞察，支持基于信息的维护决策制定。所提出的框架相较于既有基线模型展现出具有竞争力的性能，并为工业环境中的安全、可解释预测提供了一种基于原理的方法。

Abstract

Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.
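The two evaluation ingredients named in the abstract can be sketched briefly. The code below assumes the commonly cited form of the C-MAPSS scoring function, with time constants 13 for early predictions and 10 for late ones, and a piecewise-linear label that stays at the cap until the true RUL drops below it. The function names are hypothetical, and the exact loss used to train the paper's model is not given in the abstract.

```python
import math


def capped_rul_labels(total_cycles, cap=130):
    """Piecewise-linear RUL targets for one run-to-failure engine:
    constant at `cap` early in life, then decreasing by one per cycle to 0."""
    return [min(cap, total_cycles - t) for t in range(1, total_cycles + 1)]


def nasa_score(y_true, y_pred):
    """Asymmetric exponential score: over-estimating RUL (d > 0) is
    penalized more steeply than under-estimating it (d < 0)."""
    score = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t
        if d < 0:
            score += math.exp(-d / 13.0) - 1.0
        else:
            score += math.exp(d / 10.0) - 1.0
    return score


def rmse(y_true, y_pred):
    """Root mean squared error, the symmetric metric reported alongside the score."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)
```

With these definitions, predicting 10 cycles too late costs more than predicting 10 cycles too early, which is the safety bias the abstract describes. RMSE treats both errors identically, which is why the two metrics are usually reported together.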

Keywords: Remaining Useful Life (RUL) prediction, CNN-BiLSTM-Attention hybrid model, asymmetric loss function, industrial prognostics, interpretable attention heatmaps, NASA C-MAPSS dataset, turbofan engine degradation, predictive maintenance

90. ❌ Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks

Authors: Mohammed Ezzaldin Babiker Abdullah, Rufaidah Abdallah Ibrahim Mohammed Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13455v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

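Scoring rationale: The paper focuses on solar irradiance forecasting and proposes a physics-informed hybrid CNN-BiLSTM framework that challenges the paradigm of complex Transformer-based models. Its core is an application of deep learning to renewable energy; it does not involve large language models (LLMs) or any of the specific techniques among the scoring keywords (such as MoE, RLHF, or RAG). The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper applies AI to science (meteorology, renewable energy) but is not core bioinformatics or cheminformatics; it is therefore given 5 points (some connection). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a physics-informed hybrid CNN-BiLSTM framework for global horizontal irradiance forecasting. Validated on NASA data for Sudan, it reaches an RMSE of 19.53 W/m², significantly better than complex attention-based baselines (RMSE 30.64 W/m²), revealing a 'complexity paradox': in high-noise meteorological tasks, physical constraints are more efficient and accurate than self-attention mechanisms.

Abstract translation (Chinese)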

精确的全球水平辐照度（GHI）预测对电网稳定性至关重要，在气溶胶快速波动的干旱地区尤为如此。尽管当前趋势倾向于计算成本高昂的基于Transformer的架构，本文对主流的“复杂度优先”范式提出了挑战。我们提出了一种轻量级的物理信息混合CNN-BiLSTM框架，该框架将领域知识置于架构深度之上。该模型结合了用于空间特征提取的卷积神经网络（CNN）与用于捕捉时间依赖性的双向长短期记忆网络（Bi-Directional LSTM）。与标准的数据驱动方法不同，我们的模型明确受到一个包含15个工程特征（如晴空指数和太阳天顶角）的特征向量的引导，而非仅仅依赖原始历史数据。通过贝叶斯优化对超参数进行严格调优，以确保全局最优性。利用苏丹地区的NASA POWER数据进行实验验证表明，我们的物理引导方法实现了19.53 W/m²的均方根误差（RMSE），显著优于复杂的基于注意力机制的基线模型（RMSE 30.64 W/m²）。这些结果证实了“复杂性悖论”：在高噪声气象任务中，显式的物理约束为自注意力机制提供了一种更高效、更准确的替代方案。本研究主张在实时可再生能源管理领域，向融合物理知识的混合人工智能范式转变。

Abstract

Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer-based architectures, this paper challenges the prevailing “complexity-first” paradigm. We propose a lightweight, Physics-Informed Hybrid CNN-BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi-Directional LSTM for capturing temporal dependencies. Unlike standard data-driven approaches, our model is explicitly guided by a vector of 15 engineered features, including Clear-Sky indices and Solar Zenith Angle, rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. Experimental validation using NASA POWER data in Sudan demonstrates that our physics-guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m², significantly outperforming complex attention-based baselines (RMSE 30.64 W/m²). These results confirm a “Complexity Paradox”: in high-noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self-attention mechanisms. The findings advocate for a shift towards hybrid, physics-aware AI for real-time renewable energy management.
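The abstract names two of the engineered features without defining them, so the sketch below uses textbook definitions as an assumption: the clear-sky index as the ratio of measured GHI to clear-sky GHI, and the solar zenith angle from latitude, day of year, and solar time using Cooper's declination approximation. The function names are hypothetical, and the paper's full 15-feature vector is not reproduced here.

```python
import math


def solar_declination_deg(day_of_year):
    """Cooper's approximation of the solar declination, in degrees."""
    return 23.45 * math.sin(math.radians(360.0 * (284 + day_of_year) / 365.0))


def cos_zenith(lat_deg, day_of_year, solar_hour):
    """Cosine of the solar zenith angle; negative values mean the sun is below the horizon."""
    lat = math.radians(lat_deg)
    decl = math.radians(solar_declination_deg(day_of_year))
    hour_angle = math.radians(15.0 * (solar_hour - 12.0))  # 15 degrees per hour from solar noon
    return (math.sin(lat) * math.sin(decl)
            + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))


def clear_sky_index(ghi, ghi_clear, eps=1e-6):
    """Ratio of measured GHI to clear-sky GHI; returns 0.0 when the clear-sky value is ~0 (night)."""
    if ghi_clear <= eps:
        return 0.0
    return ghi / ghi_clear
```

Features like these encode known solar geometry directly, so a forecasting model does not have to rediscover the daily and seasonal cycle from raw irradiance alone. The clear-sky GHI input would come from a separate clear-sky model or data product.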

Keywords: Solar Irradiance Forecasting, Physics-Guided Neural Networks, CNN-BiLSTM, Global Horizontal Irradiance (GHI), Complexity Paradox, Renewable Energy Management, NASA POWER data, Bayesian Optimization

91. ❌ A Study of Failure Modes in Two-Stage Human-Object Interaction Detection

Authors: Lemeng Wang, Qinqian Lei, Vidhi Bakshi, Daniel Yi, Yifan Liu, Jiacheng Hou, Asher Seng Hao, Zheda Mai, Wei-Lun Chao, Robby T. Tan, Bo Wang Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13448v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

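Scoring rationale: The paper analyzes the failure modes of human-object interaction detection models in computer vision, a traditional computer vision task, and does not involve large language models, innovations in deep learning technical principles, or applications of large models to different domains. All keywords relate to large models, the technical principles of deep learning, or AI for Science, and are entirely unrelated to this paper's visual detection research.

!!! tip deepseek-chat TL;DR

The paper studies the failure modes of two-stage human-object interaction detection models. By decomposing the task into multiple interpretable perspectives and analyzing model behavior under different scene configurations, it finds that high benchmark performance does not imply robust visual reasoning about human-object relationships.

Abstract translation (Chinese)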

人-物交互检测旨在识别图像中人与物体之间的互动关系。尽管近期研究在现有基准测试中提升了性能表现，但其评估主要集中于整体预测精度，对模型失效的根本原因揭示有限。现有模型尤其在涉及多人物及罕见交互组合的复杂场景中表现欠佳。本研究通过系统分析来深入理解两阶段人-物交互检测模型的失效模式——该架构构成当前多数检测方法的基础。区别于构建大规模基准测试，我们将人-物交互检测任务分解为多个可解释的维度，通过跨维度分析模型行为来研究不同类型的失效规律。我们基于现有人-物交互数据集，按人-物-交互配置（如多人交互、物体共享等场景）筛选并整理图像子集，通过分析模型在这些配置下的行为来考察各类失效模式。该设计使我们能够系统分析人-物交互模型在不同场景构成下的行为规律及其预测失效的深层原因。需要强调的是，较高的整体基准性能并不必然意味着模型具有稳健的人-物关系视觉推理能力。本研究期望能为人-物交互模型的局限性提供有价值的见解，并为该领域的未来研究方向提供实证观察。

Abstract

Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we instead decompose HOI detection into multiple interpretable perspectives and analyze model behavior across these dimensions to study different types of failure patterns. We curate a subset of images from an existing HOI dataset organized by human-object-interaction configurations (e.g., multi-person interactions and object sharing), and analyze model behavior under these configurations to examine different failure modes. This design allows us to analyze how these HOI models behave under different scene compositions and why their predictions fail. Importantly, high overall benchmark performance does not necessarily reflect robust visual reasoning about human-object relationships. We hope that this study can provide useful insights into the limitations of HOI models and offer observations for future research in this area.

Keywords: Human-object interaction detection, Failure modes, Two-stage models, Model behavior analysis, Visual reasoning, Multi-person interactions, Object sharing, Benchmark evaluation

92. ❌ MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Authors: Simin Huo, Ning Li Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13432v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

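Scoring rationale: The paper focuses on token compression and restoration methods (MaMe and MaRe) for Vision Transformers (ViTs), aiming to accelerate visual perception and synthesis tasks. All the given keywords concern large language models (LLMs) or applications of large models in science, whereas this paper studies efficiency optimization of vision models (ViTs). It belongs to computer vision and has no direct connection to keywords such as LLMs, MoE, alignment, reasoning, or agents. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

This paper proposes MaMe, a training-free token merging method based on matrix operations, and its inverse operation MaRe, to accelerate Vision Transformer models. It achieves significant speedups and performance gains on image classification, video understanding, and image synthesis tasks.

Abstract translation (Chinese)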

令牌压缩对于缓解视觉Transformer（ViT）中自注意力机制的二次复杂度至关重要，因为ViT通常涉及大量输入令牌。现有方法（如ToMe）依赖于GPU效率较低的操作（例如排序、分散写入），引入了限制其有效性的开销。我们提出MaMe，一种完全基于矩阵运算的免训练、可微分令牌合并方法，其GPU友好特性可加速ViT。此外，我们提出其逆操作MaRe用于令牌恢复，形成用于图像合成的MaMe+MaRe流程。当应用于预训练模型时，MaMe使ViT-B的吞吐量翻倍，同时仅导致2%的准确率下降。值得注意的是，使用MaMe对最后一层进行微调可使ViT-B在1.1倍速度下准确率提升1.0%。在SigLIP2-B@512的零样本分类任务中，MaMe实现了1.3倍加速且性能下降可忽略不计。在视频任务中，MaMe在Kinetics-400数据集上将VideoMAE-L加速48.5%，仅损失0.84%的准确率。此外，MaMe在某些任务上实现了性能与速度的同时提升。在图像合成中，MaMe+MaRe流程在提升生成质量的同时，将Stable Diffusion v2.1的生成延迟降低31%。总体而言，这些结果证明了MaMe和MaRe在加速视觉模型方面的有效性。代码发布于https://github.com/cominder/mame。

Abstract

Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe’s and MaRe’s effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mame.

Keywords: token compression, Vision Transformers, self-attention, matrix operations, GPU acceleration, token restoration, image synthesis, efficient inference
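
The abstract describes token merging and restoration built entirely from matrix operations, without sorting or scattered writes. The following is a minimal sketch of that general idea, not the authors' MaMe/MaRe algorithm: the anchor choice (evenly strided tokens), the dot-product similarity, the `temperature` parameter, and all function names are assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def merge_tokens(tokens, num_merged, temperature=0.1):
    """Merge N tokens into num_merged tokens (num_merged <= N) via a soft assignment.

    Hypothetical scheme: evenly strided tokens act as anchors, and every
    token is softly assigned to the anchors by dot-product similarity.
    """
    stride = len(tokens) // num_merged
    anchors = tokens[::stride][:num_merged]
    similarity = matmul(tokens, transpose(anchors))  # N x K
    assign = softmax_rows([[s / temperature for s in row] for row in similarity])
    # Column-normalise so each merged token is a weighted average of inputs.
    col_sums = [sum(col) for col in zip(*assign)]
    weights = [[a / c for a, c in zip(row, col_sums)] for row in assign]
    merged = matmul(transpose(weights), tokens)  # K x D
    return merged, assign

def restore_tokens(merged, assign):
    """Approximate inverse: redistribute merged tokens back to N positions."""
    return matmul(assign, merged)  # N x D

tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged, assign = merge_tokens(tokens, num_merged=2)
restored = restore_tokens(merged, assign)
```

Every step here is a dense matrix product or an element-wise operation, which is the property the abstract credits for GPU friendliness. The restoration is lossy: each restored token is a mixture of merged tokens, not an exact copy of the original.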

93. ❌ A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

Authors: Jason Kong, Nilesh Prasad Pandey, Flavio Ponzina, Tajana Rosing Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13440v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

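Scoring rationale: The paper's core topic is optimizing the deployment of LLMs on edge devices, which directly involves 'Large Language Models' (10 points), 'Small Language Models/On-device AI' (10 points), 'Quantization/Model Compression' (10 points), and 'Speculative Decoding/Inference Acceleration' (10 points). It proposes a lightweight quantization sensitivity analysis framework aimed at reducing model size and accelerating inference on resource-constrained edge devices, so it is highly relevant to these keywords. Other keywords, such as MoE, Scaling Laws, Alignment, and RAG, are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes a lightweight, KL-divergence-based quantization sensitivity analysis framework that identifies the components of hybrid SSM-Transformer models most susceptible to quantization, minimizing accuracy loss and improving inference speed when deploying large language models on resource-constrained edge devices.

Abstract translation (Chinese)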

在边缘设备上部署大型语言模型（LLM）面临严峻的计算和内存限制，制约了实时处理与设备端智能的实现。结合结构化状态空间模型（Structured State Space Models, SSMs）与基于Transformer的LLM的混合架构，在效率与性能之间提供了平衡。激进量化可大幅压缩模型尺寸并加速推理，但其对不同组件的不均匀影响需要谨慎管理。本研究提出一种轻量级、无需反向传播、基于代理的敏感性分析框架，用于识别混合SSM-Transformer架构中对量化引起的性能退化最敏感的组件。该方法仅依赖前向传播度量，避免了昂贵的梯度计算与重新训练，适用于因专有限制或隐私约束导致领域内数据访问受限的场景。我们还通过形式化分析表明，对于语言建模任务，库尔巴克-莱布勒散度（Kullback-Leibler divergence, KL散度）度量比广泛采用的均方误差（mean squared error, MSE）和信号量化噪声比（signal-to-quantization-noise ratio, SQNR）等指标更能有效捕捉量化敏感性。通过对SSM及混合架构的大量实验，消融研究证实基于KL散度的排序与观测到的性能下降一致，且优于其他度量指标。该框架使得在资源受限的边缘设备上以最小精度损失实际部署先进混合模型成为可能。我们进一步在英特尔Lunar Lake硬件上通过实际设备端性能分析验证了本方法，结果表明：在CPU和GPU两种执行模式下，基于KL散度指导的混合精度量化在模型尺寸和吞吐量上均与均匀INT4量化相当，同时实现了接近FP16的困惑度。代码发布于https://github.com/jasonkongie/kl-ssm-quant。

Abstract

Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL-based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss. We further validate our approach with real-world on-device profiling on Intel Lunar Lake hardware, demonstrating that KL-guided mixed-precision achieves near-FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at https://github.com/jasonkongie/kl-ssm-quant.

Keywords: Quantization, Large Language Models, On-device AI, Inference Acceleration, Model Compression, Edge Deployment, SSM-Transformer, KL Divergence
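
The abstract argues that a forward-only KL divergence ranks quantization sensitivity better than MSE. A minimal sketch of such a ranking is given below, assuming we already have output logits from the full-precision model and from variants in which one component at a time is quantized. The component names, the toy logits, and the helper functions are hypothetical; the paper's actual procedure may differ.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def rank_components_by_kl(reference_logits, quantized_logits):
    """Rank components from most to least sensitive using forward passes only."""
    reference = softmax(reference_logits)
    scores = {
        name: kl_divergence(reference, softmax(logits))
        for name, logits in quantized_logits.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores

# Toy logits: full precision vs. one quantized component at a time (made up).
reference_logits = [2.0, 1.0, 0.0]
quantized_logits = {
    "ssm_block": [0.0, 1.0, 2.0],        # prediction order flipped
    "attention_block": [2.0, 1.0, 0.1],  # small perturbation
}
ranking, scores = rank_components_by_kl(reference_logits, quantized_logits)

# A constant shift changes the logits (non-zero MSE) but not the predicted
# distribution (zero KL), one intuition for preferring KL on LM outputs.
shifted_logits = [3.0, 2.0, 1.0]
shift_mse = mse(reference_logits, shifted_logits)
shift_kl = kl_divergence(softmax(reference_logits), softmax(shifted_logits))
```

In a mixed-precision setting, the components at the top of `ranking` would be the candidates to keep at higher precision while the rest are quantized more aggressively.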

94. ❌ A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting

Authors: Junlin Li, Xinhao Song, Siqi Wang, Haibin Huang, Yili Zhao Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13427v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

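Scoring rationale: The paper studies motion generation, editing, and retargeting based on flow matching and a DiT-style transformer. It belongs to computer vision and graphics and focuses on technical innovation in motion generation models. Although it involves generative models and conditional generation, all the keywords concern large language models (LLMs) and related techniques (such as fine-tuning, alignment, reasoning, and agents) or AI applications in specific scientific fields (such as bioinformatics). The paper does not involve language models or any of these techniques, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

This paper proposes a unified generative framework based on conditional flow matching that casts text-driven motion editing and intra-structural retargeting as the same generative task. A single jointly trained model supports text-to-motion generation, zero-shot editing, and zero-shot retargeting.

Abstract translation (Chinese)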

文本驱动的动作编辑与同构重定向（源与目标共享拓扑结构但骨骼长度可能不同）传统上由输入与表征互不兼容的碎片化流程处理：编辑依赖于专门的生成式导向，而重定向则被推迟至几何后处理阶段。我们提出一种统一视角，将这两类任务均视作单一生成框架内的条件传输实例。通过利用流匹配技术的最新进展，我们证明编辑与重定向本质上是相同的生成任务，其区别仅在于推理过程中被调节的条件信号是语义信号还是结构信号。我们通过一个基于整流流的动作模型实现这一构想，该模型同时以文本提示词和目标骨骼结构为条件。我们的架构采用DiT风格Transformer，通过逐关节标记化与显式关节自注意力机制严格保障运动学依赖性，同时采用多条件无分类器引导策略以平衡文本遵循度与骨骼结构契合度。在SnapMoGen数据集及多角色Mixamo子集上的实验表明，单一训练模型即可支持文本到动作生成、零样本编辑与零样本同构重定向。相较于任务专用基线方法，这种统一方案简化了部署流程并提升了结构一致性。

Abstract

Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.

Keywords: motion generation, flow matching, text-driven editing, intra-structural retargeting, conditional transport, DiT-style transformer, classifier-free guidance, zero-shot
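
The abstract mentions a multi-condition classifier-free guidance strategy that balances text adherence against skeletal conformity. One common way to combine two conditions is to add a separately weighted guidance term for each; the sketch below shows that form. It is an assumption, not necessarily the combination rule used in the paper, and the function name, weights, and toy velocity vectors are hypothetical.

```python
def multi_condition_cfg(v_uncond, v_text, v_skeleton, w_text, w_skeleton):
    """Combine velocity predictions with one guidance weight per condition.

    v_uncond:   prediction with both conditions dropped
    v_text:     prediction conditioned on the text prompt only
    v_skeleton: prediction conditioned on the target skeleton only
    """
    return [
        u + w_text * (t - u) + w_skeleton * (s - u)
        for u, t, s in zip(v_uncond, v_text, v_skeleton)
    ]

# Toy 2-D velocities (made up): text pushes along x, skeleton along y.
v_uncond = [0.0, 0.0]
v_text = [1.0, 0.0]
v_skeleton = [0.0, 1.0]

guided = multi_condition_cfg(v_uncond, v_text, v_skeleton, w_text=2.0, w_skeleton=0.5)
unguided = multi_condition_cfg(v_uncond, v_text, v_skeleton, w_text=0.0, w_skeleton=0.0)
```

Under this form, raising `w_text` while holding `w_skeleton` fixed steers the sample toward the prompt (editing), and changing the skeleton condition while holding the text fixed corresponds to retargeting, which matches the abstract's framing of both tasks as the same conditional generation problem.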

95. ❌ Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

Authors: Jinlin You, Muyu Li, Xudong Zhao Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13426v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

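Scoring rationale: The paper studies a Vision Mamba-based RGB-Event object tracking method. It belongs to computer vision and involves dynamic state space models, event camera data processing, and multimodal fusion. All scoring keywords concern large-model topics such as large language models, deep learning principles, and AI for science, whereas this paper focuses on the specific application of visual tracking and does not involve any large-model techniques, training methods, inference optimization, or AI-for-science applications. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

To address the inability of static state transition matrices in existing RGB-Event tracking methods to adapt to changes in event sparsity, this paper proposes the MambaTrack framework. It achieves more robust cross-modal tracking through an event-adaptive state transition mechanism and a Gated Projection Fusion module, and reaches state-of-the-art performance on the FE108 and FELT datasets.

Abstract translation (Chinese)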

现有基于视觉Mamba的RGB-事件（RGBE）跟踪方法因采用静态状态转移矩阵而存在局限，无法适应事件稀疏度的动态变化。这种刚性机制导致建模失衡——对稀疏事件流欠拟合，对密集事件流过拟合——从而削弱了跨模态融合的鲁棒性。为突破这些限制，我们提出MambaTrack：一种基于动态状态空间模型（Dynamic State Space Model, DSSM）的多模态高效跟踪框架。我们的贡献主要体现在两方面。首先，我们设计了事件自适应的状态转移机制，能够根据事件流密度动态调节状态转移矩阵。通过可学习标量控制状态演化速率，实现对稀疏与稠密事件流的差异化建模。其次，我们开发了门控投影融合（Gated Projection Fusion, GPF）模块以实现鲁棒的跨模态集成。该模块将RGB特征投影至事件特征空间，并基于事件密度与RGB置信度生成自适应门控信号。这些门控精确调控融合强度，在抑制噪声的同时保留互补信息。实验表明，MambaTrack在FE108和FELT数据集上取得了最先进的性能。其轻量化设计展现出在实时嵌入式部署中的应用潜力。

Abstract

Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.

Keywords: RGB-Event tracking, Vision Mamba, Dynamic State Space Model, event-adaptive state transition, Gated Projection Fusion, multimodal fusion, object tracking, real-time embedded deployment
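
The two mechanisms in the abstract, a density-dependent state transition and a gate driven by event density and RGB confidence, can be illustrated with a scalar toy model. The sketch below is only an illustration under assumed functional forms: the exponential decay, the scalar `gamma` standing in for the learnable rate, the sigmoid gate, and the additive fusion are all hypothetical and are not taken from the MambaTrack implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def event_adaptive_scan(inputs, densities, base_rate=0.5, gamma=1.0):
    """Scalar state-space recurrence whose decay depends on event density.

    gamma plays the role of a learnable scalar: denser event streams make
    the state forget faster and follow the new input more closely.
    """
    state = 0.0
    states = []
    for x, density in zip(inputs, densities):
        decay = math.exp(-base_rate * (1.0 + gamma * density))
        state = decay * state + (1.0 - decay) * x
        states.append(state)
    return states

def gated_fusion(event_feat, rgb_projected, density, rgb_confidence,
                 w_density=1.0, w_conf=1.0, bias=0.0):
    """Add projected RGB features to event features, scaled by a scalar gate."""
    gate = sigmoid(w_density * density + w_conf * rgb_confidence + bias)
    return [e + gate * r for e, r in zip(event_feat, rgb_projected)]

sparse_states = event_adaptive_scan([1.0, 1.0, 1.0], densities=[0.0, 0.0, 0.0])
dense_states = event_adaptive_scan([1.0, 1.0, 1.0], densities=[5.0, 5.0, 5.0])
fused = gated_fusion([1.0], [2.0], density=0.0, rgb_confidence=0.0)
```

With a static transition, `decay` would be the same constant for every step; making it a function of density is what lets one recurrence treat sparse and dense streams differently.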

96. ❌ MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Authors: Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13418v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

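Scoring rationale: MERRIN focuses on evaluating how well search-augmented AI agents retrieve and reason over multimodal evidence in noisy web environments, and is highly relevant to several keywords: (1) it uses LLMs (such as GPT-5.4-mini, Gemini, and Qwen) as the agents' base models (8 points); (2) it directly involves retrieval-augmented generation (RAG) for retrieving multimodal evidence from the web (10 points); (3) it evaluates the agents' multi-hop reasoning, which relates to Chain of Thought (8 points); (4) the agents use tools (such as web search) for evidence retrieval (8 points); (5) agentic workflows are the core object of study (10 points); (6) it partly involves in-depth reasoning (5 points) and factuality challenges (5 points). Other keywords, such as MoE, quantization, and alignment, are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes the MERRIN benchmark for evaluating AI agents' ability to retrieve and reason over multimodal evidence in noisy web environments. It finds that existing agents average only 22.3% accuracy, that the best agent reaches only 40.1%, and that agents suffer from over-exploration and overreliance on text modalities.

Abstract translation (Chinese)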

受搜索查询本身存在的未明确指定与多跳特性，以及现实网络结果多模态、异构且常相互冲突的特质所驱动，我们提出了MERRIN（嘈杂网络环境中的多模态证据检索与推理基准），这是一个用于评估搜索增强智能体的人工标注基准。MERRIN旨在衡量AI智能体在识别相关模态、检索多模态证据以及对嘈杂网络源进行多跳推理方面的能力。它在三个重要方面区别于先前工作：(1) 使用无明确模态提示的自然语言查询，(2) 纳入视频、音频等尚未被充分探索的模态，(3) 要求在网络搜索过程中检索复杂、常伴有噪声或相互冲突的多模态证据。我们评估了由十种模型驱动的多样化搜索智能体，包括强大的闭源模型（如GPT-5.4-mini、Gemini 3/3.1 Flash/Pro）和开源权重模型（Qwen3-4B/30B/235B），覆盖三种搜索设置（无搜索、原生搜索和智能体搜索）。结果显示，MERRIN极具挑战性：所有智能体的平均准确率仅为22.3%，表现最佳的智能体也仅达到40.1%。我们进一步观察到，尽管如Gemini Deep Research等更强的智能体取得了更高性能，但由于过度探索，其提升有限；它们采取了更多步骤、使用了更多工具，却常被相互冲突或部分相关的网络内容干扰，导致得出错误答案。与人类相比，这些智能体消耗了更多资源却获得更低准确率，这主要源于低效的源选择和对文本模态的过度依赖。这些发现凸显了在嘈杂网络环境中，亟需具备跨多模态进行稳健搜索与推理能力的搜索智能体，而MERRIN为此类能力的评估提供了一个宝贵的测试平台。

Abstract

Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents’ ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.

Keywords: multimodal evidence retrieval, noisy web environments, search-augmented agents, multi-hop reasoning, benchmark evaluation, AI agents, tool use, conflicting evidence

97. ❌ The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability

Authors: Jonathan Pan Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13417v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

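Scoring rationale: The paper directly studies the reliability of LLMs, in particular hallucination detection, so it is highly relevant to 'Large Language Models' and 'Hallucination Mitigation' (10 points each). It mentions RAG as an existing approach but proposes an alternative, so it has some connection to 'Retrieval-Augmented Generation' (8 points). It detects cognitive dissonance by analyzing the model's internal states (hidden states, linear probes), which touches on model interpretability and relates to 'Mechanistic Interpretability' (8 points). Other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for Science, are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

Addressing hallucination detection for LLMs deployed in critical systems, this paper proposes a systems engineering framework called the 'Cognitive Circuit Breaker'. It computes a 'Cognitive Dissonance Delta' between the model's outward semantic confidence and its internal latent certainty to provide low-latency intrinsic reliability monitoring.

Abstract translation (Chinese)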

随着大型语言模型（LLM）越来越多地被部署在关键任务软件系统中，检测幻觉与“伪真实性”已成为一项至关重要的工程挑战。当前的可信性架构严重依赖于生成后的黑盒机制，例如检索增强生成（RAG）交叉验证或使用LLM作为评判者的评估器。这些外部方法引入了不可接受的延迟、高昂的计算开销，并且依赖于二次外部API调用，常常违反标准的软件工程服务等级协议（SLA）。本文提出“认知断路器”，一种新颖的系统工程框架，它能够以最小的延迟开销提供内在的可信性监控。通过在模型前向传播过程中提取隐藏状态，我们计算“认知失调差值”——即LLM外部语义置信度（softmax概率）与其内部潜在确定性（通过线性探针推导得出）之间的数学差距。我们展示了在统计上显著的认知失调检测能力，强调了架构相关的分布外（OOD）泛化特性，并证明该框架为活跃推理流程增加的计算开销可忽略不计。

Abstract

As Large Language Models (LLMs) are increasingly deployed in mission-critical software systems, detecting hallucinations and "faked truthfulness" has become a paramount engineering challenge. Current reliability architectures rely heavily on post-generation, black-box mechanisms, such as Retrieval-Augmented Generation (RAG) cross-checking or LLM-as-a-judge evaluators. These extrinsic methods introduce unacceptable latency, high computational overhead, and reliance on secondary external API calls, frequently violating standard software engineering Service Level Agreements (SLAs). In this paper, we propose the Cognitive Circuit Breaker, a novel systems engineering framework that provides intrinsic reliability monitoring with minimal latency overhead. By extracting hidden states during a model's forward pass, we calculate the "Cognitive Dissonance Delta": the mathematical gap between an LLM’s outward semantic confidence (softmax probabilities) and its internal latent certainty (derived via linear probes). We demonstrate statistically significant detection of cognitive dissonance, highlight architecture-dependent Out-of-Distribution (OOD) generalization, and show that this framework adds negligible computational overhead to the active inference pipeline.

Keywords: Large Language Models, Hallucination Detection, Cognitive Dissonance, Intrinsic Reliability, Systems Engineering, Latency Overhead, Hidden States, Linear Probes
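
The abstract defines the Cognitive Dissonance Delta as the gap between softmax confidence and a probe-derived latent certainty. A minimal sketch of that computation follows, assuming the outward confidence is the maximum softmax probability, the probe is a logistic linear probe over one hidden-state vector, and the gap is a signed difference. The probe weights, the threshold, and the function names are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dissonance_delta(logits, hidden_state, probe_weights, probe_bias=0.0):
    """Outward confidence minus latent certainty for one generation step."""
    outward_confidence = max(softmax(logits))
    probe_logit = sum(w * h for w, h in zip(probe_weights, hidden_state)) + probe_bias
    latent_certainty = sigmoid(probe_logit)
    return outward_confidence - latent_certainty

def should_trip(delta, threshold=0.5):
    """Trip the breaker when the model sounds far surer than its probe suggests."""
    return delta > threshold

logits = [4.0, 0.0, 0.0]    # confident-looking next-token distribution
hidden_state = [1.0, -1.0]  # toy hidden state (made up)

# Probe that reads this hidden state as uncertain: large positive delta.
dissonant = dissonance_delta(logits, hidden_state, probe_weights=[-2.0, 2.0])
# Probe that reads the same state as certain: delta near zero.
consistent = dissonance_delta(logits, hidden_state, probe_weights=[2.0, -2.0])
```

Because both quantities come from a single forward pass (the logits and one dot product on a hidden state), a check of this shape adds little latency compared with a second model call, which is the trade-off the abstract emphasizes.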

98. ❌ DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Authors: Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13416v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

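Scoring rationale: The paper focuses on distractor-free novel view synthesis in computer vision. It introduces the large-scale dataset DF3DV-1K and uses it to benchmark multiple radiance field methods. Its content covers dataset construction, 3D reconstruction, and image enhancement, but does not touch on large language models, innovations in deep learning principles, or AI for Science. None of the keywords apply to the paper, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

This paper introduces DF3DV-1K, a large-scale dataset of 1,048 scenes for distractor-free novel view synthesis. It uses the dataset to benchmark multiple radiance field methods and demonstrates an application in which fine-tuning a diffusion model improves radiance field performance.

Abstract translation (Chinese)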

辐射场技术的进步已能实现照片级真实感的新视角合成。在多个领域，大规模真实世界数据集的开发为综合性基准测试提供了支持，并推动了超越场景特定重建的进展。然而，针对无干扰物辐射场研究，目前仍缺乏一个包含每个场景的干净与杂乱图像的大规模数据集，这限制了该领域的发展。为填补这一空白，我们推出了DF3DV-1K——一个包含1,048个场景的大规模真实世界数据集，每个场景均提供用于基准测试的干净与杂乱图像集。该数据集总计包含89,924张图像，使用消费级相机拍摄以模拟日常采集，涵盖室内外环境中的128种干扰物类型和161种场景主题。我们还系统设计了一个包含41个场景的精选子集DF3DV-41，用于评估无干扰物辐射场方法在挑战性场景下的鲁棒性。基于DF3DV-1K，我们对九种近期无干扰物辐射场方法及3D高斯溅射（3D Gaussian Splatting）进行了基准测试，识别出最鲁棒的方法和最富挑战性的场景。除基准测试外，我们展示了DF3DV-1K的一项应用：通过微调基于扩散的二维增强器来改进辐射场方法，在保留测试集（如DF3DV-41）和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的性能提升。我们希望DF3DV-1K能促进无干扰物视觉研究的发展，并推动超越场景特定方法的进步。

Abstract

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches.

Keywords: novel view synthesis, radiance fields, large-scale dataset, distractor-free, 3D Gaussian Splatting, benchmarking, diffusion-based enhancer, computer vision

99. ❌ From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning

Authors: Shihao Zhang, Ziwei Wang, Jie Zhou, Yulan Wu, Qin Chen, Zhikai Lei, Liyang Yu, Liang Dou, Liang He Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13398v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

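Scoring rationale: The paper proposes the ABSA-R1 framework, which uses large language models (LLMs) for sentiment analysis. Its core innovation is using reinforcement learning (RLHF/RL) to make the model generate reasoning paths (Chain of Thought/System 2 Thinking) that support its predictions, improving interpretability (Explainable AI). The paper is highly relevant to LLMs, RLHF, CoT reasoning, System 2 thinking, and explainable AI (10 points), somewhat related to alignment and self-correction (5 points), and does not touch other keywords such as MoE, quantization, or AI for Science (0 points).

!!! tip deepseek-chat TL;DR

This study addresses the lack of interpretable reasoning in sentiment analysis models by proposing ABSA-R1, a framework based on large language models and reinforcement learning that generates natural-language justifications to support its sentiment predictions. Experiments show that the method improves interpretability while also improving classification and triplet-extraction performance.

Abstract (Chinese translation)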

尽管基于方面的情感分析系统在识别情感极性方面已实现较高准确度，但其运作往往如同“黑箱”，缺乏人类情感认知特有的显式推理能力。人类不仅对情感进行分类，还会为自身判断构建因果解释。为弥合这一差距，我们提出ABSA-R1——一个旨在模拟“先推理后预测”认知过程的大语言模型框架。通过强化学习技术，ABSA-R1能够学习阐释“为何如此判断”的内在逻辑，生成支撑其情感预测的自然语言论证。我们引入认知对齐奖励模型（原情感感知奖励模型），以强化生成推理路径与最终情感标签之间的一致性。此外，受元认知监控机制启发，我们实施了性能驱动的拒绝采样策略，该策略选择性聚焦于模型内部推理存在不确定性或矛盾性的困难案例。在四个基准数据集上的实验结果表明，赋予模型这种显式推理能力不仅能提升可解释性，相较于非推理基线模型，在情感分类与三元组提取任务中也实现了更优的性能表现。

Abstract

While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as “black boxes,” lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this “reason-before-predict” cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model’s internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.

Keywords: Aspect-based Sentiment Analysis, Large Language Models, Reinforcement Learning, Reasoning, Interpretability, Cognition-Aligned Reward, Sentiment Classification, Triplet Extraction

100. ❌ Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence

Authors: Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13414v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

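Scoring rationale: The paper presents theoretical analysis and algorithm design for majority-vote ensembles under Markov-dependent data. It belongs to classical machine learning and statistical learning theory and does not involve any specific technique from large models, deep learning, or AI for Science. Every keyword concerns large models, the principles of deep learning techniques, or scientific AI applications, whereas this paper focuses on the statistical theory of classical ensemble methods, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper studies the minimax optimality of majority-vote ensemble classifiers when the training data exhibit Markov dependence, proposes an adaptive spectral routing algorithm that matches the theoretical lower bound, and validates it on synthetic data, spatial grids, and time-series data.

Abstract (Chinese translation)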

多数投票集成通过平均多样化且近似独立的基础学习器来实现方差缩减。当训练数据呈现马尔可夫依赖性时（例如在时间序列预测、强化学习（RL）回放缓冲区以及空间网格中），这一经典保证会以现有理论未能完全量化的方式发生退化。我们在固定维度的马尔可夫设定下，针对离散分类问题给出了该现象的极小极大刻画，并提出了一个在图正则子类上达到该速率的自适应算法。我们首先为固定环境维度中的平稳、可逆、几何遍历链建立了一个信息论下界，证明任何可测估计器都无法实现优于 $\Omega(\sqrt{T_{\mathrm{mix}}/n})$ 的过量分类风险。随后，我们证明在下界构造所基于的 AR(1) 见证子类上，无视依赖性的均匀装袋法是次优的，其过量风险下界为 $\Omega(T_{\mathrm{mix}}/\sqrt{n})$，呈现出 $\sqrt{T_{\mathrm{mix}}}$ 的算法间隙。最后，我们提出自适应谱路由方法，该方法通过依赖图的经验 Fiedler 特征向量对训练数据进行划分，并在图正则子类上以忽略低阶几何切割项为代价，实现了极小极大速率 $\mathcal{O}(\sqrt{T_{\mathrm{mix}}/n})$，且无需知晓 $T_{\mathrm{mix}}$。在合成马尔可夫链、二维空间网格、128 数据集的 UCR 档案以及 Atari DQN 集成上的实验验证了理论预测。关于深度 RL 目标方差、通过 Nyström 近似实现的可扩展性以及有界非平稳性的相关推论，作为支持材料在附录中给出。

Abstract

Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $\Omega(\sqrt{T_{\mathrm{mix}}/n})$. We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $\Omega(T_{\mathrm{mix}}/\sqrt{n})$, exhibiting a $\sqrt{T_{\mathrm{mix}}}$ algorithmic gap. Finally, we propose adaptive spectral routing, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\mathcal{O}(\sqrt{T_{\mathrm{mix}}/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $T_{\mathrm{mix}}$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nyström approximation, and bounded non-stationarity are developed as supporting material in the appendix.
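The abstract's central operation, splitting training data by the sign of the Fiedler eigenvector of a dependency graph, can be illustrated with a minimal sketch. The abstract does not say how the dependency graph is built or how the partitions are routed to ensemble members, so everything here is an assumption: the graph is a hypothetical path graph over six consecutive time-series samples, and the eigenvector is found by plain power iteration rather than by whatever solver the authors use.

```python
def fiedler_partition(adj, iters=300):
    """Split node indices by the sign of an approximate Fiedler vector.

    adj is a symmetric 0/1 adjacency matrix of a connected graph.
    Power iteration on (c*I - L), with the constant vector projected out,
    converges to the eigenvector of the second-smallest eigenvalue of the
    Laplacian L = D - A.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    c = 2.0 * max(deg)  # upper bound on the largest Laplacian eigenvalue
    # Deterministic start; it must not be orthogonal to the Fiedler vector.
    v = [float(i) for i in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the trivial constant component
        w = [
            (c - deg[i]) * v[i] + sum(adj[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    left = [i for i in range(n) if v[i] < 0]
    right = [i for i in range(n) if v[i] >= 0]
    return left, right


# Hypothetical dependency graph: sample i depends on samples i-1 and i+1.
n = 6
path = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]
print(fiedler_partition(path))
```

On a chain-shaped graph the split falls in the middle, so each half is a contiguous block of strongly dependent samples, which matches the intuition that the cut should sever as few dependencies as possible.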

Keywords: majority-vote ensembles, Markov dependence, minimax optimality, spectral routing, classification risk, adaptive algorithm, time-series forecasting, reinforcement learning

101. ❌ Quantifying and Understanding Uncertainty in Large Reasoning Models

Authors: Yangyi Li, Chenxu Zhao, Mengdi Huai Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13395v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

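Scoring rationale: The paper studies how to quantify and explain reasoning uncertainty in Large Reasoning Models (LRMs), which makes it highly relevant to 'Large Language Models' (LRMs are a subclass of LLMs). Its core concern is the reasoning process, which maps directly to 'Chain of Thought' and 'System 2 Thinking'. The method uses Shapley values for explanation, which is highly relevant to 'Mechanistic Interpretability'. Uncertainty quantification is related to factuality and truthfulness ('Hallucination Mitigation'). Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

This paper addresses the challenge of quantifying uncertainty over the reasoning-answer structure of Large Reasoning Models (LRMs) on complex reasoning tasks. It proposes a new method with statistical guarantees and develops a unified Shapley-value-based explanation framework that identifies the relevant training examples and key reasoning steps; experiments confirm the method's effectiveness.

Abstract (Chinese translation)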

大型推理模型（LRMs）近期在复杂推理任务中展现出显著性能提升。虽然量化LRMs生成过程的不确定性至关重要，但传统方法往往存在不足，因其无法为推理-答案生成提供有限样本保证。共形预测（Conformal Prediction，CP）作为一种无分布且与模型无关的方法论脱颖而出，能够构建统计严格的不确定性集合。然而，现有CP方法忽略了推理轨迹与最终答案之间的逻辑关联。此外，先前研究未能阐释LRMs不确定性覆盖的来源，通常忽视了驱动有效推理的具体训练因素。值得注意的是，在量化不确定性时，将推理质量与答案正确性分离具有挑战性，同时还需为计算高效的解释方法建立理论保证。为应对这些挑战，我们首先提出一种新颖方法，可在统计保证下量化推理-答案结构的不确定性。随后，我们基于沙普利值开发了统一的示例到步骤解释框架，该框架可识别理论上充分的训练示例子集及其关键推理步骤，以保持原有统计保证。我们还对所提方法进行了理论分析。在多个具有挑战性的推理数据集上的大量实验验证了所提方法的有效性。

Abstract

Large Reasoning Models (LRMs) have recently demonstrated significant improvements in complex reasoning. While quantifying generation uncertainty in LRMs is crucial, traditional methods are often insufficient because they do not provide finite-sample guarantees for reasoning-answer generation. Conformal prediction (CP) stands out as a distribution-free and model-agnostic methodology that constructs statistically rigorous uncertainty sets. However, existing CP methods ignore the logical connection between the reasoning trace and the final answer. Additionally, prior studies fail to interpret the origins of uncertainty coverage for LRMs as they typically overlook the specific training factors driving valid reasoning. Notably, it is challenging to disentangle reasoning quality from answer correctness when quantifying uncertainty, while simultaneously establishing theoretical guarantees for computationally efficient explanation methods. To address these challenges, we first propose a novel methodology that quantifies uncertainty in the reasoning-answer structure with statistical guarantees. Subsequently, we develop a unified example-to-step explanation framework using Shapley values that identifies a provably sufficient subset of training examples and their key reasoning steps to preserve the guarantees. We also provide theoretical analyses of our proposed methods. Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods.
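For readers unfamiliar with conformal prediction, the sketch below shows the generic split-conformal recipe that the abstract builds on: calibrate a threshold on held-out nonconformity scores, then keep every candidate answer whose score falls under it. This is the textbook procedure, not the paper's reasoning-answer method; the scores, labels, and function names are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    With n exchangeable calibration points, using the ceil((n + 1)(1 - alpha))-th
    smallest score gives marginal coverage of at least 1 - alpha.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: only the trivial set is valid.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return [label for label, s in candidate_scores.items() if s <= threshold]


# Hypothetical nonconformity scores (higher means the answer fits worse).
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
threshold = conformal_threshold(cal_scores, alpha=0.25)
print(prediction_set({"A": 0.15, "B": 0.55, "C": 0.90}, threshold))
```

The guarantee is distribution-free but only marginal, and it says nothing about whether the reasoning trace behind an answer is sound, which is the gap the abstract says existing CP methods leave open.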

Keywords: Large Reasoning Models, Uncertainty Quantification, Conformal Prediction, Reasoning Trace, Shapley Values, Statistical Guarantees, Explanation Framework, Complex Reasoning

102. ❌ On the Use of Evolutionary Optimization for the Dynamic Chance Constrained Open-Pit Mine Scheduling Problem

Authors: Ishara Hewa Pathiranage, Aneta Neumann Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13385v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

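Scoring rationale: The paper studies open-pit mine scheduling optimization, using evolutionary algorithms to handle dynamic constraints and uncertainty; it belongs to operations research and industrial engineering. Every keyword concerns large models, deep learning, and related techniques (training methods, inference optimization, alignment, applications, and so on), none of which this paper touches, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper studies the dynamic chance-constrained open-pit mine scheduling problem and proposes a diversity-based change-response mechanism. Experiments show that it outperforms baseline methods across different uncertainty levels and change frequencies.

Abstract (Chinese translation)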

露天矿调度是一个复杂的现实世界优化问题，涉及不确定的经济价值和动态变化的资源能力。进化算法在此类场景中尤为有效，因为它们能够轻松适应不确定和变化的环境。然而，在现实问题中，不确定性与动态变化往往被孤立研究。本文研究了一个动态机会约束露天矿调度问题，其中矿块经济价值具有随机性，且开采与处理能力随时间变化。我们采用了一种双目标进化模型，同时最大化期望贴现利润并最小化其标准差。为应对动态变化，我们提出了一种基于多样性的变化响应机制：每当检测到变化时，该机制会修复部分不可行解并引入额外的可行解。我们在四种多目标进化算法中评估了该机制的有效性，并将其与基于重新评估的基准变化响应策略进行比较。在六个采矿实例上的实验结果表明，所提出的方法在不同不确定性水平和变化频率下均持续优于基准方法。

Abstract

Open-pit mine scheduling is a complex real world optimization problem that involves uncertain economic values and dynamically changing resource capacities. Evolutionary algorithms are particularly effective in these scenarios, as they can easily adapt to uncertain and changing environments. However, uncertainty and dynamic changes are often studied in isolation in real-world problems. In this paper, we study a dynamic chance-constrained open-pit mine scheduling problem in which block economic values are stochastic and mining and processing capacities vary over time. We adopt a bi-objective evolutionary formulation that simultaneously maximizes expected discounted profit and minimizes its standard deviation. To address dynamic changes, we propose a diversity-based change response mechanism that repairs a subset of infeasible solutions and introduces additional feasible solutions whenever a change is detected. We evaluate the effectiveness of this mechanism across four multi-objective evolutionary algorithms and compare it with a baseline re-evaluation-based change-response strategy. Experimental results on six mining instances demonstrate that the proposed approach consistently outperforms the baseline methods across different uncertainty levels and change frequencies.
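The bi-objective formulation described above (maximize expected discounted profit, minimize its standard deviation) can be made concrete with a small sketch. The abstract does not give the model's details, so the following rests on stated assumptions: block values are independent with known mean and standard deviation, a block mined in period t is discounted by 1 / (1 + rate)^t, and all names and numbers are hypothetical.

```python
import math


def objectives(schedule, mean, std, rate):
    """Return (expected discounted profit, standard deviation) of a schedule.

    schedule maps each block to the 1-based period in which it is mined.
    Assumes independent block values, so variances of discounted terms add.
    """
    expected = 0.0
    variance = 0.0
    for block, period in schedule.items():
        discount = 1.0 / (1.0 + rate) ** period
        expected += discount * mean[block]
        variance += (discount * std[block]) ** 2
    return expected, math.sqrt(variance)


def dominates(a, b):
    """Pareto dominance for (profit, std): higher profit and lower std are better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


mean = {"b1": 10.0, "b2": 20.0}
std = {"b1": 3.0, "b2": 4.0}
early = objectives({"b1": 1, "b2": 1}, mean, std, rate=0.1)
late = objectives({"b1": 1, "b2": 2}, mean, std, rate=0.1)
print(early, late, dominates(early, late))
```

In this toy instance neither schedule dominates the other: mining the second block later lowers expected profit but also lowers risk. A multi-objective evolutionary algorithm maintains a population of such mutually non-dominated trade-offs.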

Keywords: open-pit mine scheduling, evolutionary algorithms, dynamic chance-constrained, bi-objective optimization, diversity-based change response, uncertainty, multi-objective evolutionary algorithms, stochastic block values

103. ❌ ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

Authors: Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13392v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

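Scoring rationale: ReSS proposes a framework that combines symbolic and neural reasoning: it uses decision trees to extract symbolic scaffolds that guide an LLM to generate faithful reasoning, then fine-tunes an LLM for tabular data prediction. The core relevant keywords are LLMs (the core method), Supervised Fine-tuning (fine-tuning the LLM), Chain of Thought (generating natural-language reasoning), Explainable AI (improving interpretability), Hallucination Mitigation (evaluating faithfulness), and AI for Science (applications in healthcare and finance). System 2 Thinking is indirectly related (the work involves in-depth reasoning) but is not central. The remaining keywords, such as MoE, SLMs, and Scaling Laws, are unrelated to the paper.

!!! tip deepseek-chat TL;DR

ReSS addresses the need for both high accuracy and faithful reasoning in tabular data prediction. By combining symbolic scaffolds with LLM fine-tuning, it improves model performance on healthcare and finance benchmarks and produces interpretable reasoning.

Abstract (Chinese translation)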

表格数据在医疗和金融等高风险领域仍占主导地位，这些领域要求预测模型兼具高准确性与忠实可靠、人类可理解的推理过程。尽管符号模型提供可验证的逻辑，但其语义表达能力不足。与此同时，通用大语言模型通常需要专门的微调才能掌握特定领域的表格推理。为应对可扩展数据构建与推理一致性的双重挑战，我们提出ReSS框架，这是一个连接符号与神经推理模型的系统性方法。ReSS利用决策树模型提取实例级决策路径作为符号支架。这些支架与输入特征及标签共同引导大语言模型生成严格遵循底层决策逻辑的、基于事实的自然语言推理。由此产生的高质量数据集用于将预训练大语言模型微调为专用表格推理模型，并通过支架不变的数据增强策略进一步提升其泛化能力与可解释性。为严格评估推理忠实度，我们引入包括幻觉率、解释必要性及解释充分性在内的量化指标。在医疗与金融基准测试上的实验结果表明，经ReSS训练的模型相较于传统决策树与标准微调方法，性能提升最高可达10%，同时生成忠实且一致的推理过程。

Abstract

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve on traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
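The notion of an instance-level decision path used as a symbolic scaffold can be illustrated with a toy example. The sketch below walks one record through a hand-written decision tree and records each test it passes; the tree, feature names, thresholds, and output wording are all hypothetical, and the paper's actual scaffold format and prompting are not described in the abstract.

```python
# Hypothetical two-level decision tree for a toy risk-prediction task.
TREE = {
    "feature": "age",
    "threshold": 50,
    "left": {"label": "low risk"},
    "right": {
        "feature": "systolic_bp",
        "threshold": 140,
        "left": {"label": "medium risk"},
        "right": {"label": "high risk"},
    },
}


def decision_path(node, record):
    """Return the list of tests a record passes and the leaf label it reaches."""
    steps = []
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        value = record[feature]
        if value <= threshold:
            steps.append(f"{feature} = {value} <= {threshold}")
            node = node["left"]
        else:
            steps.append(f"{feature} = {value} > {threshold}")
            node = node["right"]
    return steps, node["label"]


steps, label = decision_path(TREE, {"age": 63, "systolic_bp": 150})
scaffold = "; ".join(steps) + f" => {label}"
print(scaffold)
```

A scaffold string of this kind could be placed in a prompt next to the raw features and the label, so that the generated explanation has explicit conditions to stay consistent with.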

Keywords: tabular data prediction, symbolic reasoning, large language models, fine-tuning, explainable AI, hallucination mitigation, decision trees, reasoning consistency

104. ❌ A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings

Authors: Caiwen Jiang, Lei Zeng, Wei Liu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13367v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

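Scoring rationale: This paper focuses on medical image segmentation, specifically 3D segmentation of radiotherapy-induced normal tissue injuries. It proposes a progressive prompting framework based on 3D SAM (Segment Anything Model) for multi-task segmentation in limited-data settings. The paper's core is computer vision and medical image analysis rather than innovation in large language models (LLMs) or the underlying principles of deep learning techniques. Every keyword except the last concerns LLMs directly: their training methods, reasoning techniques, alignment, agents, or LLM-specific optimizations. The paper does not discuss LLMs, MoE, scaling laws, pre-training, fine-tuning, alignment, RLHF, PEFT, RAG, context windows, attention optimization, reasoning techniques, agents, tool use, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning. It does, however, fall under 'AI for Science', specifically bioinformatics and medical image analysis, so that keyword scores 10 (highly relevant).

!!! tip deepseek-chat TL;DR

This paper proposes a 3D SAM-based progressive prompting framework for segmenting radiotherapy-induced normal tissue injuries in limited-data settings, and achieves reliable segmentation performance that outperforms existing methods on a head-and-neck injury dataset.

Abstract (Chinese translation)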

放疗诱发的正常组织损伤是临床上重要的并发症，从医学影像中准确分割损伤区域有助于疾病评估、治疗规划和纵向监测。然而，由于体素级标注有限，且损伤类型、病灶大小和成像模态之间存在显著异质性，这些病灶的自动分割研究仍处于探索不足的状态。为填补这一空白，我们构建了一个专用的头颈部放疗诱发正常组织损伤数据集，涵盖三种临床表现：放射性骨坏死（ORN）、脑水肿（CE）和放射性脑坏死（CRN）。我们进一步提出了一种基于3D SAM的渐进式提示框架，用于有限数据环境下的多任务分割。该框架逐步整合了三种互补的提示：用于任务感知适应的文本提示、用于粗定位的剂量引导框提示，以及用于迭代优化的点击提示。此外，引入了一种小目标聚焦损失函数，以提升对小而稀疏病灶的局部预测和边界勾勒能力。在ORN、CE和CRN上的实验表明，所提方法在不同损伤类型中均实现了可靠的分割性能，并优于现有先进方法。

Abstract

Radiotherapy-induced normal tissue injury is a clinically important complication, and accurate segmentation of injury regions from medical images could facilitate disease assessment, treatment planning, and longitudinal monitoring. However, automatic segmentation of these lesions remains largely unexplored because of limited voxel-level annotations and substantial heterogeneity across injury types, lesion size, and imaging modality. To address this gap, we curate a dedicated head-and-neck radiotherapy-induced normal tissue injury dataset covering three manifestations: osteoradionecrosis (ORN), cerebral edema (CE), and cerebral radiation necrosis (CRN). We further propose a 3D SAM-based progressive prompting framework for multi-task segmentation in limited-data settings. The framework progressively incorporates three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement. A small-target focus loss is introduced to improve local prediction and boundary delineation for small and sparse lesions. Experiments on ORN, CE, and CRN demonstrate that the proposed method achieves reliable segmentation performance across diverse injury types and outperforms state-of-the-art methods.

Keywords: 3D SAM, progressive prompting, multi-task segmentation, radiotherapy-induced normal tissue injury, limited-data settings, medical image segmentation, head-and-neck dataset, small-target focus loss

105. ❌ Peer-Predictive Self-Training for Language Model Reasoning

Authors: Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, Yiling Chen Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13356v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

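Scoring rationale: The paper proposes the Peer-Predictive Self-Training (PST) framework, a self-improvement method for large and small language models on mathematical reasoning tasks. Core relevant keywords: (1) the paper explicitly uses Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, which fall under LLMs/SLMs (weight 1.0, relevance 10); (2) the method falls under post-training/fine-tuning (weight 1.0, relevance 10); (3) it is applied to mathematical reasoning tasks that involve multi-step reasoning (Chain of Thought, weight 1.0, relevance 10); (4) its core is a self-improvement mechanism (Self-Correction/Self-Improvement, weight 1.0, relevance 10). Other keywords such as MoE, Scaling Laws, and Instruction Tuning are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper proposes Peer-Predictive Self-Training, an unsupervised framework in which multiple language models collaborate to produce an aggregated answer that serves as an internal training signal. On mathematical reasoning benchmarks it improves the exact-match accuracy of small language models by 2.2 to 4.3 percentage points and reduces the average generator-verifier gap by 26 to 40 percent.

Abstract (Chinese translation)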

语言模型无需外部监督的持续自我改进机制仍是一个开放挑战。我们提出同伴预测自训练（Peer-Predictive Self-Training，PST），这是一种无标签微调框架，其中多个语言模型通过利用跨模型聚合响应作为内部训练信号进行协作提升。给定一个提示问题，模型依次生成响应；最终聚合的答案在实践中通常比单个响应更可靠，可作为内部学习目标。我们使用点互信息（pointwise mutual information，PMI）衡量每个中间响应关于聚合结果的信息量，并利用该信号缩放自训练更新。已与聚合结果对齐的响应更新较少，而信息量较低或未对齐的响应则更新更多。在数学推理基准测试（SimulEq、Math500和MultiArith）中，PST将Gemma-2-2B、LLaMA-3.2-1B和Qwen-2.5-1.5B的精确匹配准确率提升了2.2至4.3个百分点，并将平均生成器-验证器差距（generator-verifier gap，GV-Gap）降低了26%至40%，且无需外部监督或师生层级结构，仅依赖跨模型交互。这些结果表明，跨模型生成与同伴预测反馈可作为一种有效的自监督训练途径。

Abstract

Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.
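The PMI-based scaling described above can be sketched numerically. Pointwise mutual information between a response r and the aggregate a is log p(a | r) − log p(a); the abstract states only that aligned responses are updated less and misaligned ones more, so the logistic mapping from PMI to an update weight below is an illustrative assumption, as are the probabilities and function names.

```python
import math


def pmi(logp_agg_given_resp, logp_agg):
    """Pointwise mutual information between a response and the aggregate answer."""
    return logp_agg_given_resp - logp_agg


def update_weight(pmi_value, temperature=1.0):
    """Hypothetical scaling: a weight in (0, 1) that decreases as PMI increases."""
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


# Hypothetical probabilities of the aggregated answer.
logp_agg = math.log(0.3)  # without conditioning on any response
aligned = pmi(math.log(0.9), logp_agg)  # response makes the aggregate likelier
misaligned = pmi(math.log(0.1), logp_agg)  # response makes it less likely

print(update_weight(aligned), update_weight(misaligned))
```

Under this mapping the aligned response receives a weight near 0.25 and the misaligned one near 0.75, so the larger update goes to the response that carries less information about the aggregate.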

Keywords: self-training, language model reasoning, peer-predictive, label-free fine-tuning, mathematical reasoning, cross-model aggregation, self-improvement, internal training signal

106. ❌ Finetuning-Free Diffusion Model with Adaptive Constraint Guidance for Inorganic Crystal Structure Generation

Authors: Auguste de Lambilly, Vladimir Baturin, David Portehault, Guillaume Lambard, Nataliya Sokolovska, Florence d’Alché-Buc, Jean-Claude Crivello Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13354v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

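Scoring rationale: The paper focuses on generating inorganic crystal structures with diffusion models, an application of AI to science (materials science). It is unrelated to the vast majority of the keywords, which cover LLM principles, training methods, inference optimization, agents, and so on. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the paper applies AI to materials science (arguably adjacent to scientific computing or cheminformatics). That is not its core contribution, however (the core is the diffusion model and constraint guidance), so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The study proposes a diffusion-model framework with adaptive constraint guidance for generating thermodynamically plausible inorganic crystal structures that satisfy user-defined physical and chemical constraints, and checks the candidates with a multi-step validation pipeline (graph neural network estimators plus convex hull analysis).

Abstract (Chinese translation)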

在材料科学中，发现具有目标性质的无机晶体结构是一项重大挑战。生成模型，尤其是最先进的扩散模型，为模拟复杂数据分布和提出新颖、真实的样本提供了可能。然而，当前的生成式人工智能模型仍难以产生适用于高风险应用的、多样化、原创且可靠的、可通过实验实现的材料结构。
在本研究中，我们提出了一种基于扩散模型的自适应约束引导生成式机器学习框架，该框架能够在生成过程中融入用户定义的物理和化学约束。这一方法旨在为人类专家提供实用且可解释的工具，支持透明的决策制定和专家驱动的探索。为确保生成候选结构的稳健性与有效性，我们引入了一个多步骤验证流程，该流程结合了训练至达到密度泛函理论精度的图神经网络估计器，以及用于评估热力学稳定性的凸包分析。我们的方法已在多个经典的无机化合物家族案例研究中得到测试与验证。初步结果表明，我们的框架能够生成热力学上合理的晶体结构，这些结构满足跨不同无机化学体系的目标几何约束。

Abstract

The discovery of inorganic crystal structures with targeted properties is a significant challenge in materials science. Generative models, especially state-of-the-art diffusion models, offer the promise of modeling complex data distributions and proposing novel, realistic samples. However, current generative AI models still struggle to produce diverse, original, and reliable structures of experimentally achievable materials suitable for high-stakes applications. In this work, we propose a generative machine learning framework based on diffusion models with adaptive constraint guidance, which enables the incorporation of user-defined physical and chemical constraints during the generation process. This approach is designed to be practical and interpretable for human experts, allowing transparent decision-making and expert-driven exploration. To ensure the robustness and validity of the generated candidates, we introduce a multi-step validation pipeline that combines graph neural network estimators trained to achieve DFT-level accuracy and convex hull analysis for assessing thermodynamic stability. Our approach has been tested and validated on several classical examples of inorganic families of compounds, as case studies. As a consequence, these preliminary results demonstrate our framework’s ability to generate thermodynamically plausible crystal structures that satisfy targeted geometric constraints across diverse inorganic chemical systems.

Keywords: diffusion models, inorganic crystal structure generation, adaptive constraint guidance, thermodynamic stability, graph neural network, materials science, generative AI, validation pipeline
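
The convex hull analysis mentioned in the abstract can be sketched for the simplest case, a binary system. The code below is an illustrative assumption, not the authors' pipeline: it builds the lower convex hull of (composition, formation energy) points and reports how far a candidate lies above it. The function names and all numeric values are hypothetical.

```python
def lower_hull(points):
    """Lower convex hull of (composition, energy) points (monotone chain)."""
    best = {}
    for x, e in points:  # keep only the lowest energy per composition
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2:
            (x1, e1), (x2, e2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - e1) - (e2 - e1) * (p[0] - x1)
            if cross > 0:  # hull[-1] lies below the chord, so it stays
                break
            hull.pop()     # otherwise it is not on the lower hull
        hull.append(p)
    return hull


def energy_above_hull(x, energy, hull):
    """Energy of a candidate relative to the hull at composition x."""
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            hull_energy = e1 + (e2 - e1) * (x - x1) / (x2 - x1)
            return energy - hull_energy
    raise ValueError("composition outside the range covered by the hull")


# Made-up reference phases: (fraction of element B, formation energy).
known = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0), (0.25, -0.2)]
hull = lower_hull(known)
e_hull = energy_above_hull(0.25, -0.4, hull)  # candidate at x = 0.25
```

A candidate with a small positive energy above the hull is usually treated as potentially metastable, while a large value suggests it would decompose into the hull phases. Any threshold is a modeling choice that the abstract does not specify.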

107. ❌ WebXSkill: Skill Learning for Autonomous Web Agents

Authors: Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, Huaxiu Yao Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13318v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

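Scoring rationale: The paper's core is a skill-learning framework for LLM-based autonomous web agents. The directly and highly relevant keywords are LLMs (10 points), LLM Agents (10 points), and Tool Use (10 points). The paper touches on agent planning, reasoning, and adaptation, so Chain of Thought, System 2 Thinking, and Self-Correction are somewhat related (5 points each). The remaining keywords (MoE, SLMs, Scaling Laws, training methods, RAG, compression, AI for Science, and so on) are not addressed and score 0.

!!! tip deepseek-chat TL;DR

To address the grounding gap that LLM-based autonomous web agents face on long-horizon tasks, the paper proposes WebXSkill, a framework that builds executable skills by pairing parameterized action programs with natural-language guidance. It improves task success rate by up to 9.8 and 12.9 points on WebArena and WebVoyager, respectively.

Abstract (Chinese translation)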

由大型语言模型驱动的自主网络代理在完成复杂浏览器任务方面展现出潜力，但其在长流程工作处理中仍面临困难。现有技能范式的核心瓶颈在于其落地鸿沟：文本化工作流技能虽能提供自然语言指导却无法直接执行，而基于代码的技能虽可执行但对代理而言不透明，无法提供用于错误恢复或自适应调整的步骤级理解。我们提出WebXSkill框架，通过可执行技能弥合这一鸿沟——每个技能将参数化的动作程序与步骤级自然语言指导相结合，既能直接执行，也支持代理驱动的自适应调整。WebXSkill运行分为三个阶段：技能提取从易得的合成代理轨迹中挖掘可复用的动作子序列，并将其抽象为参数化技能；技能组织将技能索引至基于URL的图谱中，实现情境感知检索；技能部署提供两种互补模式——完全自动化多步执行的落地模式，以及将技能作为逐步指令、由代理通过原生规划进行跟随的引导模式。在WebArena和WebVoyager基准测试中，WebXSkill分别将任务成功率较基线提升9.8和12.9个百分点，验证了可执行技能对网络代理的有效性。代码已公开于https://github.com/aiming-lab/WebXSkill。

Abstract

Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL-based graph for context-aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi-step execution and guided mode where skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming-lab/WebXSkill.

Keywords: autonomous web agents, large language models, skill learning, executable skills, parameterized action programs, step-level guidance, task success rate, WebArena
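
The idea of an executable skill that pairs a parameterized action program with step-level guidance can be made concrete with a small data-structure sketch. Everything here (the `Step` and `Skill` classes, the action-string syntax, the `url_pattern` field) is a hypothetical illustration and not the WebXSkill API; see the linked repository for the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str    # parameterized action template, e.g. "type #search {query}"
    guidance: str  # step-level natural-language explanation


@dataclass(frozen=True)
class Skill:
    name: str
    url_pattern: str  # hypothetical key for URL-based retrieval
    steps: tuple

    def grounded(self, **params):
        """Grounded mode: concrete actions ready for automated execution."""
        return [step.action.format(**params) for step in self.steps]

    def guided(self, **params):
        """Guided mode: numbered instructions for the agent's own planner."""
        return [f"{i}. {step.guidance.format(**params)}"
                for i, step in enumerate(self.steps, start=1)]


search_skill = Skill(
    name="search_product",
    url_pattern="/shop/*",
    steps=(
        Step("type #search {query}", "Enter '{query}' in the search box."),
        Step("click #submit", "Submit the search form."),
    ),
)
actions = search_skill.grounded(query="usb cable")
hints = search_skill.guided(query="usb cable")
```

Keeping the action template and its guidance in the same record is what lets one skill serve both modes: the same steps can be executed directly or read by the agent as instructions.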

108. ❌ Beyond Uniform Sampling: Synergistic Active Learning and Input Denoising for Robust Neural Operators

Authors: Samrendra Roy, Souvik Chakraborty, Syed Bahauddin Alam Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13316v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

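Scoring rationale: The paper studies the adversarial robustness of neural operators in physics simulation and proposes a defense that combines active learning with input denoising. Its topic is deep learning for scientific computing (specifically solving partial differential equations), which falls under AI for Science, so it is somewhat related to "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). However, the paper does not touch on LLMs, MoE, small models, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning techniques, agents, quantization, inference acceleration, hallucination mitigation, interpretability, world models, model merging, or in-context learning. Those keywords concern LLM technology or specific applications, whereas this paper studies specialized neural operator models for physics simulation, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

To address the vulnerability of neural operators to adversarial attacks in physics simulation, the paper proposes a synergistic defense that combines active-learning-based data generation with an input denoising architecture. On the viscous Burgers' equation benchmark it reduces combined error from 15.42% to 2.04%, and the results suggest that vulnerability patterns are architecture-dependent.

Abstract (Chinese translation)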

神经算子作为物理仿真的快速代理模型已崭露头角，但其对对抗性扰动仍极度脆弱，这对安全关键的数字孪生部署构成重大隐患。本文提出一种协同防御策略，将基于主动学习的数据生成与输入去噪架构相结合。主动学习组件通过差分进化攻击自适应地探测模型弱点，随后在发现的脆弱位置生成针对性训练数据，同时采用自适应平滑比保护机制维持基准精度。输入去噪组件则在算子架构中嵌入可学习的瓶颈结构，在过滤对抗性噪声的同时保留物理相关特征。在粘性伯格斯方程基准测试中，该组合方法实现了2.04%的综合误差（1.21%基准误差+0.83%鲁棒性误差），相较于标准训练（15.42%综合误差）降低了87%，且优于单独使用主动学习（3.42%）或单独使用输入去噪（5.22%）的效果。更广泛而言，本研究结果结合先前工作的跨架构脆弱性分析表明：神经算子的最优训练数据具有架构依赖性——由于不同架构将敏感性集中于不同的输入子空间，均匀采样无法充分覆盖所有模型的脆弱性图谱。这些发现对神经算子在核反应堆监测等安全关键能源系统中的部署具有潜在启示意义。

Abstract

Neural operators have emerged as fast surrogate models for physics simulations, yet they remain acutely vulnerable to adversarial perturbations, a critical liability for safety-critical digital twin deployments. We present a synergistic defense that combines active learning-based data generation with an input denoising architecture. The active learning component adaptively probes model weaknesses using differential evolution attacks, then generates targeted training data at discovered vulnerability locations while an adaptive smooth-ratio safeguard preserves baseline accuracy. The input denoising component augments the operator architecture with a learnable bottleneck that filters adversarial noise while retaining physics-relevant features. On the viscous Burgers’ equation benchmark, the combined approach achieves a 2.04% combined error (1.21% baseline + 0.83% robustness), representing an 87% reduction relative to standard training (15.42% combined) and outperforming both active learning alone (3.42%) and input denoising alone (5.22%). More broadly, our results, combined with cross-architecture vulnerability analysis from prior work, suggest that optimal training data for neural operators is architecture-dependent: because different architectures concentrate sensitivity in distinct input subspaces, uniform sampling cannot adequately cover the vulnerability landscape of all models. These findings have potential implications for the deployment of neural operators in safety-critical energy systems including nuclear reactor monitoring.

Keywords: Neural Operators, Adversarial Robustness, Active Learning, Input Denoising, Physics Simulations, Digital Twins, Burgers’ Equation, Safety-critical Systems
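
The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below assumes, as the abstract's wording suggests, that the combined error is the sum of the baseline and robustness components and that the reported reduction is relative to standard training. The variable names are illustrative.

```python
baseline_error = 1.21      # % baseline component (from the abstract)
robustness_error = 0.83    # % robustness component (from the abstract)
standard_combined = 15.42  # % combined error under standard training

combined = baseline_error + robustness_error
reduction = 1.0 - combined / standard_combined

print(f"combined error: {combined:.2f}%")
print(f"relative reduction: {reduction:.1%}")
```

The sum is 2.04% and the relative reduction is about 86.8%, which rounds to the 87% quoted in the abstract.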

109. ❌ Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus

Authors: John E. Ortega, Rodolfo Zevallos, Fabricio Carraro Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13288v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

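Scoring rationale: The paper focuses on text-to-speech (TTS), specifically for a bilingual Quechua-Spanish legal corpus, using TTS architectures such as XTTS v2, F5-TTS, and DiFlow-TTS. It covers low-resource language processing, cross-lingual transfer, and speech synthesis, whereas the scoring keywords all concern large models, deep learning techniques, or AI-for-science applications, none of which the paper addresses. For example, "Large Language Models", "Mixture of Experts", and "AI for Science" are unrelated to TTS or legal corpora, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper builds a unified pipeline that uses three TTS architectures (XTTS v2, F5-TTS, and DiFlow-TTS) to synthesize high-quality Quechua and Spanish speech for the Peruvian Constitution, mitigates Quechua data scarcity through cross-lingual transfer, and releases trained checkpoints, inference code, and synthesized audio.

Abstract (Chinese translation)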

本文提出了一种统一框架，利用三种先进文本转语音（TTS）架构——XTTS v2、F5-TTS与DiFlow-TTS——为《秘鲁宪法》合成高质量的克丘亚语与西班牙语语音。我们的模型基于规模及录制条件各异的独立西班牙语和克丘亚语语音数据集进行训练，并通过双语及多语言TTS能力提升两种语言的合成质量。该框架借助跨语言迁移技术，在缓解克丘亚语数据稀缺问题的同时，保持了西班牙语语音的自然度。我们公开了训练完成的模型检查点、推理代码以及每条宪法条款的合成音频，为原住民语言及多语言场景下的语音技术提供了可复用的资源。此项研究为推动低资源环境下政治与法律内容的包容性TTS系统发展作出了贡献。

Abstract

We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.

Keywords: text-to-speech, Quechua, Spanish, bilingual TTS, low-resource, cross-lingual transfer, legal corpus, speech synthesis

110. ❌ Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

Authors: Gerasimos Chatzoudis, Konstantinos D. Polyzos, Zhuowei Li, Difei Gu, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13304v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

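Scoring rationale: The paper focuses on the interpretability of Vision Transformers (ViTs), proposing Cross-Layer Transcoders (CLTs) as proxy models for MLP blocks to enable faithful attribution of cross-layer contributions and process-level interpretability. This is highly relevant to "Mechanistic Interpretability OR Explainable AI" (10 points), since the core of the paper is an explainable-AI method. The other keywords mainly concern LLM techniques, training, alignment, reasoning, and agents, whereas this paper studies ViTs in computer vision and does not involve LLMs, MoE, scaling laws, training techniques, reasoning methods, or agent systems, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper studies the interpretability of internal activations in Vision Transformers (ViTs) and proposes Cross-Layer Transcoders (CLTs) as sparse, depth-aware proxy models that decompose the final ViT representation into an additive, layer-resolved construction, enabling faithful attribution while preserving CLIP zero-shot classification accuracy.

Abstract (Chinese translation)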

理解视觉Transformer（ViT）的内部激活对于构建可解释且可信赖的模型至关重要。尽管稀疏自编码器（SAE）已被用于提取人类可解释的特征，但它们仅针对单个层进行操作，未能捕捉Transformer的跨层计算结构，以及各层在形成最终层表示中的相对重要性。为此，我们引入跨层转码器（Cross-Layer Transcoders, CLTs）作为ViT中多层感知机（MLP）模块的可靠、稀疏且具有深度感知能力的代理模型。CLTs采用编码器-解码器方案，从学习到的前序层稀疏嵌入中重建每个MLP后激活，从而产生一种线性分解。该分解将ViT的最终表示从一种不透明的嵌入，转化为一种可加性的、按层解析的构造，实现了忠实归因和过程级的可解释性。我们在CIFAR-100、COCO和ImageNet-100数据集上，针对CLIP ViT-B/32和ViT-B/16模型训练了CLTs。实验表明，CLTs在重建MLP后激活时达到了很高的保真度，同时保持并在某些情况下甚至提升了CLIP的零样本分类准确率。在可解释性方面，我们证明跨层贡献分数提供了忠实的归因，揭示了最终表示集中于一小部分占主导地位的逐层项中：移除这些项会降低模型性能，而保留它们则能基本维持性能。这些结果彰显了在视觉领域采用CLTs作为ViT的一种替代性可解释代理模型的重要意义。

Abstract

Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.

Keywords: Vision Transformers, Interpretability, Cross-Layer Transcoders, Sparse Autoencoders, Layer-wise Attribution, CLIP, Proxy Models, Activation Reconstruction
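
The additive, layer-resolved view of the final representation can be illustrated with a toy example. The abstract does not define the cross-layer contribution score, so the sketch below assumes one plausible choice: each layer term's projection onto the final representation, normalized so the scores sum to 1. The function names and vectors are hypothetical.

```python
def contribution_scores(layer_terms):
    """Share of the final representation attributed to each layer term.

    Assumes the final representation is the plain sum of the layer terms.
    """
    final = [sum(column) for column in zip(*layer_terms)]
    norm_sq = sum(v * v for v in final)
    return [sum(t * f for t, f in zip(term, final)) / norm_sq
            for term in layer_terms]


def keep_top_k(layer_terms, scores, k):
    """Rebuild the representation from the k highest-scoring terms only."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    dim = len(layer_terms[0])
    return [sum(layer_terms[i][d] for i in top) for d in range(dim)]


# Three made-up layer-wise terms in a 2-dimensional embedding space.
terms = [[2.0, 0.0], [0.0, 1.0], [0.1, -0.1]]
scores = contribution_scores(terms)
approx = keep_top_k(terms, scores, k=2)
```

In this toy case the two dominant terms reconstruct the representation almost exactly, which mirrors the abstract's observation that retaining a small set of dominant layer-wise terms largely preserves performance.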

111. ❌ Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

Authors: Zipeng Ling, Shuliang Liu, Shenghong Fu, Yuehao Tang, Seonil Son, Yao Wan, Xuming Hu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14121v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

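Scoring rationale: The paper's core is improving the LLM reasoning process. It directly addresses Chain-of-Thought (CoT) reasoning (10 points), building a reasoning knowledge graph and using topological generation to improve reasoning quality, which is highly relevant to System 2 Thinking and in-depth reasoning (8 points) and to self-correction and self-improvement (8 points). It also touches on hallucination mitigation (8 points) and explainable AI (5 points). The paper explicitly uses LLMs (10 points) but does not involve other keyword topics such as MoE, SFT, or RAG.

!!! tip deepseek-chat TL;DR

To address step-internal and step-wise flaws in LLM reasoning traces, the paper proposes CRAFT, a framework that builds a Reasoning Knowledge Graph from the consensus parts of candidate traces and synthesizes a high-quality trace through topological generation. It improves label-prediction accuracy by 10+% on average and outperforms all baselines on logical and mathematical reasoning benchmarks.

Abstract (Chinese translation)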

大语言模型的推理轨迹存在复杂的缺陷——包括步骤内部缺陷（逻辑错误、幻觉等）和步骤间缺陷（过度思考、思考不足），这些缺陷因样本而异。一种自然的解决思路是提供真实标签来引导大语言模型的推理。但与直觉相反，我们发现这并未提升模型的推理能力。为此，我们提出了CRAFT，一个能够同时缓解两类步骤缺陷的统一框架。该框架基于多个候选推理轨迹的共识部分构建推理知识图谱（Reasoning Knowledge Graph, RKG），并通过拓扑生成合成高质量的推理轨迹。我们的方法在标签预测准确率上平均提升了10%以上，并在逻辑推理和数学推理基准测试中持续优于所有基线模型。此外，详细的基准评估证明，我们的方法还在多个维度上提升了大语言模型推理轨迹的质量。

Abstract

LLM reasoning traces suffer from complex flaws – Step Internal Flaws (logical errors, hallucinations, etc.) and Step-wise Flaws (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs’ reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the consensus parts of multiple candidate traces, and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by 10+% on average, and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation proves that our method also improves the quality of LLMs’ reasoning traces in multiple dimensions.

Keywords: Chain-of-Thought, Reasoning Knowledge Graph, LLM reasoning, Step flaws, Topological generation, Logical reasoning, Mathematical reasoning, CRAFT framework
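
The combination of a consensus graph and topological generation can be sketched in a highly simplified form. The code below is an assumption-laden illustration, not the CRAFT algorithm: steps are matched by exact string equality, a step counts as consensus when at least `min_support` candidate traces contain it, and the output is a topological order of the surviving steps. It uses the standard-library `graphlib` module (Python 3.9+).

```python
from collections import Counter
from graphlib import TopologicalSorter


def consensus_trace(traces, min_support=2):
    """Order the steps shared by several candidate traces."""
    # Count in how many traces each step occurs (once per trace).
    support = Counter(step for trace in traces for step in set(trace))
    keep = {step for step, count in support.items() if count >= min_support}

    # Map each kept step to the set of kept steps that directly precede it.
    graph = {step: set() for step in keep}
    for trace in traces:
        kept = [step for step in trace if step in keep]
        for prev, nxt in zip(kept, kept[1:]):
            if prev != nxt:
                graph[nxt].add(prev)

    # Raises graphlib.CycleError if traces disagree on the order of steps.
    return list(TopologicalSorter(graph).static_order())


candidates = [
    ["parse", "set up equation", "solve", "answer"],
    ["parse", "guess", "set up equation", "solve", "answer"],
    ["parse", "set up equation", "double-check units", "solve", "answer"],
]
trace = consensus_trace(candidates)
```

Here the one-off steps ("guess", "double-check units") are dropped and the shared steps come out in their common order. A real system would need semantic matching of steps and a policy for resolving order conflicts, neither of which this sketch attempts.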

112. ❌ SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Authors: Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14144v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

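Scoring rationale: The paper focuses on a self-evolving framework for 3D spatial reasoning, centered on a Deterministic Geometric Environment (DGE) and the co-evolution of a shared-parameter policy. It is unrelated to the vast majority of the keywords, which mainly concern LLM principles, training methods, optimization techniques, inference acceleration, alignment, and agent systems. The paper does not mention LLMs, MoE, scaling laws, pre-training or post-training, alignment techniques, PEFT, RAG, context extension, attention optimization, chain of thought, System 2 thinking, MCTS, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning. It is highly relevant only to "Self-Correction OR Self-Improvement OR Self-Reflection" (10 points): the core of the paper is a self-evolving framework in which the DGE supplies objective physical feedback so that the model improves itself and corrects its geometric errors, which maps directly onto self-correction and self-improvement. Although the paper applies AI to 3D scenes, it does not target scientific subfields such as bioinformatics or cheminformatics, so "AI for Science" scores 0.

!!! tip deepseek-chat TL;DR

To address the high cost of geometric annotation in 3D spatial reasoning, the paper proposes SpatialEvo, a self-evolving framework built on a Deterministic Geometric Environment (DGE) that replaces model consensus with objective physical feedback from zero-noise oracles. Across nine benchmarks it achieves the highest average score at both 3B and 7B scales.

Abstract (Chinese translation)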

三维场景的空间推理是具身智能的核心能力，但持续的模型改进始终受限于几何标注的成本。自演进范式提供了一条有前景的路径，但其依赖模型共识来构建伪标签的做法，会导致训练过程强化而非纠正模型自身的几何错误。我们识别出三维空间推理所独有的一种特性，可以规避这一局限：真实标注是底层几何结构的确定性结果，可直接从点云和相机位姿精确计算得出，无需任何模型参与。基于这一洞见，我们提出了SpatialEvo，一个用于三维空间推理的自演进框架，其核心是确定性几何环境（Deterministic Geometric Environment, DGE）。DGE将16类空间推理任务形式化于明确的几何验证规则之下，并将未标注的三维场景转化为零噪声的交互式验证器，用客观的物理反馈取代了模型共识。一个共享参数的策略模型在DGE约束下，通过提问者与求解者双重角色协同演进：提问者基于场景观察生成物理上有效的空间问题，而求解者则依据DGE验证的真实标注推导精确答案。一个任务自适应调度器内生地将训练集中在模型最薄弱的任务类别上，无需人工设计即可生成动态课程。在九个基准测试上的实验表明，SpatialEvo在3B和7B规模下均取得了最高的平均分数，在空间推理基准上获得一致提升，且在通用视觉理解能力上未出现退化。

Abstract

Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model’s own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model’s weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.

Keywords: Spatial Reasoning, 3D Scenes, Self-Evolving, Deterministic Geometric Environment, Geometric Validation, Co-evolution, Task-adaptive Scheduler, Embodied Intelligence
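
The key observation in the abstract, that ground truth follows deterministically from the geometry, can be shown with one toy task. The sketch below assumes a hypothetical "distance between two objects" question whose answer is the distance between the centroids of two point clouds; the function names and the 5% tolerance are illustrative choices and not part of the DGE specification.

```python
import math


def centroid(points):
    """Mean position of a point cloud given as (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def verify_distance(cloud_a, cloud_b, answer, rel_tol=0.05):
    """Check a solver's answer against the geometrically computed truth."""
    truth = math.dist(centroid(cloud_a), centroid(cloud_b))
    return math.isclose(answer, truth, rel_tol=rel_tol), truth


# Two tiny made-up point clouds (units are arbitrary).
chair = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
table = [(1.0, 3.0, 0.0), (1.0, 5.0, 0.0)]

ok_close, truth = verify_distance(chair, table, answer=4.1)
ok_far, _ = verify_distance(chair, table, answer=5.0)
```

Because the check involves no model, a wrong answer cannot be reinforced by agreement among models, which is the failure mode of consensus-based pseudo-labels that the abstract describes.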

113. ❌ From Weights to Activations: Is Steering the Next Frontier of Adaptation?

Authors: Simon Ostermann, Daniil Gurgurov, Tanja Baeumel, Michael A. Hedderich, Sebastian Lapuschkin, Wojciech Samek, Vera Schmitt Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14090v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	5.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

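Scoring rationale: The paper studies post-training adaptation of language models, in particular steering, which influences model behavior by modifying internal activations, and contrasts it with conventional parameter-update methods such as fine-tuning. It is therefore highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (10), because the paper directly discusses and compares these methods; highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10), because the study concerns language models; and partially relevant to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (5), because parameter-efficient adaptation is named as one of the comparison methods. Other keywords such as MoE, SLMs, Scaling Laws, RAG, and RLHF are neither mentioned in nor related to the abstract, so they score 0.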

!!! tip deepseek-chat TL;DR

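The paper argues that steering (influencing model behavior by modifying internal activations) should be treated as a model adaptation paradigm in its own right. It compares steering with conventional methods such as fine-tuning using a set of functional criteria, and motivates a unified taxonomy of model adaptation.

Abstract translation (Chinese)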

语言模型的训练后适配通常通过参数更新或基于输入的方法实现，例如微调、参数高效适配和提示。与此同时，越来越多的研究通过在推理时修改内部激活来影响模型行为，这种方法被称为引导。尽管引导的应用日益广泛，但很少在既有适配方法的同一概念框架内对其进行分析。
本文主张引导应被视为一种模型适配形式。我们提出了一套适配方法的功能性标准，并以此比较引导方法与经典替代方案。该分析将引导定位为一种基于激活空间定向干预的独特适配范式，它能够在无需参数更新的情况下实现局部且可逆的行为改变。由此形成的框架阐明了引导与现有方法的关系，推动建立模型适配的统一分类体系。

摘要 (Abstract)

Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.

Keywords: steering, model adaptation, post-training adaptation, internal activations, fine-tuning, parameter-efficient adaptation, activation space, unified taxonomy
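
The abstract describes steering as a targeted intervention in activation space that is local, reversible, and needs no parameter updates. A minimal sketch of that idea follows, assuming a toy two-dimensional layer and a hypothetical steering direction; the names `WEIGHTS`, `layer`, and `steer` are illustrative and do not come from the paper.

```python
# Toy layer with frozen weights; steering never touches them.
WEIGHTS = [[0.5, -1.0], [2.0, 0.25]]

def layer(x):
    """Plain matrix-vector product standing in for one model layer."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]

def steer(hidden, direction, alpha):
    """Return a shifted copy of `hidden`: hidden + alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

hidden = layer([1.0, 2.0])                   # unsteered activation
direction = [1.0, 0.0]                       # hypothetical steering direction (assumption)
steered = steer(hidden, direction, 3.0)      # intervention applied at inference time
restored = steer(steered, direction, -3.0)   # undo by removing the same shift
```

Because the intervention acts only on the activation, dropping it restores the original behavior, whereas fine-tuning would have changed `WEIGHTS` itself.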

114. ❌ From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

Authors: Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14053v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

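Scoring rationale: The paper studies tokenization for LLMs, specifically the efficiency of code tokenizers, so it is highly relevant to 'Large Language Models' (10). It notes that tokenizer quality affects hallucination risk, which gives it partial relevance to 'Hallucination Mitigation' (5). Other keywords such as MoE, SFT, and RAG are not addressed and score 0.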

!!! tip deepseek-chat TL;DR

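The paper studies under-trained tokens in code tokenizers, which arise from imbalanced repository and language diversity in the training data and from source-specific, repetitive tokens. It proposes Source-Attributed BPE (SA-BPE), which regularizes BPE training and substantially reduces the number of under-trained tokens while keeping the same inference procedure as regular BPE.

Abstract translation (Chinese)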

大型语言模型（LLM）的效率与安全性等因素均依赖于分词的质量。优质的分词器不仅能提升推理速度与语言理解能力，还能额外增强对越狱攻击的防御力，并降低幻觉风险。本研究从数据源多样性的角度，深入探讨了代码分词的效率问题。我们发现，由于训练数据中代码库与语言多样性的不平衡，以及大量特定于数据源、重复且在未来推理中往往无法使用的词汇占主导地位，代码分词器容易产生未被使用因而训练不足的词汇单元。通过改进字节对编码（BPE）的目标函数并引入合并跳过机制，我们在源属性BPE（SA-BPE）框架下实现了多种技术，以正则化BPE训练并最小化过拟合现象。该方法在保持与常规BPE相同推理流程的同时，显著减少了训练不足的词汇单元数量，从而为实际生产应用提供了一种高效工具。

摘要 (Abstract)

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to the imbalance in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are often unusable in future inference. By modifying the BPE objective and introducing merge skipping, we implement different techniques under the name Source-Attributed BPE (SA-BPE) to regularize BPE training and minimize overfitting, thereby substantially reducing the number of under-trained tokens while maintaining the same inference procedure as with regular BPE. This provides an effective tool suitable for production use.

Keywords: Large Language Models, tokenization, code tokenizer, BPE, source attribution, regularization, under-trained tokens, inference efficiency
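
The abstract mentions merge skipping driven by source attribution but does not spell out the modified objective. The sketch below shows one possible reading under a loudly stated assumption: a candidate merge is skipped unless its pair occurs in at least `min_sources` distinct sources. The function `best_merge`, the threshold, and the toy corpus are hypothetical and are not the paper's algorithm.

```python
from collections import Counter, defaultdict

def best_merge(corpus_by_source, min_sources=2):
    """Pick the most frequent adjacent pair seen in at least `min_sources` sources."""
    freq = Counter()
    sources = defaultdict(set)
    for source, words in corpus_by_source.items():
        for word in words:
            for pair in zip(word, word[1:]):
                freq[pair] += 1
                sources[pair].add(source)
    # Walk candidates from most to least frequent, skipping source-specific pairs.
    for pair, _ in freq.most_common():
        if len(sources[pair]) >= min_sources:
            return pair
    return None

# Toy corpus: words are tuples of symbols, grouped by hypothetical repository.
corpus = {
    "repo_a": [("q", "x"), ("q", "x"), ("q", "x"), ("i", "n")],
    "repo_b": [("i", "n"), ("i", "f")],
}
plain_choice = best_merge(corpus, min_sources=1)       # frequency only, like regular BPE
attributed_choice = best_merge(corpus, min_sources=2)  # skips the repo-specific pair
```

With `min_sources=1` the rule reduces to ordinary frequency-based selection, so the pair that is frequent in only one repository wins; raising the threshold passes over it in favor of a pair shared across sources.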

115. ❌ $π$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

Authors: Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14054v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

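Scoring rationale: The paper "π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data" studies a multi-agent self-play framework that improves the training efficiency of deep search agents through privileged self-distillation. It is judged unrelated to most keywords because it does not involve large language models (LLMs), model architectures (e.g., MoE, SLMs), training techniques (e.g., pre-training, fine-tuning, alignment, RLHF, PEFT), inference optimization (e.g., RAG, attention mechanisms, quantization), reasoning methods (e.g., chain of thought, System 2 thinking, MCTS), model interpretability, world models, or AI-for-science applications. The only relevant keyword is 'Multi-agent Systems OR Agent Coordination' (10), because the core of the paper is a multi-agent self-evolution framework (π-Play) that coordinates examiner, teacher, and student agents to address sparse rewards and credit assignment. 'Self-Correction OR Self-Improvement OR Self-Reflection' scores 0: the paper does involve self-improvement, but it centers on self-distillation and self-play rather than typical self-correction or reflection mechanisms.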

!!! tip deepseek-chat TL;DR

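The paper proposes π-Play, a multi-agent self-play framework that uses the question construction path produced during task generation as privileged information for self-distillation. This turns sparse-reward self-play into a dense-feedback self-evolution loop; without external data, it surpasses fully supervised search agents and improves evolutionary efficiency by 2-3× over conventional self-play.

Abstract translation (Chinese)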

深度搜索智能体已成为解决复杂信息检索任务的一种前景广阔的研究范式，但其训练过程仍面临奖励稀疏、信用分配困难以及标注数据有限等挑战。自我博弈为降低数据依赖性提供了一条可扩展的路径，但传统自我博弈仅通过稀疏的结果奖励来优化学生智能体，导致学习效率低下。本研究发现，自我博弈在任务生成过程中自然产生一种问题构建路径，这是一种捕捉逆向求解过程的中间产物。这揭示了一种新的特权信息来源用于自蒸馏：自我博弈本身能够以低成本、可扩展的方式为教师模型提供高质量的特权上下文，而无需依赖人工反馈或精心构建的特权信息。基于这一洞见，我们提出特权信息自我博弈框架。在该框架中，考官生成任务及其对应的问题构建路径，教师模型则利用问题构建路径作为特权上下文，通过自蒸馏对学生模型进行密集监督。这一设计将传统的稀疏奖励自我博弈转变为密集反馈的自我进化循环。大量实验表明，无需外部数据的$π$-Play框架性能超越全监督搜索智能体，并将进化效率较传统自我博弈提升2-3倍。

摘要 (Abstract)

Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($π$-Play), a multi-agent self-evolution framework. In $π$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $π$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.

Keywords: multi-agent systems, self-play, privileged distillation, deep search agents, sparse rewards, self-evolution, question construction path, data-free training
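
The contrast between a sparse outcome reward and dense supervision can be illustrated with a per-step distillation loss. This is a sketch under assumptions: the abstract does not state the loss, so a mean per-step KL divergence from teacher to student is used here, and the probability tables are made up for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dense_distill_loss(teacher_steps, student_steps):
    """Mean per-step KL(teacher || student): one training signal per step."""
    pairs = list(zip(teacher_steps, student_steps))
    return sum(kl(t, s) for t, s in pairs) / len(pairs)

# Hypothetical next-action distributions at two steps of one trajectory.
# The teacher is assumed to have seen the question construction path (QCP);
# the student has not.
teacher = [[0.9, 0.1], [0.2, 0.8]]
student = [[0.5, 0.5], [0.5, 0.5]]

loss = dense_distill_loss(teacher, student)
matched = dense_distill_loss(teacher, teacher)
```

A sparse outcome reward would give this trajectory a single scalar at the end; the loss above instead provides feedback at every step, which is the sense in which the loop becomes dense-feedback.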

116. ❌ Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model

Authors: Zhe Huang, Peng Wang, Yan Zheng, Sen Song, Longjun Cai Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14030v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

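Scoring rationale: The paper's core contribution is a product bundling recommendation method that combines interactive graph learning with LLM-based semantic understanding; the LLM is a core component (the Dynamic Concept Binding Mechanism converts graph structures into natural-language prompts). Only 'Large Language Models OR LLMs OR Foundation Models' is therefore highly relevant (10), because the paper explicitly uses an LLM for semantic understanding. Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in nor covered by the abstract and score 0. The paper is applied LLM research in e-commerce, which falls under 'applications of large models in different domains' in the research background.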

!!! tip deepseek-chat TL;DR

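The paper targets two challenges in product bundling recommendation: the cold-start problem and the inability of LLMs to model interactive graphs directly. It proposes a dual-enhancement method that uses a Dynamic Concept Binding Mechanism to translate graph structures into natural-language prompts and combines this with LLM semantic understanding, reporting 6.3%-26.5% improvements over state-of-the-art baselines on three benchmarks.

Abstract translation (Chinese)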

产品捆绑通过推荐互补商品组合提升电子商务收入。然而现有方法面临两大关键挑战：(1) 协同过滤方法因依赖历史交互数据而难以处理冷启动商品；(2) 大语言模型(LLM)缺乏直接建模交互图的内在能力。为弥补这一鸿沟，我们提出一种融合交互图学习与基于LLM语义理解的双增强产品捆绑方法。该方法引入图到文本范式，利用动态概念绑定机制(DCBM)将图结构转化为自然语言提示。DCBM在实现领域特定实体与LLM分词对齐方面发挥关键作用，使其能有效理解组合约束关系。在三个基准数据集(POG、POG_dense、Steam)上的实验表明，本方法相较最先进基线模型获得6.3%-26.5%的性能提升。

摘要 (Abstract)

Product bundling boosts e-commerce revenue by recommending complementary item combinations. However, existing methods face two critical challenges: (1) collaborative filtering approaches struggle with cold-start items owing to dependency on historical interactions, and (2) LLMs lack inherent capability to model interactive graph directly. To bridge this gap, we propose a dual-enhancement method that integrates interactive graph learning and LLM-based semantic understanding for product bundling. Our method introduces a graph-to-text paradigm, which leverages a Dynamic Concept Binding Mechanism (DCBM) to translate graph structures into natural language prompts. The DCBM plays a critical role in aligning domain-specific entities with LLM tokenization, enabling effective comprehension of combinatorial constraints. Experiments on three benchmarks (POG, POG_dense, Steam) demonstrate 6.3%-26.5% improvements over state-of-the-art baselines.

Keywords: Product Bundling, Interactive Graph Learning, Large Language Models, Graph-to-Text, Dynamic Concept Binding Mechanism, Cold-start Problem, E-commerce Recommendation, Semantic Understanding

117. ❌ Parameter Importance is Not Static: Evolving Parameter Isolation for Supervised Fine-Tuning

Authors: Zekai Lin, Chao Xue, Di Liang, Xingsheng Han, Peiyang Liu, Xianjie Wu, Lei Jiang, Yu Lu, Haibo Shi, Shuang Liang, Minlong Peng Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14010v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	15.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

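Scoring rationale: The paper studies parameter isolation in supervised fine-tuning (SFT). It is highly relevant to 'Post-training OR Supervised Fine-tuning OR SFT' (15), since it directly addresses task interference and catastrophic forgetting in SFT. It is relevant to 'Large Language Models OR LLMs OR Foundation Models' (10), because it studies fine-tuning of large language models, and to 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10), because the EPI framework is classed as a parameter-efficient fine-tuning method that improves efficiency by dynamically isolating parameters. Other keywords have no direct connection to the paper and score 0.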

!!! tip deepseek-chat TL;DR

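The paper addresses the drift of parameter importance over the course of supervised fine-tuning. It proposes Evolving Parameter Isolation (EPI), a framework that periodically updates isolation masks to reduce task interference and catastrophic forgetting, and that improves generalization in multi-task learning.

Abstract translation (Chinese)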

大型语言模型的监督微调常面临任务干扰与灾难性遗忘问题。近期研究通过隔离训练过程中的任务关键参数来缓解这一现象。然而，这些方法为动态问题提供了静态解决方案，其假设参数重要性一经确定即保持不变。本研究通过实证表明，参数重要性在训练过程中会随时间发生漂移。为此，我们提出动态参数隔离框架，该微调框架基于参数重要性的在线评估自适应调整隔离策略。EPI并非冻结固定参数子集，而是利用基于梯度的信号周期性更新隔离掩码，使模型能够保护新出现的任务关键参数，同时释放过时参数以恢复模型可塑性。在多任务基准测试上的实验表明，相较于静态隔离方法与标准微调，EPI能持续减少任务干扰与遗忘现象，并提升整体泛化能力。我们的分析进一步揭示了隔离机制与多样化能力学习动态演变过程保持同步的必要性。

摘要 (Abstract)

Supervised Fine-Tuning (SFT) of large language models often suffers from task interference and catastrophic forgetting. Recent approaches alleviate this issue by isolating task-critical parameters during training. However, these methods represent a static solution to a dynamic problem, assuming that parameter importance remains fixed once identified. In this work, we empirically demonstrate that parameter importance exhibits temporal drift over the course of training. To address this, we propose Evolving Parameter Isolation (EPI), a fine-tuning framework that adapts isolation decisions based on online estimates of parameter importance. Instead of freezing a fixed subset of parameters, EPI periodically updates isolation masks using gradient-based signals, enabling the model to protect emerging task-critical parameters while releasing outdated ones to recover plasticity. Experiments on diverse multi-task benchmarks demonstrate that EPI consistently reduces interference and forgetting compared to static isolation and standard fine-tuning, while improving overall generalization. Our analysis highlights the necessity of synchronizing isolation mechanisms with the evolving dynamics of learning diverse abilities.

Keywords: Supervised Fine-Tuning, Parameter Isolation, Task Interference, Catastrophic Forgetting, Evolving Parameter Importance, Multi-task Learning, Gradient-based Signals, Generalization Improvement
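
The difference between a static isolation mask and a periodically refreshed one can be shown on a two-parameter toy problem. The sketch below is not the paper's implementation: it assumes importance is an exponential moving average of the absolute gradient and that the top fraction of parameters by importance is protected. The names `update_mask`, `train`, and `toy_grads` and every hyperparameter are hypothetical.

```python
def update_mask(importance, protect_fraction):
    """Mark the top `protect_fraction` of parameters by importance as frozen (True)."""
    k = int(len(importance) * protect_fraction)
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    protected = set(ranked[:k])
    return [i in protected for i in range(len(importance))]

def train(params, grad_fn, steps, lr=0.1, refresh_every=2,
          protect_fraction=0.5, decay=0.5):
    importance = [0.0] * len(params)
    mask = [False] * len(params)
    history = []
    for step in range(steps):
        grads = grad_fn(params, step)
        # Online importance estimate (assumption: EMA of |gradient|).
        importance = [decay * s + (1 - decay) * abs(g)
                      for s, g in zip(importance, grads)]
        # Periodic refresh: protect newly important parameters, release stale ones.
        if step % refresh_every == 0:
            mask = update_mask(importance, protect_fraction)
        params = [p if frozen else p - lr * g
                  for p, g, frozen in zip(params, grads, mask)]
        history.append(list(mask))
    return params, history

def toy_grads(params, step):
    """Gradient signal that shifts from parameter 0 to parameter 1 after step 2."""
    return [1.0, 0.1] if step < 2 else [0.1, 1.0]

final_params, masks = train([0.0, 0.0], toy_grads, steps=6)
```

In this run the mask protects parameter 0 during the first two steps and switches to parameter 1 once the gradient signal moves; a static method would have kept the initial mask for the whole run.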

118. ❌ Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

Authors: Sasha Boguraev, Kyle Mahowald Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13950v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

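Scoring rationale: The paper uses Transformer language models to study syntactic islands and analyzes internal model mechanisms through causal interventions. It is highly relevant to 'Large Language Models' (the core object of study) and to 'Mechanistic Interpretability' (the core methodology). The remaining keywords concern model architectures, training methods, application domains, and similar topics that the paper does not address, so they score 0.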

!!! tip deepseek-chat TL;DR

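The paper uses causal interventions to analyze how Transformer language models represent English syntactic islands. It finds that the models replicate gradient human judgments of extraction acceptability, and it hypothesizes that 'and' is represented differently in extractable versus non-extractable constructions.

Abstract translation (Chinese)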

我们通过聚焦句法理论中长期存在的挑战——句法孤岛现象，展示了在Transformer模型中进行因果干预如何为英语句法提供洞见。从并列动词短语中提取成分通常可接受性较低，但其可接受度会随词汇内容呈梯度变化（例如，“I know what he hates art and loves”与“I know what he looked down and saw”的对比）。研究表明，现代Transformer语言模型能够复现人类对这种梯度变化的判断。通过采用因果干预技术，隔离Transformer模块中功能相关的子空间（包括注意力机制和多层感知机），我们证明从并列结构孤岛中提取成分与标准wh-依存关系使用相同的填充语-空位机制，但这些机制会遭受不同程度的选择性阻断。通过将大规模无关文本语料投射到这些经因果识别的子空间上，我们提出了新的语言学假说：连词“and”在可提取结构与不可提取结构中具有不同的表征方式，分别对应编码关系依存性的表达与纯粹并列用法的表达。这些结果揭示了机制可解释性如何为句法研究提供信息，并催生出关于语言表征与加工过程的新假设。

摘要 (Abstract)

We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., “I know what he hates art and loves” vs. “I know what he looked down and saw”). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction “and” is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.

Keywords: Transformer language models, syntactic islands, causal interventions, mechanistic interpretability, filler-gap mechanisms, gradient acceptability, linguistic representation
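
A causal intervention on a subspace can be pictured as swapping one component of an activation while leaving the rest untouched. The sketch below shows a generic interchange intervention on a one-dimensional subspace, assuming a unit-length `direction`; it illustrates the general technique and is not the specific procedure used in the paper.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def interchange(base, source, direction):
    """Replace base's component along `direction` (a unit vector) with source's."""
    shift = dot(source, direction) - dot(base, direction)
    return [b + shift * d for b, d in zip(base, direction)]

base = [1.0, 2.0]        # activation from the sentence under study (toy values)
source = [5.0, -3.0]     # activation from a contrasting sentence (toy values)
direction = [1.0, 0.0]   # hypothetical causally identified direction

patched = interchange(base, source, direction)
```

If the model's output on the patched activation changes the way the source sentence would predict, that is evidence that the subspace carries the relevant information; the component orthogonal to `direction` is deliberately left as it was in `base`.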

119. ❌ CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation

Authors: Duy Tung Doan, Quang Huy Phung, Dzung Nguyen, Khac-Hoai Nam Bui Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13946v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

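Scoring rationale: The paper proposes the CollabCoder framework, which generates code through multi-agent collaboration; its core is collaborative decision-making between the dynamic planning module and the code module. It is relevant to LLMs (8), because code generation is typically LLM-based; to Chain of Thought and System 2 Thinking (5 each), because it involves multi-step reasoning and deliberate decision-making; highly relevant to Self-Correction (8), because the framework includes debugging and self-improvement; highly relevant to LLM Agents and Multi-agent Systems (10 each), because its core is a multi-agent collaboration framework; and relevant to Tool Use (8), because it involves optimizing API calls. Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper and score 0.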

!!! tip deepseek-chat TL;DR

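The paper proposes CollabCoder, a framework in which plan and code co-evolve through dynamic multi-agent collaboration, to address the static planning, high computational overhead, and limited adaptability of conventional code generation methods. Experiments show that it improves code quality and robustness while reducing API calls and improving efficiency.

Abstract translation (Chinese)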

自动化代码生成在软件工程中始终是一项持续挑战，因为传统的多智能体框架常受限于静态规划、孤立执行、高计算开销以及对复杂任务的有限适应性。本文提出CollabCoder，一种新颖的“规划-代码协同演化”框架，通过动态多智能体协作来改进代码生成。其核心思想是设计规划模块与代码模块之间的协同决策过程，以决定调试过程中应由哪个模块执行。在广泛使用的基准测试上进行的大量实验表明，CollabCoder在不同任务中持续提升了代码质量与鲁棒性。值得注意的是，CollabCoder在降低计算开销的同时，取得了与当前最先进方法相当或更优的性能，且随着基准测试难度增加，其效率优势更为显著。在更具挑战性的LiveCodeBench和xCodeEval基准测试上，本方法相较于强基线模型性能提升了11-20%，同时每次执行平均减少4-10次API调用。

摘要 (Abstract)

Automated code generation remains a persistent challenge in software engineering, as conventional multi-agent frameworks are often constrained by static planning, isolated execution, high computational overhead, and limited adaptability to complex tasks. This paper introduces CollabCoder, a novel Plan-Code Co-Evolution framework that improves code generation through dynamic multi-agent collaboration. The core idea is to design a collaborative decision-making process between the plan module and the code module to decide which module should be executed for the debugging process. Extensive experiments on widely used benchmarks demonstrate that CollabCoder consistently improves code quality and robustness across tasks. Importantly, CollabCoder achieves performance comparable to or exceeding current state-of-the-art methods while reducing computational overhead, with efficiency gains becoming more pronounced as benchmark difficulty increases. On the more challenging LiveCodeBench and xCodeEval benchmarks, our approach improves performance by 11-20% over strong baselines while reducing the number of API calls by an average of 4-10 per execution.

Keywords: code generation, multi-agent collaboration, plan-code co-evolution, dynamic decision-making, computational efficiency, API call reduction, debugging process, software engineering

120. ❌ Beyond Static Personas: Situational Personality Steering for Large Language Models

Authors: Zesheng Wei, Mengxiang Li, Zilei Wang, Yang Deng Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13846v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

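Scoring rationale: The paper studies situational personality steering for personalized large language models (LLMs). Its core contribution is the IRIS (Identify-Retrieve-Steer) framework, which achieves training-free situational personality control by analyzing persona neurons. It is highly relevant to 'Large Language Models' (10), because it focuses on personalization of LLMs, and partially relevant to 'Mechanistic Interpretability' (5), because its analysis of persona neurons falls within model interpretability. Other keywords such as MoE, SFT, RAG, and quantization are neither mentioned in nor related to the abstract and score 0.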

!!! tip deepseek-chat TL;DR

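The paper addresses the limits of static personality modeling in existing personalized LLMs. It proposes IRIS, a training-free framework that identifies situational persona neurons, retrieves them according to the situation, and applies similarity-weighted steering, enabling dynamic control of LLM personality in complex situations and outperforming the best baselines on benchmarks.

Abstract translation (Chinese)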

个性化大型语言模型（LLM）在以人为中心的应用中促进了更自然、类人的交互。然而，现有的个性化方法受限于可控性不足与资源需求高的问题。此外，其对静态人格建模的依赖限制了其在不同情境下的适应性。为应对这些局限，我们首先通过对人格神经元的多视角分析，证明了LLM人格中存在情境依赖性以及一致的情境-行为模式。基于这些发现，我们提出了IRIS，一种无需训练、基于神经元的识别-检索-引导框架，用于实现高级的情境人格引导。我们的方法包括情境人格神经元识别、情境感知神经元检索以及相似性加权引导。我们在PersonalityBench和我们新引入的SPBench（一个全面的情境人格基准测试集）上对我们的框架进行了实证验证。实验结果表明，我们的方法超越了性能最佳的基线模型，证明了IRIS对于复杂、未见过的情境以及不同模型架构的泛化能力和鲁棒性。

摘要 (Abstract)

Personalized Large Language Models (LLMs) facilitate more natural, human-like interactions in human-centric applications. However, existing personalization methods are constrained by limited controllability and high resource demands. Furthermore, their reliance on static personality modeling restricts adaptability across varying situations. To address these limitations, we first demonstrate the existence of situation-dependency and consistent situation-behavior patterns within LLM personalities through a multi-perspective analysis of persona neurons. Building on these insights, we propose IRIS, a training-free, neuron-based Identify-Retrieve-Steer framework for advanced situational personality steering. Our approach comprises situational persona neuron identification, situation-aware neuron retrieval, and similarity-weighted steering. We empirically validate our framework on PersonalityBench and our newly introduced SPBench, a comprehensive situational personality benchmark. Experimental results show that our method surpasses best-performing baselines, demonstrating IRIS’s generalization and robustness to complex, unseen situations and different model architectures.

Keywords: Personalized Large Language Models, Situational Personality Steering, Persona Neurons, Training-free Framework, IRIS, Identify-Retrieve-Steer, SPBench, Generalization
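
The retrieve-then-steer step named in the abstract can be pictured as blending stored steering vectors by how similar their situations are to the current one. The sketch below is an assumption-laden illustration, not IRIS itself: the bank of (situation embedding, steering vector) pairs, the cosine similarity, the clamping of negative scores, and the `top_k` cutoff are all hypothetical choices.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def weighted_steering(query, bank, top_k=2):
    """Blend the steering vectors of the top_k most similar stored situations."""
    # Negative similarities are clamped to zero so dissimilar situations add nothing.
    scored = [(max(0.0, cosine(query, key)), vec) for key, vec in bank]
    scored.sort(key=lambda item: item[0], reverse=True)
    scored = scored[:top_k]
    total = sum(score for score, _ in scored)
    dim = len(bank[0][1])
    if total == 0.0:
        return [0.0] * dim
    return [sum(score * vec[i] for score, vec in scored) / total for i in range(dim)]

# Hypothetical bank: (situation embedding, steering vector) pairs.
bank = [
    ([1.0, 0.0], [2.0, 0.0]),
    ([0.0, 1.0], [0.0, 4.0]),
    ([-1.0, 0.0], [9.0, 9.0]),
]
blend = weighted_steering([1.0, 1.0], bank)
```

Here the query is equally similar to the first two stored situations and opposed to the third, so the result is the average of the first two steering vectors and the third contributes nothing.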

121. ❌ Robust Reward Modeling for Large Language Models via Causal Decomposition

Authors: Yunsheng Lu, Zijiang Yang, Licheng Pan, Zhixuan Chu Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13833v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

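Scoring rationale: The paper focuses on improving reward models for LLM alignment and centers on LLM alignment and RLHF techniques. It explicitly studies the role of reward models in LLM alignment, which falls within RLHF/DPO, so it is highly relevant to 'Large Language Models', 'Instruction Tuning/Alignment', and 'RLHF/RLAIF/DPO' (10 points each). Other keywords such as MoE, SLMs, Scaling Laws, RAG, CoT, Agents, and Quantization are not mentioned or touched on in the abstract, so they score 0. The paper does not address applications in a specific scientific field, so 'AI for Science' and similar keywords also score 0.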

!!! tip deepseek-chat TL;DR

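To address reward models overfitting to spurious cues (such as response length and an agreeable tone) in LLM alignment, the paper proposes regularizing the reward model by learning latent intent embeddings via causal decomposition, improving selection accuracy and robustness on math, helpfulness, and safety benchmarks.

Abstract (Chinese translation)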

奖励模型是使大语言模型对齐的核心工具，但它们常会过度拟合虚假线索，如回答长度和过度讨好的语气。先前的研究大多通过惩罚或控制特定人为痕迹来直接削弱这些线索，但并未明确鼓励模型将偏好基于提示的真实意图。我们训练了一个解码器，可将候选答案映射至输入内容的潜在意图嵌入向量。其重建误差被用作正则化奖励模型训练的信号。我们从理论上证明，该信号能强化与提示相关的信息，同时抑制不依赖提示的捷径行为。在数学、助益性和安全性基准测试中，该解码器以0.877的准确率筛选出更简短且更少奉承的候选答案。将此信号整合至Gemma-2-2B-it和Gemma-2-9B-it的奖励模型训练后，RewardBench准确率从0.832提升至0.868。在Best-of-N选择任务中，我们的框架在生成更简短输出的同时提高了长度控制的胜率，并在受控改写测试中对回答延长和轻度偏题现象保持了鲁棒性。

Abstract

Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by penalizing or controlling specific artifacts, but it does not explicitly encourage the model to ground preferences in the prompt’s intent. We learn a decoder that maps a candidate answer to the latent intent embedding of the input. The reconstruction error is used as a signal to regularize the reward model training. We provide theoretical evidence that this signal emphasizes prompt-dependent information while suppressing prompt-independent shortcuts. Across math, helpfulness, and safety benchmarks, the decoder selects shorter and less sycophantic candidates with 0.877 accuracy. Incorporating this signal into RM training in Gemma-2-2B-it and Gemma-2-9B-it increases RewardBench accuracy from 0.832 to 0.868. For Best-of-N selection, our framework increases length-controlled win rates while producing shorter outputs, and remains robust to lengthening and mild off-topic drift in controlled rewrite tests.
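A rough sketch of how a reconstruction error could rank candidates follows. It assumes the decoder's output and the prompt's intent embedding are plain vectors, that the error is a mean squared difference, and that a lower error means the answer is better grounded in the prompt. The abstract specifies none of this, and the names `reconstruction_error` and `select_candidate` are hypothetical.

```python
def reconstruction_error(decoded_intent, prompt_intent):
    """Mean squared difference between a decoded intent vector and the prompt's intent vector."""
    return sum((a - b) ** 2 for a, b in zip(decoded_intent, prompt_intent)) / len(prompt_intent)


def select_candidate(prompt_intent, decoded_intents):
    """Return (index of the lowest-error candidate, list of all errors).

    decoded_intents holds one vector per candidate answer; in the paper these
    would come from a learned decoder, here they are stand-in numbers.
    """
    errors = [reconstruction_error(d, prompt_intent) for d in decoded_intents]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors


# Stand-in vectors: the second candidate tracks the prompt intent more closely.
prompt_intent = [1.0, 0.0, 0.5]
decoded = [
    [0.2, 0.9, 0.1],   # e.g. a long, agreeable answer that drifts off the prompt
    [0.9, 0.1, 0.5],   # e.g. a shorter answer that stays on the prompt
]
best, errors = select_candidate(prompt_intent, decoded)
print(best, errors)
```

How this signal enters reward model training (for example, as an added loss term) is not stated in the abstract, so the sketch stops at candidate selection.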

Keywords: Reward Modeling, Large Language Models, Alignment, Causal Decomposition, Intent Embedding, Regularization, RLHF, Preference Learning

122. ❌ MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

Authors: Zihao Liu, Hantao Zhou, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Peng Wang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13828v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

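Scoring rationale: MUSE focuses on building a user simulator, an application of large models in interactive AI systems. The core relevant keywords are: 1) 'Post-training OR Supervised Fine-tuning OR SFT' (10 points): the paper explicitly uses Role-Reversal Supervised Fine-Tuning; 2) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10 points): a user simulator is essentially an LLM-driven agent; 3) 'Large Language Models OR LLMs OR Foundation Models' (8 points): the simulator is built on large models; 4) 'Instruction Tuning OR Alignment OR Value Alignment' (8 points): the paper involves rubric-guided alignment; 5) 'Self-Correction OR Self-Improvement OR Self-Reflection' (5 points): the IPSE mechanism involves self-optimization. Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper.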

!!! tip deepseek-chat TL;DR

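To address shallow user profiles, poor persona consistency in long dialogues, and limited multilingual and multi-domain support in existing user simulators, the paper proposes the MUSE framework, which achieves more realistic, coherent, and persona-consistent multi-domain Chinese user simulation through iterative profile self-evolution, role-reversal supervised fine-tuning, and rubric-guided multi-turn reinforcement learning.

Abstract (Chinese translation)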

用户模拟器对于交互式人工智能系统的规模化训练与评估至关重要。然而，现有方法通常依赖于浅层的用户画像构建，难以在长程交互中保持人设一致性，且大多局限于英语或单一领域场景。本文提出MUSE，一个面向多领域的中文用户模拟框架，旨在生成类人、可控且行为一致的响应。首先，我们提出迭代式画像自演进方法，通过比较并推理模拟轨迹与真实对话行为之间的差异，逐步优化用户画像。随后，我们采用角色反转监督微调技术，以提升局部响应的真实感与类人表达。为实现细粒度的行为对齐，我们进一步训练了一个基于专项评估准则的奖励模型，并将其融入准则引导的多轮强化学习中，从而在对话层面优化模拟器，并增强长程交互中的行为一致性。实验表明，MUSE在话语级与会话级评估中均持续优于现有基线模型，能够在扩展交互中生成更真实、连贯且符合人设的响应。

Abstract

User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPSE), which gradually optimizes user profiles by comparing and reasoning discrepancies between simulated trajectories and real dialogue behaviors. We then apply Role-Reversal Supervised Fine-Tuning to improve local response realism and human-like expression. To enable fine-grained behavioral alignment, we further train a specialized rubric-based reward model and incorporate it into rubric-guided multi-turn reinforcement learning, which optimizes the simulator at the dialogue level and enhances long-horizon behavioral consistency. Experiments show that MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.

Keywords: user simulation, multi-domain, Chinese, self-evolving profiles, supervised fine-tuning, rubric-guided alignment, reinforcement learning, behavioral consistency

123. ❌ ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

Authors: Shouzheng Huang, Meishan Zhang, Baotian Hu, Min Zhang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13787v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

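Scoring rationale: The paper studies LLM tool use in open-world scenarios through an agentic learning framework that combines proactive retrieval with a reasoning-based execution loop. It is highly relevant to LLM, SFT, LLM Agents, and Tool Use (10 points each), because it explicitly uses LLMs, trains agentic capabilities with SFT, builds an agentic framework, and tackles tool use. It is relevant to RAG (8 points) because it involves a proactive retrieval mechanism, although the focus is tool retrieval rather than retrieval-augmented generation. It is relevant to CoT Reasoning (8 points) because the framework contains a reasoning loop, although it does not explicitly use CoT terminology. Other keywords such as MoE, SLMs, and Scaling Laws are unrelated to the paper (0 points).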

!!! tip deepseek-chat TL;DR

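To address the low accuracy of LLM tool use in open-world scenarios, the paper proposes the ToolOmni agentic framework, which uses proactive retrieval and grounded execution within a reasoning loop and surpasses strong baselines by +10.8% in end-to-end execution success rate.

Abstract (Chinese translation)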

大型语言模型（LLMs）通过利用外部工具来增强其问题解决能力。然而，在工具库海量且动态演化的开放世界场景中，现有依赖静态嵌入检索或工具参数记忆的方法，分别难以将用户意图与工具语义对齐或泛化至未见过的工具，导致开放世界工具检索与执行的准确率欠佳。为解决这些问题，我们提出了ToolOmni，一个统一的智能体框架，通过在推理循环内进行主动检索与基于上下文的执行，使LLMs能够实现开放世界工具使用。首先，我们构建了一个冷启动多轮交互数据集，通过监督微调（SFT）来注入基础的智能体能力。随后，我们引入了基于解耦多目标GRPO算法的开放世界工具学习，该算法同时优化LLMs在在线环境中的工具检索准确率和执行效能。大量实验表明，ToolOmni在检索与执行两方面均达到了最先进的性能，其端到端执行成功率显著超越强基线模型+10.8%，同时展现出卓越的鲁棒性和泛化能力。

Abstract

Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods relying on static embedding retrieval or parameter memorization of tools struggle to align user intent with tool semantics or generalize to unseen tools, respectively, leading to suboptimal accuracy of open-world tool retrieval and execution. To address these, we present ToolOmni, a unified agentic framework that enables LLMs for open-world tool use by proactive retrieval and grounded execution within a reasoning loop. First, we construct a cold-start multi-turn interaction dataset to instill foundational agentic capabilities via Supervised Fine-Tuning (SFT). Then, we introduce open-world tool learning based on a Decoupled Multi-Objective GRPO algorithm, which simultaneously optimizes LLMs for both tool retrieval accuracy and execution efficacy in online environments. Extensive experiments demonstrate that ToolOmni achieves state-of-the-art performance both in retrieval and execution, surpassing strong baselines by a significant margin of +10.8% in end-to-end execution success rate, while exhibiting exceptional robustness and generalization capabilities.
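The group-relative part of GRPO can be sketched in a few lines. The abstract does not define the "Decoupled Multi-Objective" variant, so the code below makes an explicit assumption: each objective's rewards are normalized within the rollout group separately and only then combined with fixed weights. The function names and weights are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard group-relative normalization: (r - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def decoupled_advantages(retrieval_rewards, execution_rewards,
                         w_retrieval=0.5, w_execution=0.5):
    """Normalize each objective within the group on its own, then mix.

    This is one possible reading of "decoupled", assumed for illustration only.
    """
    adv_retrieval = group_relative_advantages(retrieval_rewards)
    adv_execution = group_relative_advantages(execution_rewards)
    return [w_retrieval * r + w_execution * e
            for r, e in zip(adv_retrieval, adv_execution)]


# Four rollouts for one query: retrieval hit/miss and execution success/failure.
retrieval = [1.0, 0.0, 1.0, 0.0]
execution = [1.0, 0.0, 0.0, 0.0]
print(decoupled_advantages(retrieval, execution))
```

Normalizing per objective keeps a reward with a large numeric range from drowning out the other one, which is a common motivation for treating objectives separately.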

Keywords: Large Language Models, Tool Use, Agentic Framework, Supervised Fine-Tuning, Retrieval-Augmented Generation, Open-World Scenarios, Reasoning Loop, Execution Success Rate

124. ❌ QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs

Authors: Junlin Zhu, Baizhou Huang, Xiaojun Wan Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13786v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

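Scoring rationale: The paper focuses on LLM watermarking and is highly relevant only to 'Large Language Models' (10 points). The other keywords cover model architecture, training methods, inference optimization, and application domains, none of which the paper addresses (0 points).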

!!! tip deepseek-chat TL;DR

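The paper proposes QuantileMark, a multi-bit watermarking method that addresses message symmetry in LLM-generated content, improving the robustness and accuracy of watermark detection while preserving text quality.

Abstract (Chinese translation)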

随着大语言模型成为内容生成的标准后端，实用溯源日益需要多比特水印技术。在提供商内部部署场景中，一个关键要求是消息对称性：消息本身不应系统性地影响文本质量或验证结果。基于词汇划分的水印方法在低熵解码时可能破坏消息对称性：部分消息被分配了大部分概率质量，而其他消息则被迫使用尾部词元，导致嵌入质量和消息解码准确度依赖于具体消息。我们提出QuantileMark，一种白盒多比特水印方案，将消息嵌入连续累积概率区间$[0, 1)$内。在每个生成步骤中，QuantileMark将该区间划分为$M$个等质量区间，并严格从目标符号对应的区间内采样，确保无论上下文熵值如何都保持固定的$1/M$概率预算。在检测端，验证器通过教师强制解码重建相同划分，计算潜在区间的后验概率，并聚合证据进行验证。我们证明了消息无偏性——该性质确保在对消息取平均时可恢复基础分布。这为生成端的对称性提供了理论基础，而等质量设计进一步促进了检测端跨消息的证据强度均匀性。在C4续写和LFQA任务上的实验结果表明，相较于强基线方法，本方案在多比特恢复能力和检测鲁棒性方面均有提升，且对生成质量的影响可忽略不计。代码已发布于GitHub（https://github.com/zzzjunlin/QuantileMark）。

Abstract

As large language models become standard backends for content generation, practical provenance increasingly requires multi-bit watermarking. In provider-internal deployments, a key requirement is message symmetry: the message itself should not systematically affect either text quality or verification outcomes. Vocabulary-partition watermarks can break message symmetry in low-entropy decoding: some messages are assigned most of the probability mass, while others are forced to use tail tokens. This makes embedding quality and message decoding accuracy message-dependent. We propose QuantileMark, a white-box multi-bit watermark that embeds messages within the continuous cumulative probability interval $[0, 1)$. At each step, QuantileMark partitions this interval into $M$ equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring a fixed $1/M$ probability budget regardless of context entropy. For detection, the verifier reconstructs the same partition under teacher forcing, computes posteriors over latent bins, and aggregates evidence for verification. We prove message-unbiasedness, a property ensuring that the base distribution is recovered when averaging over messages. This provides a theoretical foundation for generation-side symmetry, while the equal-mass design additionally promotes uniform evidence strength across messages on the detection side. Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over strong baselines, with negligible impact on generation quality. Our code is available at GitHub (https://github.com/zzzjunlin/QuantileMark).
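The equal-mass binning idea can be made concrete with a toy sketch. It is not the released QuantileMark code: it assumes a fixed token order for the cumulative distribution, a uniform prior over bins at detection time, and no keyed randomness, and the function names are hypothetical.

```python
import random


def token_intervals(probs):
    """Cumulative interval [lo, hi) of each token under a fixed token order."""
    intervals, lo = [], 0.0
    for p in probs:
        intervals.append((lo, lo + p))
        lo += p
    return intervals


def embed_step(probs, symbol, num_bins, rng):
    """Sample a token whose quantile lies in the bin assigned to `symbol`.

    The bin for symbol m is [m / num_bins, (m + 1) / num_bins), so every symbol
    gets the same 1 / num_bins probability budget whatever the entropy of probs.
    """
    u = (symbol + rng.random()) / num_bins
    for token, (lo, hi) in enumerate(token_intervals(probs)):
        if lo <= u < hi:
            return token
    return len(probs) - 1  # guard against floating-point round-off at the top edge


def bin_posterior(probs, token, num_bins):
    """Posterior over bins given an observed token, assuming a uniform prior.

    Proportional to the overlap between the token's interval and each bin.
    """
    lo, hi = token_intervals(probs)[token]
    overlaps = [max(0.0, min(hi, (m + 1) / num_bins) - max(lo, m / num_bins))
                for m in range(num_bins)]
    total = sum(overlaps)
    if total == 0.0:
        return [1.0 / num_bins] * num_bins
    return [o / total for o in overlaps]


rng = random.Random(0)
probs = [0.6, 0.4]                        # a low-entropy next-token distribution
token = embed_step(probs, symbol=1, num_bins=2, rng=rng)
print(token, bin_posterior(probs, token, num_bins=2))
```

Because the quantile `u` is uniform on $[0, 1)$ when the symbol is chosen uniformly at random, averaging over messages recovers the base distribution, which matches the message-unbiasedness property stated in the abstract.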

Keywords: Large Language Models, Multi-bit Watermarking, Message Symmetry, QuantileMark, White-box Watermark, Text Generation, Watermark Detection, Cumulative Probability Partition

125. ❌ Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

Authors: Alexander Nemecek, Osama Zafar, Yuqiao Xu, Wenbiao Li, Erman Ayday Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13776v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

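Scoring rationale: The paper studies evaluation bias in AI content watermarking, covering performance differences of watermarks across text, image, and audio modalities, and proposes a pluralistic evaluation framework. All scoring keywords target specific technical directions such as the principles, training methods, inference optimization, and applications of large models and deep learning, whereas this paper concerns fairness evaluation of AI watermarking, which belongs to AI governance and evaluation methodology and has no direct connection to those keywords.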

!!! tip deepseek-chat TL;DR

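The paper argues that watermark detection performance depends on content properties that vary across languages, cultures, and demographic groups, finds that existing benchmarks (with one exception) do not report performance along these axes, and proposes three pluralistic evaluation dimensions to close the gap.

Abstract (Chinese translation)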

水印技术正逐渐成为AI内容认证的默认机制，各类治理政策与框架将其视为内容溯源的基础设施。然而，在文本、图像和音频模态中，水印信号的强度、可检测性与鲁棒性均依赖于内容本身的统计特性，而这些特性在不同语言、文化视觉传统和人口群体间存在系统性差异。本文探讨了这种内容依赖性如何导致特定模态的偏见路径。通过检视各模态主要的水印基准测试，我们发现除一项例外，现有研究均未报告跨语言、跨文化内容类型或跨人群的性能表现。为此，我们提出多元化水印基准测试的三个具体评估维度：跨语言检测公平性、文化多样性内容覆盖度以及检测指标的人口群体细分。我们将这些维度与当前强制部署水印的治理框架相联系，指出水印技术所遵循的公平性标准低于其本应监管的生成式系统。我们的立场是：评估必须先于部署，应用于AI模型的偏见审计要求同样应延伸至验证层。

Abstract

Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We connect these to the governance frameworks currently mandating watermarking deployment and show that watermarking is held to a lower fairness standard than the generative systems it is meant to govern. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.
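Demographic disaggregation of a detection metric amounts to reporting the metric per group instead of as one aggregate. The sketch below shows one simple way to do this; the group labels, the record layout, and the max-minus-min "parity gap" are illustrative assumptions, not metrics defined by the paper.

```python
from collections import defaultdict


def disaggregated_detection_rates(records):
    """Detection rate per group.

    records: iterable of (group_label, detected) pairs for watermarked samples.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for group, detected in records:
        totals[group] += 1
        hits[group] += int(bool(detected))
    return {group: hits[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Spread between the best- and worst-served group (0.0 means equal rates)."""
    return max(rates.values()) - min(rates.values())


# Made-up records for two hypothetical language groups.
records = [
    ("lang_a", True), ("lang_a", True), ("lang_a", True), ("lang_a", False),
    ("lang_b", True), ("lang_b", False), ("lang_b", False), ("lang_b", False),
]
rates = disaggregated_detection_rates(records)
print(rates, parity_gap(rates))
```

An aggregate detection rate over these records would hide the difference between the two groups, which is the kind of gap the proposed evaluation dimensions are meant to surface.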

Keywords: AI watermarking, content authentication, bias evaluation, cross-lingual detection, cultural diversity, demographic disaggregation, fairness standards, governance frameworks

126. ❌ MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

Authors: Zhijie Bao, Fangke Chen, Licheng Bao, Chenhui Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13756v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

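Scoring rationale: The paper focuses on an evaluation framework for multimodal large language models (MLLMs) in medical imaging and is highly relevant to 'Large Language Models' and 'AI for Science' (10 points each). It emphasizes evaluating reasoning mechanisms and reliability, which relates to 'Chain of Thought' and 'System 2 Thinking' (8 points each). It is concerned with credibility and clinical deployment, which is weakly related to 'Hallucination Mitigation' and 'Explainable AI' (5 points each). Other keywords such as MoE, SFT, and RAG are not addressed in the paper and score 0.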

!!! tip deepseek-chat TL;DR

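The paper proposes MedRCube, a multidimensional, fine-grained evaluation framework for assessing multimodal large language models in medical imaging, finds that Lingshu-32B performs best among the 33 models benchmarked, and reveals a significant positive association between shortcut behavior and diagnostic performance.

Abstract (Chinese translation)

多模态大语言模型（MLLMs）在医学影像领域的潜力，催生了对系统且严谨评估框架的需求，这些框架需与实际医学影像实践相契合。现有评估方法通常仅报告单一或粗粒度指标，缺乏专业临床支持所需的细粒度分析，且无法评估其推理机制的可靠性。为此，我们提出向多维度、细粒度且深入的评估范式转变。基于为此范式设计的两阶段系统构建流程，我们实例化了MedRCube评估体系。我们对33个MLLMs进行了基准测试，其中Lingshu-32B取得了顶尖性能。关键的是，MedRCube揭示了一系列在先前的评估设置下无法获得的显著洞见。此外，我们引入了可信度评估子集以量化推理可信度，发现捷径行为与诊断任务表现之间存在高度显著的正相关关系，这为临床可信部署提出了重要关切。本工作的相关资源可在https://github.com/F1mc/MedRCube获取。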

Abstract

The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the real-world medical imaging practice. Existing practices that report single or coarse-grained metrics are lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, Lingshu-32B achieve top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.

Keywords: Multimodal Large Language Models, Medical Imaging, Evaluation Framework, Fine-grained Evaluation, Reasoning Credibility, Clinical Trustworthiness, Benchmarking, Shortcut Behavior

127. ❌ Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Authors: Yuanlei Zheng, Pei Fu, Hang Li, Ziyang Wang, Yuyi Zhang, Wenyu Ruan, Xiaojin Zhang, Zhongyu Wei, Zhenbo Luo, Jian Luan, Wei Chen, Xiang Bai Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13731v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

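Scoring rationale: The paper proposes Doc-V*, an OCR-free agentic framework for multi-page document visual question answering. Its core innovation is modeling multi-page DocVQA as an agentic workflow of sequential evidence aggregation. Relevance to the keywords: 1. Highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (10 points), because the paper explicitly proposes an agentic framework and implements a workflow of autonomous navigation and evidence aggregation. 2. Strongly relevant to 'Retrieval-Augmented Generation/RAG' (8 points): the paper uses RAG as a comparison baseline and improves the retrieval mechanism. 3. Moderately relevant to 'Chain of Thought/CoT Reasoning' and 'System 2 Thinking/Slow Thinking' (5 points each), because the framework involves structured working memory and evidence-based reasoning. 4. Basic relevance to 'Large Language Models' (5 points): although the paper does not explicitly use an LLM, it is an application of large models to document understanding. 5. Moderately relevant to 'Tool Use' (5 points): the framework implements tool use for page navigation and evidence retrieval. The other keywords have no direct relation to the paper and all score 0.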

!!! tip deepseek-chat TL;DR

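The study proposes the Doc-V* framework, which models multi-page document VQA as an agentic workflow of sequential evidence aggregation. Across five benchmarks it outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over a RAG baseline.

Abstract (Chinese translation)

多页文档视觉问答任务要求对篇幅冗长、视觉信息密集的文档进行语义、版式和视觉元素的综合推理。现有的无光学字符识别方法面临能力与精度之间的权衡：端到端模型随文档长度增加而扩展性不佳，而基于视觉检索的流程则脆弱且被动。我们提出Doc-$V^*$，一个无需OCR的智能体框架，将多页文档视觉问答任务转化为序列化证据聚合过程。Doc-$V^*$首先通过缩略图概览文档，随后通过语义检索与定向页面抓取进行主动导航，并在结构化工作记忆中聚合证据以进行基于事实的推理。通过专家轨迹模仿学习进行训练，并采用群体相对策略优化进一步微调，Doc-$V^*$在答案准确性与证据搜寻效率之间取得平衡。在五个基准测试中，Doc-$V^*$的表现优于开源基线模型并接近专有模型性能，其领域外表现较检索增强生成基线提升高达47.9%。其他结果表明，该方法通过选择性注意力实现了有效的证据聚合，而非依赖增加输入页面数量。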

Abstract

Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-$V^*$, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-$V^*$ begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-$V^*$ balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-$V^*$ outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over RAG baseline. Other results reveal effective evidence aggregation with selective attention, not increased input pages.

Keywords: Multi-page Document VQA, OCR-free, Agentic Framework, Sequential Evidence Aggregation, Visual Reasoning, Imitation Learning, Group Relative Policy Optimization, Retrieval-Augmented Generation

128. ❌ Hybrid Retrieval for COVID-19 Literature: Comparing Rank Fusion and Projection Fusion with Diversity Reranking

Authors: Harishkumar Kishorkumar Prajapati Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13728v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

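Scoring rationale: The paper studies a hybrid retrieval system for COVID-19 scientific literature, using sparse (SPLADE) and dense (BGE) retrieval and comparing two fusion strategies (RRF and projection fusion). It is unrelated to most keywords, which concern the technical principles, training methods, inference optimization, and alignment of large language models (LLMs). It is only loosely related to two keywords: (1) "Retrieval-Augmented Generation (RAG)": the paper covers retrieval but not generation or LLMs, hence 5 points (some relevance); (2) "AI for Science": the paper applies AI techniques to COVID-19 scientific literature retrieval, an AI application in a scientific domain, hence 8 points (highly relevant, though not the core topic). All other keywords are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper builds a hybrid retrieval system for COVID-19 scientific literature and compares rank-based and projection-based fusion, finding that rank fusion gives the best relevance while projection fusion is faster and yields more diverse results.

Abstract (Chinese translation)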

我们提出了一种针对COVID-19科学文献的混合检索系统，并在TREC-COVID基准测试（包含171,332篇论文和50个专家查询）上进行了评估。该系统实现了六种检索配置，涵盖稀疏检索（SPLADE）、稠密检索（BGE）、排序级融合（RRF）以及一种基于投影的向量融合（B5）方法。RRF融合取得了最佳相关性（nDCG@10 = 0.828），优于纯稠密检索6.1%，优于纯稀疏检索14.9%。我们提出的投影融合变体在专家查询上达到nDCG@10 = 0.678，同时速度提升33%（847毫秒 vs. 1271毫秒），并且产生的ILD@10比RRF高2.2倍。通过对400个查询（包括专家查询、机器生成查询和三种释义风格查询）的评估表明，B5在关键词密集的重述查询上取得了最大的相对增益（+8.8%），尽管RRF在绝对nDCG@10上仍保持最优。在专家查询上，MMR重排序以20.4-25.4%的nDCG@10代价将列表内多样性提升了23.8-24.5%。针对延迟评估的两种融合流程在所有查询集上均保持在2秒以内的目标阈值以下。该系统已部署为一个基于Pinecone无服务器索引的Streamlit网络应用程序。

Abstract

We present a hybrid retrieval system for COVID-19 scientific literature, evaluated on the TREC-COVID benchmark (171,332 papers, 50 expert queries). The system implements six retrieval configurations spanning sparse (SPLADE), dense (BGE), rank-level fusion (RRF), and a projection-based vector fusion (B5) approach. RRF fusion achieves the best relevance (nDCG@10 = 0.828), outperforming dense-only by 6.1% and sparse-only by 14.9%. Our projection fusion variant reaches nDCG@10 = 0.678 on expert queries while being 33% faster (847 ms vs. 1271 ms) and producing 2.2x higher ILD@10 than RRF. Evaluation across 400 queries – including expert, machine-generated, and three paraphrase styles – shows that B5 delivers the largest relative gain on keyword-heavy reformulations (+8.8%), although RRF remains best in absolute nDCG@10. On expert queries, MMR reranking increases intra-list diversity by 23.8-24.5% at a 20.4-25.4% nDCG@10 cost. Both fusion pipelines evaluated for latency remain below the sub-2 s target across all query sets. The system is deployed as a Streamlit web application backed by Pinecone serverless indices.

Keywords: hybrid retrieval, COVID-19 literature, rank fusion, projection fusion, diversity reranking, TREC-COVID, SPLADE, BGE
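
The rank-level fusion (RRF) compared in the abstract above can be sketched in a few lines. This is a minimal illustration of standard Reciprocal Rank Fusion, not the paper's implementation: the function name, the toy document IDs, and the constant `k=60` (a common default; the abstract does not state the value used) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a sparse and a dense retriever.
sparse_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the sparse and dense retrievers; documents ranked well by both lists accumulate the largest fused score.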

129. ❌ An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

Authors: Ryan Lail Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13717v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

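Scoring rationale: The paper's core topic is the LLM-as-a-judge technique, which directly concerns the use of large language models (LLMs) for evaluation, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The paper notes that LLM-as-a-judge is widely used in RLHF pipelines, so it has some relevance to "RLHF OR RLAIF OR Direct Preference Optimization OR DPO" (5 points). The other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, Instruction Tuning, RAG, Context Window, KV Cache, CoT, Agents, Quantization, Hallucination, Interpretability, World Models, Model Merging, In-context Learning, and AI for Science, are either not mentioned in the abstract or unrelated to the paper's topic, and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies practical techniques that improve the accuracy of GPT-5.4 as a judge (LLM-as-a-judge) on RewardBench 2 without fine-tuning, finding that combining task-specific criteria injection with ensemble scoring raises accuracy from 71.7% to 83.6%.

Abstract (Chinese translation)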

LLM-as-a-judge（大语言模型作为评判者）通过使用语言模型对候选回答进行评分或排序，已广泛作为人类评估的可扩展替代方案，应用于RLHF（人类反馈强化学习）流程、基准测试及应用层评估中。然而，其评判可靠性高度依赖于提示设计与聚合策略。本文对一系列实用、即插即用的技术进行了实证研究，这些技术无需微调即可提升GPT-5.4模型在RewardBench 2数据集上的评判准确率。其中两种技术贡献了绝大部分性能增益：任务特定准则注入（以可忽略的成本提升3.0个百分点）和集成评分（以5倍成本提升9.8个百分点）。结合使用时，准确率达到83.6%，较71.7%的基线提升11.9个百分点。本研究还探讨了另外三种技术（校准上下文、自适应模型升级和软性混合），但在相近成本下均未展现出相对于“准则注入+集成”方法的稳定优势。较低成本的模型层级通过集成获得尤为显著的收益：GPT-5.4 mini采用k=8集成时以1.2倍基线成本达到79.2%准确率，GPT-5.4 nano采用k=8集成时以0.4倍基线成本达到71.4%准确率，这为低成本实现高精度大语言模型评判提供了可行路径。

Abstract

LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any finetuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.

Keywords: LLM-as-a-judge, RewardBench 2, GPT-5.4, judge accuracy, task-specific criteria injection, ensemble scoring, RLHF pipelines, human evaluation
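
Ensemble scoring, the larger of the two gains reported in the abstract above, can be illustrated as follows. The abstract does not say how the k judgments are aggregated, so this sketch assumes mean aggregation of numeric scores; `ensemble_pick` and the stand-in `make_noisy_judge` are hypothetical names, and a real judge would be an LLM call rather than a random-noise function.

```python
import random
from statistics import mean


def ensemble_pick(judge, prompt, candidates, k=5):
    """Score every candidate k times and return (best_index, mean_scores).

    Assumption: the k judgments are combined by a plain average.
    """
    mean_scores = [mean(judge(prompt, c) for _ in range(k)) for c in candidates]
    best = max(range(len(candidates)), key=mean_scores.__getitem__)
    return best, mean_scores


def make_noisy_judge(true_quality, noise=2.0, seed=0):
    """Hypothetical stand-in for an LLM judge: true quality plus Gaussian noise."""
    rng = random.Random(seed)

    def judge(prompt, candidate):
        return true_quality[candidate] + rng.gauss(0.0, noise)

    return judge


judge = make_noisy_judge({"response A": 6.0, "response B": 7.0})
best, scores = ensemble_pick(judge, "Which answer is better?",
                             ["response A", "response B"], k=8)
print(best, scores)
```

Averaging k independent judgments shrinks the noise in each candidate's score by roughly a factor of the square root of k, at k times the judging cost, which matches the cost/accuracy trade-off the abstract describes.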

130. ❌ Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs

Authors: Sinan Kurtyigit, Sabine Schulte im Walde, Alexander Fraser Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13713v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

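Scoring rationale: The paper studies the generalization ability of metaphor detection models through fine-tuning experiments with RoBERTa, and mainly concerns the metaphor detection task in natural language processing. It is unrelated to most of the LLM technique keywords and has only some relevance to "Post-training OR Supervised Fine-tuning OR SFT" (5 points), because it fine-tunes RoBERTa, although this is not its core contribution. The paper does not involve innovative applications of large models in different domains or innovations in their technical principles, so all other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

Using controlled lexical hold-out experiments, the paper analyzes how a RoBERTa model generalizes in metaphor detection for English verbs, finding that generalization comes mainly from learning transferable contextual patterns, while lexical memorization adds a further boost when the verb was seen during fine-tuning.

Abstract (Chinese translation)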

隐喻检测模型在基准测试中表现出色，但其性能究竟反映了可迁移的泛化能力还是词汇记忆，目前尚不明确。为探究此问题，我们通过RoBERTa（众多先进系统的共享主干模型）分析了隐喻检测中的泛化现象，并以阿姆斯特丹自由大学隐喻语料库中的英语动词为研究对象。我们引入了一种受控的词汇留出设置：在微调阶段严格排除选定目标词元的所有实例，并将模型对这些留出词元的预测结果与已接触词元（微调过程中见过的动词）进行比较。模型在已接触词元上表现最佳，但在留出词元上仍保持稳健性能。进一步分析表明，仅凭句子上下文就足以在留出词元上达到与完整模型相当的性能，而静态的动词层面嵌入则无法做到这一点。综合来看，这些结果表明泛化能力主要源于“学习提示”（可迁移的上下文模式），而“学习词汇”（动词特异性记忆）在词汇接触可行时能提供额外的性能提升。

Abstract

Metaphor detection models achieve strong benchmark performance, yet it remains unclear whether this reflects transferable generalization or lexical memorization. To address this, we analyze generalization in metaphor detection through RoBERTa, the shared backbone of many state-of-the-art systems, focusing on English verbs using the VU Amsterdam Metaphor Corpus. We introduce a controlled lexical hold-out setup where all instances of selected target lemmas are strictly excluded from fine-tuning, and compare predictions on these Held-out lemmas against Exposed lemmas (verbs seen during fine-tuning). While the model performs best on Exposed lemmas, it maintains robust performance on Held-out lemmas. Further analysis reveals that sentence context alone is sufficient to match full-model performance on Held-out lemmas, whereas static verb-level embeddings are not. Together, these results suggest that generalization is primarily driven by “learning the cue” (transferable contextual patterns), while “learning the word” (verb-specific memorization) provides an additive boost when lexical exposure is available.

Keywords: metaphor detection, generalization analysis, RoBERTa, lexical hold-out, contextual patterns, verb-specific memorization, VU Amsterdam Metaphor Corpus, fine-tuning
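
The controlled lexical hold-out setup described in the abstract above amounts to a lemma-level data split. The sketch below shows one way such a split could be built; the record layout (`lemma`, `label`), the function name, and the toy verbs are assumptions for illustration, not the authors' code.

```python
def lexical_holdout(train_pool, eval_pool, heldout_lemmas):
    """Build a lemma-level hold-out split.

    - train: every training instance whose lemma is NOT held out
    - eval_heldout: evaluation instances of held-out lemmas (never seen in training)
    - eval_exposed: evaluation instances of lemmas that do occur in train
    """
    heldout = set(heldout_lemmas)
    train = [ex for ex in train_pool if ex["lemma"] not in heldout]
    seen = {ex["lemma"] for ex in train}
    eval_heldout = [ex for ex in eval_pool if ex["lemma"] in heldout]
    eval_exposed = [ex for ex in eval_pool if ex["lemma"] in seen]
    return train, eval_heldout, eval_exposed


# Toy data: label 1 = metaphorical use, 0 = literal use.
train_pool = [
    {"lemma": "attack", "label": 1},
    {"lemma": "run", "label": 0},
    {"lemma": "devour", "label": 1},
]
eval_pool = [
    {"lemma": "devour", "label": 1},
    {"lemma": "run", "label": 1},
]
train, eval_heldout, eval_exposed = lexical_holdout(train_pool, eval_pool, ["devour"])
print(len(train), len(eval_heldout), len(eval_exposed))
```

Comparing a model's accuracy on `eval_heldout` with its accuracy on `eval_exposed` separates what it learned from context from what it memorized about specific verbs.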

131. ❌ Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection

Authors: Xiao Pu, Zepeng Cheng, Lin Yuan, Yu Wu, Xiuli Bi Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13692v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

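Scoring rationale: The paper focuses on AI-generated text detection, whose core problem is detecting text generated by LLMs, so it is highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). It does not involve the specific techniques of the other keywords (such as MoE, SFT, or RAG) or their application domains (such as bioinformatics), so the other keywords receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes a disentangled representation framework to improve the generalization of AI-generated text detection, reporting gains of up to 24.2% in accuracy and 26.2% in F1 on the MAGE benchmark, which covers 20 representative LLMs.

Abstract (Chinese translation)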

随着大语言模型生成的文本日益接近人类写作，区分AI生成内容与人类撰写内容的细微线索变得愈发难以捕捉。依赖生成器特定伪影的方法本质上不稳定，因为新模型快速涌现会削弱此类捷径的鲁棒性。这使未见过生成器的泛化检测成为AI文本检测领域核心且具挑战性的问题。为应对该挑战，我们提出一种渐进式结构化框架，将AI检测语义从生成器相关伪影中解耦。该框架通过鼓励语义极简性的紧凑潜在编码实现，继而采用基于扰动的正则化方法减少残留纠缠，最终通过判别性适应阶段将表征与任务目标对齐。在涵盖7个类别共20个代表性大语言模型的MAGE基准测试中，实验表明本方法相较于现有最优技术取得持续改进，最高获得24.2%的准确率提升与26.2%的F1分数改善。值得注意的是，随着训练生成器多样性的增加，模型性能持续提升，证实了在开放集场景中强大的可扩展性与泛化能力。我们的源代码将在https://github.com/PuXiao06/DRGD 公开。

Abstract

As large language models (LLMs) generate text that increasingly resembles human writing, the subtle cues that distinguish AI-generated content from human-written content become increasingly challenging to capture. Reliance on generator-specific artifacts is inherently unstable, since new models emerge rapidly and reduce the robustness of such shortcuts. This generalizes unseen generators as a central and challenging problem for AI-text detection. To tackle this challenge, we propose a progressively structured framework that disentangles AI-detection semantics from generator-aware artifacts. This is achieved through a compact latent encoding that encourages semantic minimality, followed by perturbation-based regularization to reduce residual entanglement, and finally a discriminative adaptation stage that aligns representations with task objectives. Experiments on MAGE benchmark, covering 20 representative LLMs across 7 categories, demonstrate consistent improvements over state-of-the-art methods, achieving up to 24.2% accuracy gain and 26.2% F1 improvement. Notably, performance continues to improve as the diversity of training generators increases, confirming strong scalability and generalization in open-set scenarios. Our source code will be publicly available at https://github.com/PuXiao06/DRGD.

Keywords: AI-text detection, large language models, generalization, disentangled representation, generator-agnostic, MAGE benchmark, open-set scenarios

132. ❌ Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

Authors: Xuwen Zhou, Fangxin Liu, Chao Wang, Xiao Zheng, Hao Zheng, Min He, Li Jiang, Haibing Guan Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	15.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

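Scoring rationale: The paper's core contribution is an improved speculative decoding method, a key technique for accelerating LLM inference. Its title and abstract explicitly mention "speculative decoding", so it is highly relevant to "Speculative Decoding OR Inference Acceleration" (15 points). The method is evaluated on a range of large language models, so the paper is relevant to "Large Language Models OR LLMs OR Foundation Models" (10 points). The other keywords, such as MoE, SLMs, Scaling Laws, training methods, alignment, RAG, reasoning methods, agents, quantization, hallucination mitigation, interpretability, model merging, in-context learning, and AI for Science, are not covered by the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes Calibrated Speculative Decoding, a training-free framework that uses frequency-guided candidate selection and probability-guarded acceptance to address the frequent false rejections that conventional speculative decoding suffers when drafts are semantically correct but lexically divergent, achieving a peak throughput speedup of 2.33x while preserving model accuracy.

Abstract (Chinese translation)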

推测解码通过让草拟标记绕过完整验证来加速自回归生成，但传统框架存在频繁的虚假拒绝问题，尤其在草拟模型产生语义正确但词汇分歧的输出时更为突出。本文提出校准推测解码（Calibrated Speculative Decoding, CSD），这是一种无需训练的框架，旨在恢复被标准验证丢弃的有效标记。遵循“频率引导的候选选择与概率防护的接受”原则，CSD包含两个轻量级模块：在线校正记忆模块，通过聚合历史拒绝记录，将反复出现的分歧模式作为救援候选提出；以及语义一致性门控模块，该模块使用概率比值而非精确标记匹配来验证候选的可接受性。我们在多种大型语言模型上的评估表明，CSD优于现有方法，实现了最高2.33倍的吞吐量加速。CSD在所有任务中保持了模型准确性，并在复杂推理数据集上进一步提升了性能。这些结果确立了CSD作为一种高效、轻量级的解决方案，适用于实际的大型语言模型部署。

Abstract

Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of “Frequency-Guided Candidate Selection and Probability-Guarded Acceptance,” CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.

Keywords: speculative decoding, inference acceleration, large language models, autoregressive generation, throughput speedup, training-free framework, probability-guided acceptance, semantic consistency
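
The two modules named in the abstract above can be pictured with a toy, single-token verifier. This is only one possible reading of the abstract and not the authors' algorithm: the pairing of draft and target tokens in the memory, the `min_count` and `ratio_threshold` parameters, and all names below are assumptions made for illustration.

```python
from collections import Counter


class CorrectionMemory:
    """Counts how often a draft token was rejected in favour of a target token."""

    def __init__(self, min_count=2):
        self.counts = Counter()
        self.min_count = min_count  # assumed threshold for a "recurring" pattern

    def record(self, draft_token, target_token):
        self.counts[(draft_token, target_token)] += 1

    def is_recurring(self, draft_token, target_token):
        return self.counts[(draft_token, target_token)] >= self.min_count


def verify_token(draft_token, target_probs, memory, ratio_threshold=0.5):
    """Return (emitted_token, draft_accepted) for one position.

    Exact match is accepted as in greedy speculative decoding. A mismatch is
    rescued only if it is a recurring divergence AND the target model gives the
    draft token a probability close to its own top choice (probability ratio).
    """
    target_token = max(target_probs, key=target_probs.get)
    if draft_token == target_token:
        return draft_token, True
    ratio = target_probs.get(draft_token, 0.0) / target_probs[target_token]
    if memory.is_recurring(draft_token, target_token) and ratio >= ratio_threshold:
        return draft_token, True
    memory.record(draft_token, target_token)  # remember this rejection
    return target_token, False


memory = CorrectionMemory(min_count=2)
probs = {"big": 0.5, "large": 0.4, "red": 0.1}  # hypothetical target distribution
for _ in range(3):
    print(verify_token("large", probs, memory))
```

In this toy run the near-synonym "large" is rejected twice, after which the memory treats the divergence as recurring and the ratio gate lets it through; a low-probability token such as "red" would still be rejected.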

133. ❌ (How) Learning Rates Regulate Catastrophic Overtraining

Authors: Mark Rofin, Aditya Varre, Nicolas Flammarion Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13627v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

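Scoring rationale: The paper's core topic is catastrophic overtraining during the supervised fine-tuning (SFT) of LLMs, so it is highly relevant to "Large Language Models" and "Post-training OR Supervised Fine-tuning OR SFT" (10 points each). It addresses the interaction between pretraining and fine-tuning, so it has some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points). It mentions that SFT is used to shape assistant behavior, which is indirectly related to "Instruction Tuning OR Alignment OR Value Alignment" (5 points). Other keywords, such as MoE, quantization, and RAG, are not mentioned in the abstract and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper studies how the learning rate regulates catastrophic overtraining during supervised fine-tuning of large language models, finding that learning rate decay increases the sharpness of the pretrained model, which exacerbates catastrophic forgetting during fine-tuning and leads to overtraining.

Abstract (Chinese translation)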

监督式微调（Supervised Fine-Tuning, SFT）是大语言模型后训练中常见的第一阶段，旨在教导模型遵循指令并塑造其作为有用助手的行为。然而，SFT也可能损害大语言模型的基础能力，尤其是在长时间预训练之后：这一现象被称为灾难性过度训练（catastrophic overtraining，Springer等人，2025）。为理解过度训练，我们首先从学习率的隐式正则化角度探究微调中的灾难性遗忘。对于训练至相同SFT损失的模型，我们揭示了学习率如何调节优化过程：使用大步长和小步长进行微调会收敛到性质不同的模型。接着，我们将遗忘与过度训练联系起来：学习率衰减会增加预训练模型的锐度，这反过来加剧了SFT期间的灾难性遗忘，从而导致过度训练。我们的研究结果描绘了大语言模型中过度训练机制的图景，并广泛促进了对预训练与微调期间优化动态之间相互作用的理解。

Abstract

Supervised fine-tuning (SFT) is a common first stage of LLM post-training, teaching the model to follow instructions and shaping its behavior as a helpful assistant. At the same time, SFT may harm the fundamental capabilities of an LLM, particularly after long pretraining: a phenomenon known as catastrophic overtraining (Springer et al., 2025). To understand overtraining, we first investigate catastrophic forgetting in finetuning through the lens of implicit regularization of the learning rate. For models trained to the same SFT loss, we identify how the learning rate mediates optimization: finetuning with large and small steps converges to qualitatively different models. Next, we link forgetting to overtraining: learning rate decay increases the sharpness of the pretrained model, which in turn exacerbates catastrophic forgetting during SFT, leading to overtraining. Our findings paint a picture of the overtraining mechanism in LLMs and broadly contribute to the understanding of the interplay between optimization dynamics during pretraining and finetuning.

Keywords: Supervised Fine-tuning, SFT, LLM, Catastrophic Overtraining, Learning Rate, Pretraining, Finetuning, Catastrophic Forgetting

134. ❌ MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

Authors: Jiahang Lin, Kai Hu, Binghai Wang, Yuhao Zhou, Zhiheng Xi, Honglin Guo, Shichun Liu, Junzhe Wang, Shihan Dou, Enyu Zhou, Hang Yan, Zhenhua Han, Tao Gui, Qi Zhang, Xuanjing Huang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

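Scoring rationale: The paper's core topic is the application of RAG systems to long-document visual question answering, and it proposes an agentic workflow and a new reinforcement learning algorithm, SPO. It is therefore highly relevant to "Retrieval-Augmented Generation (RAG)" (10 points) and to "LLM Agents" (10 points), and fairly relevant to "Large Language Models" (8 points, because it uses Qwen models). Other keywords, such as MoE, quantization, and inference acceleration, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the weakness of conventional RAG systems on multi-hop queries in long-document visual question answering, the paper proposes the MM-Doc-R1 framework and the SPO reinforcement learning algorithm, outperforming previous baselines by 10.4% on the MMLongbench-Doc benchmark.

Abstract (Chinese translation)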

传统的检索增强生成（Retrieval-Augmented Generation，RAG）系统因其单轮检索机制，在处理长文档中的复杂多跳查询时往往面临困难。我们提出了MM-Doc-R1这一新颖框架，它采用一种智能体式、具备视觉感知的工作流程，通过迭代式信息发现与综合来解决长文档视觉问答任务。为了增强智能体的信息探索能力，我们提出了基于相似性的策略优化（Similarity-based Policy Optimization，SPO），以解决现有如GRPO等多轮强化学习算法中存在的基线估计偏差问题。我们的核心见解是：在多轮强化学习中，两条轨迹的语义越相似，它们共享的基线估计就越准确。基于此，SPO通过对多条轨迹的奖励进行相似性加权平均来计算更精确的基线，这与GRPO不恰当地将初始状态的基线应用于所有中间状态的做法不同。这为我们的智能体提供了更稳定、更准确的学习信号，从而实现了超越GRPO的优异训练性能。我们在MMLongbench-Doc基准测试上的实验表明，MM-Doc-R1以10.4%的优势超越了之前的基线方法。此外，SPO展现出优于GRPO的性能，在使用Qwen3-8B和Qwen3-4B模型时分别将结果提升了5.0%和6.1%。这些结果凸显了我们集成化框架与新型训练算法在推进复杂长文档视觉问答领域前沿技术方面的有效性。

Abstract

Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.

Keywords: Retrieval-Augmented Generation, Long Document Visual Question Answering, Multi-turn Reinforcement Learning, Agentic Workflow, Similarity-based Policy Optimization, MM-Doc-R1, MMLongbench-Doc, Qwen
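
The contrast the abstract above draws between a single shared baseline and a similarity-weighted one can be made concrete with a small numeric sketch. The abstract does not specify the similarity measure or the exact weighting, so the similarity matrix, the inclusion of each trajectory in its own baseline, and the function names are assumptions; GRPO's standard-deviation normalization is also omitted here.

```python
def group_mean_baseline(rewards):
    """One shared baseline for the whole group: the plain mean reward."""
    baseline = sum(rewards) / len(rewards)
    return [baseline] * len(rewards)


def similarity_weighted_baseline(rewards, similarity):
    """Per-trajectory baseline: reward average weighted by trajectory similarity.

    similarity[i][j] >= 0 is an assumed similarity between trajectories i and j;
    each row must contain at least one positive weight.
    """
    baselines = []
    for weights in similarity:
        total = sum(weights)
        baselines.append(sum(w * r for w, r in zip(weights, rewards)) / total)
    return baselines


rewards = [1.0, 0.0, 1.0]
# Hypothetical similarities: trajectories 0 and 2 are alike, 1 is different.
similarity = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
]
baselines = similarity_weighted_baseline(rewards, similarity)
advantages = [r - b for r, b in zip(rewards, baselines)]
print(baselines, advantages)
```

With uniform similarities the weighted baseline reduces to the group mean, so the sketch differs from the shared-baseline case only when trajectories actually diverge.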

135. ❌ YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

Authors: You Wu, Ziheng Chen, Yizhen Zhang, Haoyi Wu, Chengting Yu, Yuchi Xu, Wenbo Su, Bo Zheng, Kewei Tu Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13556v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

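Scoring rationale: The paper's core topic is KV cache compression (the YOCO++ method), which is highly relevant to "KV Cache Compression" (10 points) and belongs squarely to efficient LLM inference techniques. The paper explicitly targets LLMs, so it is highly relevant to "Large Language Models" (10 points). It mentions "efficient inference", which has some relevance to "Inference Acceleration" (5 points). Other keywords, such as MoE, quantization, and alignment, are not covered and receive 0 points.

!!! tip deepseek-chat TL;DR

To address the performance degradation of the cross-layer KV cache compression method YOCO in LLM inference, the paper proposes YOCO++, which adds weighted residual connections to improve model performance while keeping the same efficiency, achieving state-of-the-art performance among cross-layer KV compression methods at a 50% KV cache compression rate.

Abstract (Chinese translation)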

跨层键值（KV）压缩已被证明对大型语言模型（LLMs）的高效推理具有显著效果。尽管这些方法能够减少KV缓存的存储消耗，但它们通常会引入不可忽视的性能下降。在本研究中，我们旨在提升YOCO——一种将中间层KV与上半部分层共享的跨层KV压缩方法——的性能。我们提出了YOCO++，这是一种增强版的YOCO，它在每个下半部分层的KV与底层之间引入了加权残差连接。与YOCO相比，YOCO++在保持相同训练和推理效率的同时，增强了模型容量。实验结果表明，在50%的KV缓存压缩率下，YOCO++在跨层KV压缩方法中实现了最先进的性能，超越了标准Transformer模型。

Abstract

Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.

Keywords: KV cache compression, efficient inference, large language models, cross-layer compression, weighted residual connection, YOCO++, inference acceleration, Transformer optimization
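
The weighted residual connection described in the abstract above can be pictured with a toy example. The abstract gives no formula, so the additive form `kv_l + alpha_l * kv_bottom`, the use of one scalar weight per layer, and the function name are assumptions; real KV tensors would have head, sequence, and feature dimensions rather than flat lists.

```python
def add_kv_residual(layer_kvs, alphas):
    """Mix the bottom layer's KV into every other bottom-half layer.

    layer_kvs: per-layer KV vectors (lists of floats); index 0 is the bottom layer.
    alphas: one assumed scalar weight per layer above the bottom layer.
    """
    bottom = layer_kvs[0]
    mixed = [list(bottom)]  # the bottom layer itself is left unchanged
    for kv, alpha in zip(layer_kvs[1:], alphas):
        mixed.append([x + alpha * b for x, b in zip(kv, bottom)])
    return mixed


# Three hypothetical bottom-half layers with 2-dimensional KV vectors.
layer_kvs = [[1.0, 2.0], [0.5, 0.5], [0.0, 1.0]]
mixed = add_kv_residual(layer_kvs, alphas=[0.5, 1.0])
print(mixed)
```

In this form the residual adds only one scalar per layer and no extra cached tensors, which is consistent with the abstract's claim that efficiency is unchanged relative to YOCO.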

136. ❌ Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate

Authors: Cunda Wang, Ziying Ma, Po Hu, Weihua Wang, Feilong Bao Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

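Scoring rationale: The paper performs entity alignment with large language models (LLMs) and a multi-agent system, so it is highly relevant to 'Large Language Models' (core method), 'LLM Agents' (framework basis), and 'Multi-agent Systems' (core mechanism), each rated 10 points. It mentions preference optimization and alignment, which is partially related to 'Instruction Tuning OR Alignment' (8 points). Other keywords such as MoE, quantization, inference acceleration, and AI for Science do not appear in the abstract and are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes AgentEA, a reliable entity alignment framework based on multi-agent debate, whose two-stage debate mechanism improves entity alignment across knowledge graphs.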

摘要翻译

实体对齐（Entity Alignment，EA）旨在识别不同知识图谱（Knowledge Graphs，KGs）中指向同一现实世界对象的实体。近期基于大语言模型（Large Language Models，LLMs）的方法通常通过知识表示学习获取实体嵌入，并利用嵌入相似度识别一个对齐不确定实体集。针对每个不确定实体，基于嵌入相似度检索候选实体集（Candidate Entity Set，CES），以支持后续的对齐推理与决策。然而，CES的可靠性以及大语言模型的推理能力对后续对齐决策的效果具有关键影响。为解决这一问题，我们提出了AgentEA，一种基于多智能体辩论的可靠实体对齐框架。AgentEA首先通过实体表示偏好优化提升嵌入质量，随后引入一个由轻量级辩论验证和深度辩论对齐构成的两阶段多角色辩论机制，逐步提升对齐决策的可靠性，同时实现更高效的基于辩论的推理。在跨语言、稀疏、大规模及异构场景下的公开基准测试上的大量实验验证了AgentEA的有效性。

Abstract

Entity alignment (EA) aims to identify entities referring to the same real-world object across different knowledge graphs (KGs). Recent approaches based on large language models (LLMs) typically obtain entity embeddings through knowledge representation learning and use embedding similarity to identify an alignment-uncertain entity set. For each uncertain entity, a candidate entity set (CES) is then retrieved based on embedding similarity to support subsequent alignment reasoning and decision making. However, the reliability of the CES and the reasoning capability of LLMs critically affect the effectiveness of subsequent alignment decisions. To address this issue, we propose AgentEA, a reliable EA framework based on multi-agent debate. AgentEA first improves embedding quality through entity representation preference optimization, and then introduces a two-stage multi-role debate mechanism consisting of lightweight debate verification and deep debate alignment to progressively enhance the reliability of alignment decisions while enabling more efficient debate-based reasoning. Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.

Keywords: Entity Alignment, Large Language Models, Multi-agent Debate, Knowledge Graphs, Preference Optimization, Reliable Alignment, Two-stage Debate, AgentEA
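
The candidate-retrieval step described in the abstract above (picking a candidate entity set by embedding similarity) can be sketched as follows. This is a toy illustration, not AgentEA's implementation: the embeddings are made up, and the margin-based `is_uncertain` rule is a hypothetical stand-in, since the abstract does not say how the alignment-uncertain set is chosen.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_entity_set(source_vec, target_vecs, k=2):
    """Return the k target entity names most similar to the source embedding."""
    ranked = sorted(
        target_vecs, key=lambda name: cosine(source_vec, target_vecs[name]), reverse=True
    )
    return ranked[:k]


def is_uncertain(source_vec, target_vecs, margin=0.05):
    """Hypothetical rule: uncertain when the top two similarities are within `margin`."""
    sims = sorted((cosine(source_vec, v) for v in target_vecs.values()), reverse=True)
    return len(sims) > 1 and sims[0] - sims[1] < margin


# Toy 2-D embeddings for entities in a second knowledge graph (made-up values).
targets = {"Paris_(city)": [1.0, 0.1], "Paris_(Texas)": [0.9, 0.2], "Berlin": [0.0, 1.0]}
query = [1.0, 0.15]

ces = candidate_entity_set(query, targets, k=2)
needs_debate = is_uncertain(query, targets)
```

In this toy case the two "Paris" entities score almost identically, so the query would be flagged as uncertain and passed on to a reasoning stage such as the debate mechanism the paper describes.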

137. ❌ Synthesizing Instruction-Tuning Datasets with Contrastive Decoding

Authors: Tatsuya Ichinose, Youmi Ma, Masanari Oi, Ryuto Koike, Naoaki Okazaki Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13538v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

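Scoring rationale: The paper studies how to generate instruction-tuning datasets, using contrastive decoding to separate pre-trained knowledge from instruction-following ability. Highly relevant keywords: LLMs (the object of study), Pre-training (distinguishing pre-trained knowledge), Post-training/SFT (instruction tuning is a form of post-training), and Instruction Tuning (the core topic). Other keywords such as MoE, SLMs, RLHF, and RAG are not mentioned in the abstract and are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes CoDIT, which uses contrastive decoding to separate pre-trained knowledge from instruction-following ability when generating instruction-tuning data. In its experiments, models trained on CoDIT data outperform those trained on directly generated responses and on existing public datasets.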

摘要翻译

利用高性能大语言模型生成的响应进行指令微调已成为一种广泛采用的方法。然而，现有文献忽视了大语言模型生成响应的一个特性：它们将预训练阶段获得的世界知识与后训练阶段获得的指令遵循能力混为一谈。我们假设，将指令遵循能力与预训练知识进行解耦，能够提升指令微调的效果。为此，我们提出了CoDIT方法，该方法在响应生成过程中，在后训练模型与其预训练对应模型之间应用对比解码。该方法抑制了两个模型之间共享的预训练知识，同时放大了通过后训练获得的指令遵循行为，从而产生更纯粹反映指令遵循能力的响应。实验结果表明，使用通过CoDIT构建的数据集训练的模型，其性能始终优于使用直接生成响应训练的模型。在多个基准测试中，使用我们的数据集进行训练也优于使用现有公开指令微调数据集训练的性能。此外，我们从理论和实证上证明，CoDIT可被理解为将参数空间中的“对话向量”蒸馏到文本空间，从而实现了指令微调能力在不同架构模型间的迁移。

Abstract

Using responses generated by high-performing large language models (LLMs) for instruction tuning has become a widely adopted approach. However, the existing literature overlooks a property of LLM-generated responses: they conflate world knowledge acquired during pre-training with instruction-following capabilities acquired during post-training. We hypothesize that disentangling the instruction-following capabilities from pre-trained knowledge improves the effectiveness of instruction tuning. To this end, we propose CoDIT, a method that applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. The method suppresses pre-trained knowledge shared between the two models while amplifying the instruction-following behavior acquired via post-training, resulting in responses that more purely reflect instruction-following capabilities. Experiment results demonstrate that models trained on datasets constructed via CoDIT consistently outperform those trained on directly generated responses. Training on our datasets also yields better performance than on existing publicly available instruction-tuning datasets across multiple benchmarks. Furthermore, we theoretically and empirically show that CoDIT can be interpreted as distilling the chat vector from parameter space to text space, enabling the transfer of instruction-tuning capabilities across models of different architectures.

Keywords: instruction tuning, contrastive decoding, large language models, pre-training, post-training, dataset synthesis, knowledge disentanglement, CoDIT
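
To make the idea in the abstract above concrete, here is a minimal sketch of a generic contrastive decoding step between a post-trained model and its pre-trained counterpart. It follows the commonly used recipe (score tokens by `log p_post - alpha * log p_pre`, restricted to tokens the post-trained model finds plausible); the exact CoDIT formulation is not given in the abstract, and the logits, `alpha`, and `plausibility` values here are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_next_token(post_logits, pre_logits, alpha=1.0, plausibility=0.1):
    """Pick the token maximizing log p_post - alpha * log p_pre among plausible tokens."""
    p_post = softmax(post_logits)
    p_pre = softmax(pre_logits)
    cutoff = plausibility * max(p_post)
    best_token, best_score = None, -math.inf
    for token, (pp, pb) in enumerate(zip(p_post, p_pre)):
        if pp < cutoff:
            continue  # skip tokens the post-trained model itself finds implausible
        score = math.log(pp) - alpha * math.log(pb)
        if score > best_score:
            best_token, best_score = token, score
    return best_token


# Made-up next-token logits over a three-word vocabulary.
vocab = ["the", "Sure", "zzz"]
post = [2.0, 1.8, -5.0]  # post-trained (chat) model
pre = [2.5, 0.0, -5.0]   # pre-trained (base) model

choice = vocab[contrastive_next_token(post, pre)]
```

Greedy decoding from the post-trained model alone would pick "the" here; the contrastive score instead favors "Sure", the token whose probability rose most relative to the base model, which is the kind of instruction-following signal the abstract says the method amplifies.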

138. ❌ ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

Authors: Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13519v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

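Scoring rationale: The paper focuses on accelerating LLM tool calling, so it is highly relevant to 'Large Language Models', 'Tool Use', and 'Speculative Decoding' (10 points each), which are its core techniques. 'Retrieval-Augmented Generation' and 'LLM Agents' are related (8 points each) because the paper retrieves historical tool invocations to reuse as drafts and addresses tool calling within LLM agent workflows. Other keywords such as MoE, quantization, and alignment are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper targets the latency of LLM tool calling in multi-step interactions and proposes ToolSpec, a schema-aware, retrieval-augmented speculative decoding method that achieves up to a 4.2x speedup.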

摘要翻译

工具调用通过使大语言模型（LLM）能够与外部应用程序交互，极大地扩展了其实用性。随着LLM能力的进步，有效的工具使用日益涉及多步骤、多轮次的交互以解决复杂任务。然而，由此产生的工具交互增长带来了显著的延迟，这对实时LLM服务构成了关键挑战。通过实证分析，我们发现工具调用轨迹具有高度结构性，遵循受限的模式（schema），并且经常表现出重复的调用模式。受此启发，我们提出了ToolSpec，一种基于模式感知、检索增强的推测解码方法，用于加速工具调用。ToolSpec利用预定义的工具模式生成准确的草稿，使用有限状态机在确定性模式令牌填充和可变字段的推测生成之间交替进行。此外，ToolSpec检索相似的历史工具调用并将其复用为草稿，以进一步提升效率。ToolSpec提供了一种即插即用的解决方案，可以无缝集成到现有的LLM工作流程中。在多个基准测试上的实验表明，ToolSpec实现了高达4.2倍的加速，显著优于现有的无需训练的推测解码方法。

Abstract

Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.

Keywords: Tool Calling, Large Language Models, Speculative Decoding, Retrieval-Augmented, Schema-Aware, Inference Acceleration, Multi-step Interactions, Real-time Serving
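
The drafting idea in the abstract above (copy deterministic schema tokens, speculate only on variable fields, reuse a similar past call) can be illustrated with a toy example. This is a sketch under strong assumptions rather than ToolSpec itself: the schema template, the `<slot>` notation, and the helper names are hypothetical, and the target model is simulated by a fixed token list instead of real verification against model probabilities.

```python
def build_draft(schema_tokens, retrieved_values):
    """Walk the schema template: copy fixed tokens, fill slots from a retrieved call."""
    draft = []
    for token in schema_tokens:
        if token.startswith("<") and token.endswith(">"):
            # Variable field: speculate using the value from a similar past call.
            draft.append(retrieved_values.get(token[1:-1], ""))
        else:
            # Deterministic schema token: proposed without any model call.
            draft.append(token)
    return draft


def accepted_prefix_len(draft, target):
    """Number of leading draft tokens that match the target model's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def verification_passes(draft, target):
    """Count target-model passes: each accepts a matching run plus one target token."""
    pos, passes = 0, 0
    while pos < len(target):
        passes += 1
        run = accepted_prefix_len(draft[pos:], target[pos:])
        pos += run + 1  # the pass itself yields one token from the target model
    return passes


# Hypothetical tool schema with two variable slots, and a retrieved historical call.
schema = ['{"name":', '"get_weather"', ',"args":{"city":', "<city>", ',"unit":', "<unit>", "}}"]
history = {"city": '"Paris"', "unit": '"C"'}
# Tokens the (simulated) target model would actually emit for the current request.
target = ['{"name":', '"get_weather"', ',"args":{"city":', '"Tokyo"', ',"unit":', '"C"', "}}"]

draft = build_draft(schema, history)
passes = verification_passes(draft, target)
```

In this toy case only the city differs from the retrieved call, so the seven-token call needs two verification passes instead of seven one-token decoding steps.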

139. ❌ Using reasoning LLMs to extract SDOH events from clinical notes

Authors: Ertan Doganl, Kunyu Yu, Yifan Peng Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13502v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

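Scoring rationale: The paper uses reasoning-capable large language models (LLMs) to extract Social Determinants of Health (SDOH) events from clinical notes, an application of LLMs in the biomedical domain. Highly relevant keywords: LLMs (core method), Chain of Thought/System 2 Thinking (the paper emphasizes reasoning ability), In-context Learning (few-shot learning is used), and AI for Science (biomedical application). Self-Correction is partially related (a self-consistency mechanism is used). Other keywords such as MoE, SFT, and RAG are not covered and are rated 0.

!!! tip deepseek-chat TL;DR

The study uses reasoning-capable LLMs with prompt engineering and few-shot learning to extract SDOH events from clinical notes, reaching a micro-F1 score of 0.866, which the authors report as competitive with leading models.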

摘要翻译

健康社会决定因素（Social Determinants of Health, SDOH）指影响个体生活、工作与衰老过程的环境、行为及社会条件。SDOH对个人健康结果具有显著影响，对其进行系统性识别与管理可大幅改善患者护理水平。然而，SDOH信息主要记录于电子健康记录的非结构化临床文本中，这限制了其作为机器可读数据的直接应用。为解决此问题，研究者已采用基于预训练BERT模型的自然语言处理技术，虽展现出良好性能，但需要复杂实现过程与大量计算资源。本研究探索了利用具备高级推理能力的大语言模型提取结构化SDOH事件的提示工程策略。我们的方法包含四个模块：1）结合既有指南开发简洁描述性提示模板，2）应用经精细筛选示例的小样本学习，3）采用自洽性机制确保输出稳定性，4）通过后处理进行质量控制。该方法取得了0.866的微平均F1分数，与主流模型相比展现出竞争优势。结果表明，具备推理能力的大语言模型为SDOH事件提取提供了有效解决方案，兼具实施简便性与卓越性能。

Abstract

Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.

Keywords: Large Language Models, reasoning capabilities, SDOH event extraction, clinical notes, prompt engineering, few-shot learning, self-consistency, bioinformatics
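
Two pieces of the pipeline in the abstract above lend themselves to a short sketch: self-consistency over several sampled outputs and the micro-F1 metric. The voting rule below (keep an event if it appears in more than half of the samples) is an assumption, since the abstract does not describe the paper's exact aggregation, and the event labels are made up for illustration.

```python
from collections import Counter


def self_consistent_events(samples, min_fraction=0.5):
    """Keep events that appear in more than `min_fraction` of the sampled outputs."""
    counts = Counter(event for sample in samples for event in set(sample))
    threshold = min_fraction * len(samples)
    return sorted(event for event, n in counts.items() if n > threshold)


def micro_f1(predicted, gold):
    """Micro-averaged F1 over per-note lists of predicted and gold events."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    fp = sum(len(set(p) - set(g)) for p, g in zip(predicted, gold))
    fn = sum(len(set(g) - set(p)) for p, g in zip(predicted, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Three sampled extractions for one clinical note (made-up events).
samples = [
    [("Tobacco", "current"), ("Employment", "unemployed")],
    [("Tobacco", "current")],
    [("Tobacco", "current"), ("Alcohol", "none")],
]
gold = [("Tobacco", "current"), ("Employment", "unemployed")]

voted = self_consistent_events(samples)
score = micro_f1([voted], [gold])
```

Here only the tobacco event survives the vote, so precision is perfect but one gold event is missed, which illustrates the usual trade-off: a stricter vote stabilizes outputs at some cost in recall.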

140. ❌ From Relevance to Authority: Authority-aware Generative Retrieval in Web Search Engines

Authors: Sunkyung Lee, Jihye Back, Donghyeon Jeon, Soonhwan Kwon, Moonkwon Kim, Inho Kang, Jongwuk Lee Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13468v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

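Scoring rationale: The paper studies Generative Information Retrieval (GenIR), a retrieval task built on large language models (LLMs), so it is highly relevant to 'Large Language Models' (10 points). Its Authority-aware Generative Retriever (AuthGR) framework is, in essence, an improvement on retrieval-augmented generation (RAG), giving high relevance to 'Retrieval-Augmented Generation' (10 points). The paper addresses document trustworthiness and authority to reduce the retrieval of unreliable information, which is partially related to 'Hallucination Mitigation OR Factuality OR Truthfulness' (8 points). Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and AI for Science are neither mentioned in nor related to the abstract and are rated 0.

!!! tip deepseek-chat TL;DR

The paper addresses generative information retrieval's focus on relevance at the expense of document authority. It proposes AuthGR, the first framework to incorporate authority into generative retrieval, combining multimodal authority scoring, a three-stage training pipeline, and a hybrid ensemble pipeline. Offline and online evaluations show improved authority, accuracy, and user engagement.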

摘要翻译

生成式信息检索（Generative Information Retrieval，GenIR）将检索过程构建为文本到文本的生成任务，利用大语言模型的广泛知识。然而，现有研究主要优化相关性，往往忽视文档的可信度。这在医疗和金融等高风险领域至关重要，仅依赖语义相关性进行检索可能获取不可靠信息。为解决这一问题，我们提出了权威感知生成式检索器（Authority-aware Generative Retriever，AuthGR），这是首个将权威性纳入GenIR的框架。AuthGR包含三个关键组件：（i）多模态权威性评分，采用视觉语言模型从文本和视觉线索中量化权威性；（ii）三阶段训练流程，逐步将权威意识注入检索器；（iii）混合集成流程，用于鲁棒部署。离线评估表明，AuthGR成功提升了权威性和准确性，我们的30亿参数模型性能可匹配140亿参数的基线模型。关键的是，在商业网络搜索平台上进行的大规模在线A/B测试和人工评估证实，该方法在实际用户参与度和可靠性方面均有显著提升。

Abstract

Generative information retrieval (GenIR) formulates the retrieval process as a text-to-text generation task, leveraging the vast knowledge of large language models. However, existing works primarily optimize for relevance while often overlooking document trustworthiness. This is critical in high-stakes domains like healthcare and finance, where relying solely on semantic relevance risks retrieving unreliable information. To address this, we propose an Authority-aware Generative Retriever (AuthGR), the first framework that incorporates authority into GenIR. AuthGR consists of three key components: (i) Multimodal Authority Scoring, which employs a vision-language model to quantify authority from textual and visual cues; (ii) a Three-stage Training Pipeline to progressively instill authority awareness into the retriever; and (iii) a Hybrid Ensemble Pipeline for robust deployment. Offline evaluations demonstrate that AuthGR successfully enhances both authority and accuracy, with our 3B model matching a 14B baseline. Crucially, large-scale online A/B tests and human evaluations conducted on the commercial web search platform confirm significant improvements in real-world user engagement and reliability.

Keywords: Generative Information Retrieval, Authority-aware Retrieval, Large Language Models, Retrieval-Augmented Generation, Document Trustworthiness, Multimodal Authority Scoring, Web Search Engines, Online A/B Testing

141. ❌ CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

Authors: Ishani Mondal, Yiwen Song, Mihir Parmar, Palash Goyal, Jordan Boyd-Graber, Tomas Pfister, Yale Song Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13452v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

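Scoring rationale: CANVAS focuses on preserving continuity in visual storytelling and proposes a multi-agent framework that plans visual continuity across multi-shot narratives. Its core is visual storytelling and an agent framework, which is unrelated to most of the LLM keywords (LLM, MoE, Scaling Laws, training methods, inference optimization, and so on). Only two keywords apply: (1) 'LLM Agents OR Autonomous Agents OR Agentic Workflow': the paper explicitly uses a multi-agent framework, which is an agentic workflow, so relevance is high (10 points). (2) 'Multi-agent Systems OR Agent Coordination': the framework is a multi-agent system that coordinates visual continuity planning, so relevance is high (10 points). No other keywords are covered.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty of maintaining cross-shot continuity (consistent characters, stable backgrounds, smooth scene transitions) in long-form visual storytelling. It proposes CANVAS, a multi-agent framework that explicitly plans visual continuity and improves background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6% over the best-performing baseline.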

摘要翻译

长篇幅视觉叙事需要在多个镜头间保持连续性，包括角色一致性、环境稳定性以及场景转换的流畅性。尽管现有生成模型能够生成高质量的单帧画面，却无法维持此类连续性，导致外观变化、背景不一致及场景切换突兀等问题。本文提出CANVAS（基于视觉智能分镜的连续性感知叙事框架），这是一个通过多智能体系统显式规划多镜头叙事中视觉连续性的框架。CANVAS通过角色连续性、持久性背景锚点以及面向同一场景内平滑转换的位置感知场景规划来强化叙事连贯性。我们在两个分镜生成基准测试ST-BENCH和ViStoryBench上评估CANVAS，并针对长程叙事一致性提出了具有挑战性的新基准HardContinuityBench。实验表明CANVAS始终优于现有最佳基线模型，将背景连续性提升21.6%，角色一致性提升9.6%，道具一致性提升7.6%。

Abstract

Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting. We evaluate CANVAS on two storyboard generation benchmarks, ST-BENCH and ViStoryBench, and introduce a new challenging benchmark, HardContinuityBench, for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6%.

Keywords: visual storytelling, continuity, multi-agent framework, storyboard generation, character consistency, background continuity, scene planning, narrative consistency

142. ❌ Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Authors: Md. Fahad Ullah Utsho, Mohd. Ruhul Ameen, Akif Islam, Md. Golam Rashed, Dipankar Das Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13371v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

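Scoring rationale: The paper studies how large language models (LLMs) perform on complex reasoning tasks, which directly involves 'Large Language Models' and 'Chain of Thought/System 2 Thinking' (10 points each). It evaluates the robustness and factuality of model reasoning, which is partially related to 'Hallucination Mitigation' and 'Mechanistic Interpretability' (5 points each). Other keywords such as MoE, SFT, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper builds a controlled benchmarking framework to evaluate large reasoning models on nine classical reasoning tasks as problem complexity increases. Models perform well at low complexity, but accuracy drops sharply beyond task-specific thresholds, a phenomenon the authors call reasoning collapse.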

摘要翻译

大型语言模型（LLMs）日益被认为具备强大的推理能力，其在数学、逻辑和规划基准测试中的优异表现也支持了这一观点。然而，现有评估大多依赖于固定数据集的总体准确率，这掩盖了推理行为如何随任务复杂性增加而演变的过程。在本研究中，我们引入了一个受控的基准测试框架，以系统评估大型推理模型（LRMs）在问题复杂性逐步增加下的推理鲁棒性。我们构建了一套包含九项经典推理任务的测试集：布尔可满足性问题、算术谜题、图着色问题、渡河问题、汉诺塔、水壶问题、跳棋问题、数独和魔方还原，每项任务均经过参数化设计，以在保持底层语义的同时精确控制复杂性。通过使用确定性验证器，我们在低、中、高三种复杂性区间内评估了多个开源和专有LRMs，确保仅接受完全有效的解决方案。我们的结果揭示了一种一致的类相变行为：模型在低复杂性下能达到高准确率，但一旦超过任务特定的复杂性阈值，其性能便急剧下降。我们将此现象形式化为“推理崩溃”。在所有任务中，我们观察到准确率的大幅下降（通常超过50%），并伴随着不一致的推理路径、约束违反、状态跟踪丢失以及自信的错误输出。增加推理长度并不能可靠地提升正确性，且在一个问题族上的性能提升也无法迁移到其他问题族。这些发现凸显了评估方法需要超越静态基准测试，并应在受控复杂性下明确衡量推理鲁棒性的必要性。

Abstract

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik’s Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase transition like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.

Keywords: Large Language Models, reasoning capabilities, complexity thresholds, reasoning collapse, controlled benchmarking, deterministic validators, state tracking, constraint violations
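
The abstract above relies on deterministic validators that accept only fully valid solutions. A minimal sketch of such a validator for one of the nine tasks, Tower of Hanoi, is shown below. It illustrates the general idea and is not the paper's code; the move encoding (pairs of peg indices) and the function names are assumptions.

```python
def validate_hanoi(num_disks, moves):
    """Return True only if `moves` legally transfers all disks from peg 0 to peg 2."""
    pegs = [list(range(num_disks, 0, -1)), [], []]  # largest disk at the bottom
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or not pegs[src]:
            return False  # invalid peg index, or moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(num_disks, 0, -1))


def solve_hanoi(n, src=0, aux=1, dst=2):
    """Reference recursive solution; its length is 2**n - 1 moves."""
    if n == 0:
        return []
    return solve_hanoi(n - 1, src, dst, aux) + [(src, dst)] + solve_hanoi(n - 1, aux, src, dst)


reference = solve_hanoi(3)
is_valid = validate_hanoi(3, reference)
```

The number of disks acts as the complexity parameter: the shortest solution grows as `2**n - 1` moves, and a single illegal or missing move makes the validator reject the whole answer, which matches the all-or-nothing acceptance described in the abstract.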

143. ❌ TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models

Authors: Yarui Cao, Kai Liu Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13368v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	15.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

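Scoring rationale: The paper's core contribution is TLoRA+, a new parameter-efficient fine-tuning method designed specifically for large language models. It is therefore highly relevant to 'PEFT/LoRA/Parameter-efficient Fine-tuning' (15 points), which is the paper's central innovation. The paper explicitly studies LLM fine-tuning, so it is also highly relevant to 'Large Language Models' and 'Post-training/Supervised Fine-tuning' (10 points). Other keywords such as MoE, SLMs, RAG, and quantization are not mentioned in the abstract and are unrelated to the paper (0 points).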

!!! tip deepseek-chat TL;DR

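The paper proposes TLoRA+, a new parameter-efficient fine-tuning method for large language models. It preserves the efficiency of low-rank adaptation while further improving performance without significantly increasing computational cost, and its effectiveness and robustness are validated on the GLUE benchmark.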

摘要翻译

微调大型语言模型旨在利用相对较小且领域特定的数据集，使预训练模型适应特定任务。在参数高效微调方法中，低秩自适应技术因能达到与全参数微调相当的性能，同时避免额外的推理延迟而表现突出。本文提出一种新颖的参数高效微调方法，将TLoRA+优化器融入预训练模型的权重矩阵中。该方法不仅保持了低秩自适应的效率，还能在不显著增加计算成本的前提下进一步提升性能。我们在GLUE基准测试中针对多种模型架构进行了实验。数值实验结果一致证明了所提出方法的有效性与鲁棒性。

摘要 (Abstract)

Fine-tuning large language models (LLMs) aims to adapt pre-trained models to specific tasks using relatively small and domain-specific datasets. Among Parameter-Efficient Fine-Tuning (PEFT) methods, Low-Rank Adaptation (LoRA) stands out by matching the performance of full fine-tuning while avoiding additional inference latency. In this paper, we propose a novel PEFT method that incorporates the TLoRA+ optimizer into the weight matrices of pre-trained models. The proposed approach not only preserves the efficiency of low-rank adaptation but also further enhances performance without significantly increasing computational cost. We conduct experiments on the GLUE benchmark across diverse model architectures. Numerical experiments consistently demonstrate the effectiveness and robustness of our proposed method.

Keywords: Parameter-Efficient Fine-Tuning, LoRA, Large Language Models, Fine-tuning, Low-Rank Adaptation, TLoRA+, GLUE benchmark, Computational efficiency
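
The abstract notes that LoRA avoids additional inference latency. The reason is that the low-rank update can be folded into the frozen weight matrix once training is done. The sketch below illustrates this for standard LoRA only; the abstract does not specify how TLoRA+ modifies it, so nothing here should be read as the TLoRA+ method. The matrix values, the `scale` parameter, and the helper names are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Matrix, scale: float = 1.0) -> Matrix:
    """Adapter kept separate from the frozen weight: y = W x + scale * B (A x)."""
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    return [[base[i][j] + scale * update[i][j] for j in range(len(base[0]))]
            for i in range(len(base))]


def merge_lora(W: Matrix, A: Matrix, B: Matrix, scale: float = 1.0) -> Matrix:
    """Fold the adapter into the weight: W' = W + scale * B A (same shape as W)."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


# Toy shapes (illustrative values): d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen pre-trained weight, d_out x d_in
A = [[1.0, 0.0, 1.0]]                   # trainable, r x d_in
B = [[2.0], [3.0]]                      # trainable, d_out x r
x = [[1.0], [1.0], [1.0]]               # input column vector, d_in x 1

y_adapter = lora_forward(W, A, B, x)       # two extra small matmuls per call
y_merged = matmul(merge_lora(W, A, B), x)  # a single matmul after merging
```

Both paths give the same output, so the merged model runs at the cost of the original layer. Only `A` and `B` are trained, which is r × (d_in + d_out) parameters instead of d_in × d_out.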

144. ❌ AgentSPEX: An Agent SPecification and EXecution Language

Authors: Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13346v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

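Scoring rationale: The paper's core contribution is the AgentSPEX language and framework for specifying and executing LLM-agent workflows in a structured, modular way. It addresses the unclear control flow and state management, and the tight coupling with Python, that make existing approaches (such as LangGraph, DSPy, and CrewAI) hard to maintain. It is therefore highly relevant to LLM Agents and Tool Use (10 points) and somewhat relevant to Multi-agent Systems, Chain of Thought, System 2 Thinking, Explainable AI, and AI for Science (5 points), since these involve agent coordination, reasoning, interpretability, and scientific applications. Other keywords such as MoE, SLMs, training techniques, and optimization methods are unrelated to the paper (0 points).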

!!! tip deepseek-chat TL;DR

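Existing language-model agent systems rely on reactive prompting, and their workflow logic is tightly coupled with Python, which makes them hard to control and maintain. To address this, the paper proposes the AgentSPEX language and framework, which specifies and executes LLM-agent workflows through explicit control flow, modular structure, and a visual editor, and demonstrates its effectiveness and accessibility on multiple benchmarks and in a user study.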

摘要翻译

语言模型智能体系统通常依赖反应式提示技术，即通过单一指令引导模型执行开放式推理与工具调用序列，这种方式将控制流与中间状态隐式化，可能导致智能体行为难以控制。诸如LangGraph、DSPy和CrewAI等编排框架通过显式工作流定义增强了结构性，但将工作流逻辑与Python代码紧密耦合，使得智能体难以维护和修改。本文提出AgentSPEX（智能体规范与执行语言），这是一种用于定义具备显式控制流与模块化结构的LLM智能体工作流的规范语言，并配套提供可定制的智能体执行框架。AgentSPEX支持类型化步骤、分支与循环、并行执行、可复用子模块以及显式状态管理；这些工作流可在智能体执行框架中运行，该框架提供工具调用接口、沙盒化虚拟环境，并支持检查点、验证与日志功能。此外，我们开发了具备同步图谱视图与工作流视图的可视化编辑器，用于工作流编写与检查。我们提供了面向深度调研与科学研究的即用型智能体，并在7个基准测试上评估了AgentSPEX的性能。最后，通过用户研究表明，相较于现有主流智能体框架，AgentSPEX提供了更具可解释性与易用性的工作流编写范式。

摘要 (Abstract)

Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.

Keywords: LLM-agent workflows, control flow, modular structure, tool access, visual editor, agent harness, workflow specification, scientific research

145. ❌ English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

Authors: Mehak Dhaliwal, Shashwat Chaurasia, Yao Qin, Dezhi Hong, Thomas Butler Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13286v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

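Scoring rationale: The paper's core topic is multilinguality in LLM post-training. It is highly relevant to 'Large Language Models' and 'Post-training OR Supervised Fine-tuning OR SFT' (10 points), because it systematically studies multilingual post-training through 220 SFT runs. It is somewhat relevant to 'Tool Use OR Function Calling OR API Tool Use' (5 points), because API calling is one of the experimental tasks. Other keywords such as MoE, SLMs, Scaling Laws, and Instruction Tuning are not mentioned in the abstract and score 0.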

!!! tip deepseek-chat TL;DR

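The paper systematically studies how multilingual coverage affects LLM post-training. It finds that increasing language coverage is broadly beneficial, that adding even a single non-English language improves both English performance and cross-lingual generalization, and that, with sufficient language diversity, zero-shot cross-lingual transfer can match the effect of directly including a language in a low-diversity setting.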

摘要翻译

尽管大规模语言模型已实现多语言广泛部署，其训练后流程仍主要围绕英语展开，这导致不同语言间的性能差异。本研究基于220项在并行翻译的多语言数据混合集上进行的监督微调实验（涵盖数学推理与API调用任务，模型参数规模最高达80亿），对训练语言覆盖范围、模型规模与任务领域之间的相互作用展开了系统化对照研究。我们发现：在训练后阶段扩大语言覆盖范围对各类任务和模型规模普遍有益，其中低资源语言获益最大，高资源语言则呈现性能饱和而非下降。即使最低限度的多语言融入也有帮助：引入单一非英语语言既能提升英语任务表现，又能增强跨语言泛化能力，这使得纯英语训练后流程在很大程度上并非最优选择。此外，当语言多样性达到足够水平时，零样本跨语言迁移的效果可匹配甚至超越低多样性环境中直接包含特定语言的效果，但对于类型学差异大、资源稀缺的语言，其增益仍有限。

摘要 (Abstract)

Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

Keywords: multilingual, post-training, supervised fine-tuning, language coverage, cross-lingual transfer, large language models, API calling, mathematical reasoning

146. ❌ L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

Authors: Rishik Kondadadi, John E. Ortega Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13285v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

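Scoring rationale: The paper studies the use of large language models (LLMs) for clinical text classification, using a Learning to Defer framework to adaptively choose between a BERT model and an LLM; it is an applied innovation of large models in the biomedical domain. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points) and to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Other keywords such as MoE, SFT, RAG, and quantization are not addressed and score 0.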

!!! tip deepseek-chat TL;DR

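The paper proposes the L2D-Clinical framework, which adaptively selects between BERT and a large language model for clinical text classification, improving F1 over BERT by 1.7 points on ADE detection and by 9.3 points on MIMIC-IV treatment outcome classification.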

摘要翻译

临床文本分类需要在经过专门微调的模型（BERT变体）与通用大语言模型（LLMs）之间做出选择，但二者在所有实例中均未占据绝对优势。本文提出面向临床文本的学习延迟框架（L2D-Clinical），该框架通过学习不确定性信号与文本特征，决定BERT分类器何时应将任务转移给LLM进行处理。与先前假设人类专家始终更优的延迟学习研究不同，本方法实现了自适应延迟——当LLM能够弥补BERT不足时提升分类准确性。我们在两项英文临床任务上进行了评估：（1）药物不良事件检测（ADE Corpus V2数据集），其中BioBERT（F1=0.911）优于LLM（F1=0.765）；（2）治疗结果分类（基于MIMIC-IV数据库及多LLM共识标注的真实标签），其中GPT-5-nano（F1=0.967）优于ClinicalBERT（F1=0.887）。在药物不良事件检测任务中，L2D-Clinical通过选择性将7%的实例（这些实例中LLM的高召回率能弥补BERT的漏检）转移给LLM，实现了F1=0.928（较BERT提升1.7个百分点）。在MIMIC治疗结果分类任务中，仅需将16.8%的病例转移给LLM，L2D-Clinical即达到F1=0.980（较BERT提升9.3个百分点）。核心发现表明：L2D-Clinical能够学习选择性利用LLM的优势，同时最大限度控制API调用成本。

摘要 (Abstract)

Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral, improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM’s high recall compensates for BERT’s misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.

Keywords: clinical text classification, large language models, learning to defer, adaptive model selection, BERT, LLM, clinical NLP, uncertainty signals
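
The deferral idea in the abstract above (a BERT classifier hands an instance to an LLM when its own prediction looks unreliable) can be illustrated with a much simpler stand-in. The sketch below uses a fixed entropy threshold on the BERT probability. This is an assumption made for illustration: the paper learns its deferral policy from uncertainty signals and text characteristics, and the function names, the threshold value, and the `llm_predict` callable are all hypothetical.

```python
import math
from typing import Callable, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 1.0 means maximally uncertain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def predict_with_deferral(
    p_bert: float,
    llm_predict: Callable[[str], int],
    text: str,
    threshold: float = 0.9,  # hypothetical cut-off; a learned policy would replace this
) -> Tuple[int, bool]:
    """Return (label, deferred). The LLM is called only when BERT is uncertain."""
    if binary_entropy(p_bert) > threshold:
        return llm_predict(text), True
    return int(p_bert >= 0.5), False
```

Since the LLM is invoked only for deferred instances, the deferral rate directly controls API cost, which is the trade-off the abstract highlights (7% and 16.8% of instances deferred on the two tasks).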

147. ❌ Indexing Multimodal Language Models for Large-scale Image Retrieval

Authors: Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13268v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

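Scoring rationale: The paper studies the application of multimodal large language models (MLLMs) to image retrieval. It is highly relevant to 'Large Language Models' (10 points), because MLLMs extend LLMs. It is relevant to 'Pre-training' (8 points), because the paper leverages the visual discrimination that MLLMs learn during pre-training. It is somewhat relevant to 'Retrieval-Augmented Generation' (5 points), because the paper involves a retrieval task, although not retrieval-augmented generation. Other keywords such as MoE, SFT, and RLHF are not addressed in the paper and score 0.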

!!! tip deepseek-chat TL;DR

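The paper studies how to use multimodal large language models as training-free similarity estimators: the model is prompted with paired images and its next-token probabilities are converted into similarity scores, enabling zero-shot re-ranking in large-scale image retrieval. Experiments show that the approach outperforms task-specific re-rankers on multiple benchmarks and is more robust.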

摘要翻译

多模态大语言模型（MLLMs）已展现出强大的跨模态推理能力，但其在纯视觉任务中的潜力仍未得到充分探索。本研究将MLLMs作为无需训练的相似度估计器，用于实例级图像到图像检索。我们的方法通过输入成对图像提示模型，并将下一词元概率转换为相似度分数，从而实现在大规模检索流程中的零样本重排序。该设计避免了专用架构和微调，充分利用了多模态预训练期间学习到的丰富视觉判别能力。我们通过将MLLMs与内存高效索引及top-$k$候选重排序相结合，解决了可扩展性问题。在多样化基准测试上的实验表明，MLLMs在其原生领域之外超越了特定任务的重排序模型，并在处理杂乱背景、遮挡和小物体时表现出更强的鲁棒性。尽管取得了显著成果，我们也识别了在严重外观变化下的失效模式，这为未来研究指明了方向。我们的研究结果确立了MLLMs作为开放世界大规模图像检索领域一种具有前景的替代方案。

摘要 (Abstract)

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.

Keywords: Multimodal Large Language Models, image retrieval, similarity estimation, zero-shot re-ranking, cross-modal reasoning, training-free, large-scale retrieval, visual discrimination
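
The abstract above describes converting next-token probabilities into similarity scores and re-ranking only the top-$k$ candidates from an index. One plausible reading, sketched below, is to ask the model a yes/no question about an image pair and take the normalized probability of "yes" as the score. The yes/no formulation, the function names, and the `score_pair` callable are assumptions for this example; the abstract does not give the exact prompt or token set.

```python
import math
from typing import Callable, List


def similarity_from_logits(yes_logit: float, no_logit: float) -> float:
    """Softmax restricted to two answer tokens; returns P("yes") in [0, 1]."""
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


def rerank_top_k(candidates: List[str], score_pair: Callable[[str], float], k: int) -> List[str]:
    """Re-score only the first k candidates from the first-stage index; keep the tail order."""
    head = sorted(candidates[:k], key=score_pair, reverse=True)
    return head + candidates[k:]
```

Restricting the expensive model call to `k` candidates is what keeps the pipeline scalable: the cost is one forward pass per (query, candidate) pair, independent of the database size.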

148. ❌ Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

Authors: Dikshant Kukreja, Kshitij Sah, Gautam Gupta, Avinash Anand, Rajiv Ratn Shah, Zhengkui Wang, Aik Beng Ng, Erik Cambria Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13275v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

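Scoring rationale: The paper studies scaling laws for large language models (LLMs), analyzing how model size affects context handling, so it is highly relevant to 'Large Language Models' and 'Scaling Laws' (10 points). The study touches on resistance to misinformation (relevant to 'Hallucination Mitigation', 5 points), in-context learning behavior (relevant to 'In-context Learning', 5 points), and interpretability analysis of model behavior (relevant to 'Mechanistic Interpretability', 5 points). Other keywords such as MoE, SLMs, training methods, inference acceleration, and AI for Science are not addressed in the paper and score 0.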

!!! tip deepseek-chat TL;DR

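The paper studies how scaling language models affects context handling. As models grow, sensitivity to semantic context decreases (greater resistance to misinformation) while sensitivity to non-semantic context increases (greater tendency to copy irrelevant tokens), showing that scaling reshapes rather than resolves context sensitivity.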

摘要翻译

大型语言模型在处理上下文信息时呈现出同步增强与弱化的矛盾趋势——在忽略错误主张方面表现更优，在过滤无关标记方面表现更差。我们通过首次提出的上下文顺应性标度律对这一显性悖论进行形式化界定，该规律指模型倾向于偏好上下文中出现的标记而忽略其相关性的特性。通过分析Cerebras-GPT（111M-13B）和Pythia（410M-12B）模型系列，我们发现顺应性遵循可预测的幂律标度关系，但其变化趋势因上下文类型呈现相反走向：语义上下文的顺应性随模型规模扩大而递减，非语义上下文的顺应性则随规模扩大而递增。具体而言，最大模型对反事实错误信息的抵抗能力是最小模型的四倍，但同时复制任意标记的倾向性也达到两倍。这些在不同模型系列中复现的 divergent trends（分化趋势）表明，语义过滤与机械复制是两种功能独立且标度规律相悖的行为——单纯扩大模型规模并不能解决上下文敏感性问题，而是重塑了其表现形式。

摘要 (Abstract)

Larger language models become simultaneously better and worse at handling contextual information – better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition – scaling alone does not resolve context sensitivity, it reshapes it.

Keywords: scaling laws, contextual entrainment, large language models, model size, semantic context, non-semantic context, Cerebras-GPT, Pythia
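
The abstract reports that entrainment follows power-law scaling with opposite trends for semantic and non-semantic contexts. A power law y = a · N^b is a straight line in log-log space, so its exponent can be estimated with ordinary least squares on the logarithms. The sketch below shows that standard fit; the function name is hypothetical and no data from the paper is used.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(sizes: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of values ~ a * sizes**b in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope of the regression line is the power-law exponent b.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

Under this parameterization the sign of `b` carries the finding: a negative exponent corresponds to entrainment decreasing with model size (the semantic case in the abstract), and a positive exponent to entrainment increasing with size (the non-semantic case).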

149. ❌ Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

Authors: Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13258v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

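Scoring rationale: The paper develops a new attribution method (HETA) for decoder-only large language models (LLMs). Its core contribution is improving LLM interpretability, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Mechanistic Interpretability OR Explainable AI' (10 points). The paper does not involve the specific techniques (such as MoE, SFT, RAG, or quantization), application domains (such as AI for science), or capabilities (such as chain of thought or agents) described by the other keywords, which therefore score 0.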

!!! tip deepseek-chat TL;DR

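The paper proposes HETA, a new attribution framework for decoder-only large language models that combines a semantic transition vector, Hessian-based sensitivity scores, and KL divergence. Evaluations across multiple models and datasets show that it outperforms existing methods in attribution faithfulness and alignment with human annotations.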

摘要翻译

归因方法旨在通过量化输入词元对生成输出的贡献来解释语言模型的预测。然而，现有技术大多针对基于编码器的架构设计，并依赖于线性近似，这些方法无法捕捉仅解码器模型中自回归生成过程的因果与语义复杂性。为应对这些局限，我们提出海森增强词元归因（Hessian-Enhanced Token Attribution, HETA），这是一种专为仅解码器语言模型设计的新型归因框架。HETA融合了三个互补组件：捕捉跨层词元间影响的语义转移向量、建模二阶效应的基于海森矩阵的敏感度分数，以及通过KL散度衡量词元被遮蔽时的信息损失。这一统一设计能够生成具有上下文感知性、因果忠实性和语义基础性的归因结果。此外，我们构建了一个精选基准数据集，用于系统评估生成场景下的归因质量。在多个模型和数据集上的实证评估表明，HETA在归因忠实度以及与人工标注的一致性方面持续优于现有方法，为自回归语言模型的可解释性研究确立了新标准。

摘要 (Abstract)

Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.

Keywords: Attribution methods, Autoregressive LLMs, Decoder-only models, Interpretability, Hessian-enhanced, Token attribution, Explainable AI, Benchmark dataset
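
Of the three HETA components named in the abstract, the KL-divergence term is the easiest to illustrate in isolation: it measures how much the model's output distribution changes when an input token is masked. The sketch below covers only that term and leaves out the semantic transition vector and the Hessian-based scores. How the paper combines the three is not stated in the abstract, and the function names and the direction of the KL term are assumptions.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def masking_information_loss(
    full_dist: Sequence[float], masked_dists: Sequence[Sequence[float]]
) -> List[float]:
    """One score per input token: masked_dists[i] is the output distribution with token i masked."""
    return [kl_divergence(full_dist, masked) for masked in masked_dists]
```

A token whose masking leaves the output distribution unchanged scores zero, and a token whose masking shifts the distribution strongly scores high, which gives a simple perturbation-based importance signal.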

150. ❌ InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

Authors: Oliver Bentham, Vivek Srikumar Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13201v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

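Scoring rationale: The paper's core is evaluating large language models on scientific analysis, so it is highly relevant to 'Large Language Models' and 'AI for Science' (10 points). It involves evidence-grounded reasoning, tool use, and hallucination mitigation, so it is somewhat relevant to 'Retrieval-Augmented Generation', 'Chain of Thought', 'System 2 Thinking', 'LLM Agents', 'Tool Use', and 'Hallucination Mitigation' (5 points). Other keywords such as MoE, quantization, and training methods are not mentioned in the abstract and score 0.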

!!! tip deepseek-chat TL;DR

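The paper presents InfiniteScienceGym, a procedurally generated scientific-analysis benchmark for evaluating large language models on evidence-grounded reasoning, tool use, and recognizing unanswerable questions. No evaluated model exceeds 45% accuracy, and recognizing unanswerable questions remains a major weakness.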

摘要翻译

大型语言模型正逐渐成为科学助手，但评估其从经验数据中进行推理的能力仍具挑战性。基于已发表研究和人工标注构建的基准测试往往存在发表偏倚、已知知识偏倚、标签噪声以及巨大的存储需求等问题。本文提出InfiniteScienceGym——一个通过程序化生成科学知识库的基准测试框架，并配套可验证的问答任务。该模拟器从初始种子出发，能够确定性地生成一个包含逼真目录结构、文件及表格数据的自包含知识库，同时通过特权问答生成器产生可回答与不可回答的问题，并提供精确的标准答案。这使得在受控环境中评估证据驱动推理、答案保留机制以及工具辅助分析成为可能，而无需分发大规模静态语料库。InfiniteScienceGym通过针对传统基准测试的盲点和失效模式，弥补了真实科学基准的不足——这些缺陷仅靠已发表数据集往往难以评估。通过对专有模型和开源权重模型的测试，我们发现所有模型的总体准确率均未超过45%，识别不可回答问题仍是主要弱点，且性能更强的模型倾向于更有效地使用工具而非单纯消耗更多计算资源。

摘要 (Abstract)

Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.

Keywords: Large language models, Scientific analysis, Benchmark, Procedurally generated, Evidence-grounded reasoning, Tool-mediated analysis, Question-answering, Abstention
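
The abstract above describes a simulator that deterministically expands a seed into a repository and a privileged generator that emits answerable and unanswerable questions with exact ground truth. The toy sketch below shows the same mechanism at a very small scale: a seeded table plus one question that may refer to a column that does not exist. The column names, the question template, and the function names are all hypothetical and far simpler than the benchmark itself.

```python
import random
from typing import Any, Dict, Optional


def generate_task(seed: int) -> Dict[str, Any]:
    """Deterministically build a small table and one question from `seed`."""
    rng = random.Random(seed)  # a private generator, so the same seed always gives the same task
    columns = ["temperature", "pressure", "yield"]
    rows = [{c: rng.randint(0, 100) for c in columns} for _ in range(5)]
    # "humidity" never appears in the table, so asking about it is unanswerable.
    asked = rng.choice(columns + ["humidity"])
    question = f"What is the maximum value of '{asked}'?"
    # The generator sees the table it just built, so the ground truth is exact.
    answer: Optional[int] = max(r[asked] for r in rows) if asked in columns else None
    return {"rows": rows, "question": question, "answer": answer}


def grade(task: Dict[str, Any], response: Optional[int]) -> bool:
    """Exact-match grading; a response of None means the model abstained."""
    return response == task["answer"]
```

Only the seed needs to be shared to reproduce a task, which is how a generated benchmark avoids distributing a large static corpus, and abstention is graded with the same exact-match rule as any other answer.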

151. ❌ Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

作者: Bach Phan-Tat, Kris Heylen, Dirk Geeraerts, Stefano De Pascale, Dirk Speelmana 期刊/来源: arxiv 发布日期: 2026-04-14 arXiv链接: http://arxiv.org/abs/2604.13232v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

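Scoring rationale: This paper is a critical evaluation of the SemEval-2020 Task 1 benchmark, focusing on operationalisation, data quality, and benchmark design in lexical semantic change detection. It belongs to computational linguistics and natural language processing, but it does not touch large models, the principles of deep learning techniques, or innovative applications of AI in science. All scoring keywords target specific directions such as large-model techniques, training methods, inference optimization, alignment, and application scenarios, whereas this paper discusses the evaluation framework and methodology of a traditional NLP benchmark, so it has no connection to any of them.

!!! tip deepseek-chat TL;DR

This paper critically evaluates the SemEval-2020 Task 1 benchmark, identifies limitations in its operationalisation, data quality, and benchmark design, and calls on future work to adopt broader theories of semantic change, improve data transparency, and expand language coverage.

Abstract translation (Chinese)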

本讨论性论文通过一个包含操作化、数据质量和基准设计三部分的评估框架，重新审视了词汇语义变化检测领域最具影响力的共享基准——SemEval-2020 Task 1。首先，在操作化层面，我们认为该基准主要将语义变化建模为离散义项的增加、减少或重新分配。尽管这种框架便于标注和评估，但其范围过于狭窄，无法捕捉渐进的、构式的、搭配的以及话语层面的变化。此外，其黄金标准标签是标注决策、聚类流程和阈值设定的结果，这可能限制任务的有效性。其次，在数据质量层面，我们指出该基准受到显著的语料库及预处理问题的影响，包括OCR噪声、畸形字符、截断句子、不一致的词形还原、词性标注错误以及目标词遗漏。这些问题可能扭曲模型行为，使语言分析复杂化，并降低可复现性。第三，在基准设计层面，我们认为其精心挑选的小规模目标词集和有限的语言覆盖范围降低了现实性，并增加了统计不确定性。综上所述，这些局限性表明，该基准应被视为一个有用但不完整的测试平台，而非衡量进展的绝对标准。因此，我们呼吁未来的数据集和共享任务采用更广泛的语义变化理论，透明记录预处理过程，扩展跨语言覆盖范围，并使用更贴近现实的评估设置。这些步骤对于在词汇语义变化检测领域取得更有效、可解释且可推广的进展至关重要。

摘要 (Abstract)

This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of benchmark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection.

Keywords: lexical semantic change detection, benchmark evaluation, SemEval-2020 Task 1, data quality, operationalisation, corpus preprocessing, evaluation framework, linguistic analysis

152. ❌ Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

Authors: Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13197v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

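Scoring rationale: The paper mainly studies reinforcement learning (RL) applied to the reasoning process, in particular optimizing reasoning steps through implicit process reward models (PRMs) and distribution-level RL. The core relevant keywords are: 1) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (10): the paper directly studies reward models and RL optimization in online RL, which is its core method. 2) 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' (10): the paper evaluates the reasoning process on ProcessBench and directly involves multi-step reasoning. 3) 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (10): the paper focuses on step-by-step verification and in-depth analysis of the reasoning process. 4) 'Large Language Models OR LLMs OR Foundation Models' (8): although the paper does not explicitly mention LLMs, PRMs and reasoning optimization are usually applied in the large-model context, so there is some relevance. Other keywords such as MoE, quantization, and RAG are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the unreliability of token-level rewards in implicit process reward models, which stems from a train-inference mismatch. It proposes IPVRM, which learns a prefix-conditioned value function and derives step signals via temporal differences, substantially improving step-verification F1 on ProcessBench. It further proposes a distribution-level RL method that enables dense counterfactual updates.

Abstract translation (Chinese)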

过程奖励模型（PRMs）能够沿推理路径提供细粒度的奖励信号，但训练可靠的PRMs通常需要步骤标注或繁重的验证流程，这使得其在在线强化学习（RL）中难以规模化扩展和更新。隐式过程奖励模型通过从轨迹级结果标签中学习可分解的词元级或步骤级奖励，降低了这一成本。然而，它们存在训练-推断不匹配问题：训练仅约束序列级聚合指标，而推断则需要词元级分数以反映局部步骤质量。这导致词元级奖励的识别较弱，可能无法真实反映哪些推理步骤实际正确。这种不可靠性削弱了隐式PRMs的核心优势：对大量候选词元进行评分。实践中，带有噪声的词元级优势值可能系统性地强化错误的推理延续。为解决此问题，我们提出了一种新颖的隐式前缀价值奖励模型（Implicit Prefix-Value Reward Model, IPVRM），它直接学习一个前缀条件价值函数来估计最终正确的概率，并通过时序差分（Temporal-Difference, TD）差值推导步骤级信号。IPVRM在ProcessBench基准上显著提升了步骤验证的F1分数。基于这些校准后的前缀价值，我们进一步提出分布级强化学习（Distribution-Level RL, DistRL），该方法同时为已采样的词元和高概率候选词元计算时序差分优势，从而无需额外推演即可实现密集的反事实更新。尽管在使用未校准的隐式奖励时DistRL收益有限，但一旦与IPVRM结合，它便能持续提升下游推理性能。

摘要 (Abstract)

Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.
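
The two ideas in the abstract, step signals as temporal-difference (TD) differences of prefix values and TD advantages for unsampled candidate tokens, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the lookup table `TOY_VALUES` stands in for a learned prefix-value model, and all function names and numbers are hypothetical.

```python
def td_step_signals(prefix_values):
    """Signal for step t is V(prefix after step t) - V(prefix before step t)."""
    return [later - earlier
            for earlier, later in zip(prefix_values, prefix_values[1:])]

def candidate_advantages(value_of, prefix, candidates):
    """TD advantage of appending each candidate token to the same prefix.

    No rollout is needed: each candidate costs one value evaluation.
    """
    base = value_of(prefix)
    return {tok: value_of(prefix + (tok,)) - base for tok in candidates}

# Toy stand-in for a learned prefix-value model (hypothetical numbers):
# each entry is an estimated probability that the prefix ends in a correct answer.
TOY_VALUES = {
    (): 0.50,
    ("add",): 0.70,
    ("subtract",): 0.20,
}

def toy_value(prefix):
    return TOY_VALUES[prefix]

# Estimated correctness probability after 0, 1, 2, and 3 reasoning steps.
prefix_values = [0.50, 0.70, 0.30, 0.35]
signals = td_step_signals(prefix_values)

# The first negative signal marks the step where the estimate first drops.
first_bad_step = next((i for i, s in enumerate(signals) if s < 0), None)

# Advantages for one sampled token and one unsampled candidate at the same state.
advantages = candidate_advantages(toy_value, (), ["add", "subtract"])
```

In this toy trace the second step lowers the estimated chance of a correct final answer, so it receives a negative signal, while the candidate comparison shows how an unsampled token can still receive a learning signal.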

Keywords: Process Reward Models, Implicit PRMs, Prefix-Value Learning, Temporal-Difference, Distribution-Level RL, Reasoning Process, Step Verification, Token-level Rewards

153. ❌ Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

Authors: Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, Ning Ding Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13016v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

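Scoring rationale: The paper's core topic is on-policy distillation (OPD), a key technique in the post-training of large language models, so it is highly relevant to 'Large Language Models' and 'Post-training' (10 each). It does not cover other keywords such as MoE, SLMs, Scaling Laws, Instruction Tuning, RLHF, RAG, reasoning techniques, agents, or compression, nor does it involve AI-for-science applications, so those keywords score 0.

!!! tip deepseek-chat TL;DR

This paper systematically studies the training dynamics and mechanisms of on-policy distillation for large language models. It finds that successful distillation requires compatible thinking patterns between student and teacher and a teacher that offers genuinely new capabilities, and it proposes practical strategies for recovering failing distillation.

Abstract translation (Chinese)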

同策略蒸馏已成为大语言模型后训练的核心技术，但其训练动态机制尚未得到充分理解。本文对同策略蒸馏的动态过程与机制进行了系统性研究。我们首先发现决定同策略蒸馏成败的两个关键条件：（一）学生模型与教师模型应具备兼容的思维模式；（二）即使思维模式一致且教师评分更高，教师仍需提供学生训练过程中未曾接触的真正新能力。我们通过弱到强反向蒸馏验证了这些发现，表明从学生模型视角看，同家族的1.5B与7B参数教师在分布上不可区分。在词元层面机制的探究中，我们发现成功的同策略蒸馏表现为：在学生访问状态的高概率词元上实现渐进式对齐，这些集中于97%-99%概率质量的小规模共享词元集合构成了关键。我们进一步提出两种实用策略以挽救失败的蒸馏过程：离策略冷启动和教师对齐提示选择。最后，我们揭示同策略蒸馏表面上的密集词元级奖励“免费午餐”实则存在代价，这引发了同策略蒸馏能否扩展到长程蒸馏任务的根本性问题。

摘要 (Abstract)

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student’s perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD’s apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
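
To make the token-level observation concrete, the sketch below measures how much probability mass the student and teacher place on the tokens that both rank highly at one state. It is only an illustration of the kind of quantity involved: the abstract does not specify the exact metric, and the function names, the top-`k` cutoff, and the toy distributions are all assumptions.

```python
def shared_top_tokens(student_probs, teacher_probs, k):
    """Tokens that appear in the top-k of both next-token distributions."""
    top_student = set(sorted(student_probs, key=student_probs.get, reverse=True)[:k])
    top_teacher = set(sorted(teacher_probs, key=teacher_probs.get, reverse=True)[:k])
    return top_student & top_teacher

def mass_on(probs, tokens):
    """Total probability a distribution assigns to a token set."""
    return sum(probs.get(tok, 0.0) for tok in tokens)

# Hypothetical next-token distributions at one student-visited state.
student = {"the": 0.60, "a": 0.30, "an": 0.08, "zebra": 0.02}
teacher = {"the": 0.70, "a": 0.25, "one": 0.04, "an": 0.01}

shared = shared_top_tokens(student, teacher, k=2)
student_mass = mass_on(student, shared)
teacher_mass = mass_on(teacher, shared)
```

Tracking such a shared-set mass over training steps would be one way to observe the progressive alignment the abstract describes.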

Keywords: On-policy distillation, Large language models, Post-training, Training dynamics, Teacher-student compatibility, Token-level mechanism, Weak-to-strong distillation, Distillation scaling

154. ❌ Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Authors: Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.12843v2

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

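Scoring rationale: The paper focuses on an evaluation framework for LLM benchmarking. Its core contribution is a method based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks, addressing the problem that models' results on different datasets are not comparable. It is therefore highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10), since it directly concerns LLM evaluation and benchmarking. Other keywords such as MoE, SLMs, training techniques, reasoning methods, agent systems, and AI-for-science applications are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a framework based on multidimensional Item Response Theory that uses fixed-parameter calibration and anchor items to make language-model benchmarking extensible and efficient while keeping scores comparable across evaluation periods. Experiments on more than 400 models show that 100 anchor questions per dataset suffice to predict full-evaluation performance within 2-3 percentage points.

Abstract translation (Chinese)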

语言模型与评测基准的快速迭代使得对每个模型进行全量数据集评估的成本日益高昂。实践中，模型常在不同样本上进行评估，导致跨研究的结果难以直接比较。为解决此问题，我们提出一个基于多维项目反应理论（Item Response Theory, IRT）的框架，该框架通过锚定题项将新基准校准至现有评估体系，同时保持已校准题项参数固定。我们的方法支持一种现实评估场景：数据集随时间逐步引入，模型仅基于评估时可用数据集进行测试，而每个数据集使用固定锚定题项集，使得不同评估周期的结果能够直接比较。在涵盖超过400个模型的大规模实验中，本框架仅需每个数据集100道锚定题项即可将全量评估性能预测误差控制在2-3个百分点内，斯皮尔曼等级相关系数ρ≥0.9，表明能够在保持分数可比性的同时随时间扩展基准体系，且每个新数据集的评估成本保持恒定。代码发布于https://github.com/eliyahabba/growing-pains。

摘要 (Abstract)

The rapid release of both language models and benchmarks makes it increasingly costly to evaluate every model on every dataset. In practice, models are often evaluated on different samples, making scores difficult to compare across studies. To address this, we propose a framework based on multidimensional Item Response Theory (IRT) that uses anchor items to calibrate new benchmarks to the evaluation suite while holding previously calibrated item parameters fixed. Our approach supports a realistic evaluation setting in which datasets are introduced over time and models are evaluated only on the datasets available at the time of evaluation, while a fixed anchor set for each dataset is used so that results from different evaluation periods can be compared directly. In large-scale experiments on more than 400 models, our framework predicts full-evaluation performance within 2-3 percentage points using only 100 anchor questions per dataset, with Spearman ρ ≥ 0.9 for ranking preservation, showing that it is possible to extend benchmark suites over time while preserving score comparability, at a constant evaluation cost per new dataset. Code available at https://github.com/eliyahabba/growing-pains
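
The following sketch shows the general shape of anchor-based scoring with fixed item parameters. It deliberately simplifies: it uses a one-dimensional two-parameter logistic (2PL) IRT model rather than the multidimensional model the paper describes, estimates a model's ability by grid-search maximum likelihood, and uses made-up item parameters. The function names are hypothetical and unrelated to the released code.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_ability(responses, anchors):
    """Maximum-likelihood ability over a grid, with anchor parameters held fixed."""
    grid = [i / 100 for i in range(-400, 401)]  # theta in [-4, 4]

    def log_lik(theta):
        total = 0.0
        for y, (a, b) in zip(responses, anchors):
            p = p_correct(theta, a, b)
            total += math.log(p) if y else math.log(1.0 - p)
        return total

    return max(grid, key=log_lik)

def predict_accuracy(theta, items):
    """Expected accuracy on a full item set without running it."""
    return sum(p_correct(theta, a, b) for a, b in items) / len(items)

# Previously calibrated (a, b) pairs for four anchor items; values are illustrative.
anchors = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 0.5)]
full_benchmark = anchors + [(1.1, -0.5), (0.9, 1.5)]

strong_theta = estimate_ability([1, 1, 0, 1], anchors)
weak_theta = estimate_ability([1, 0, 0, 0], anchors)
strong_pred = predict_accuracy(strong_theta, full_benchmark)
weak_pred = predict_accuracy(weak_theta, full_benchmark)
```

Because the item parameters never change once calibrated, abilities estimated in different evaluation periods live on the same scale, which is what keeps scores comparable as new datasets are added.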

Keywords: LLM benchmarking, Item Response Theory, anchor items, parameter calibration, evaluation framework, score comparability, model evaluation, benchmark extension

155. ❌ Seedance 2.0: Advancing Video Generation for World Complexity

Authors: Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14148v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

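Scoring rationale: Seedance 2.0 focuses on a multi-modal audio-video generation model. Its core is joint video and audio generation, covering architecture design, multi-modal input support, and generation-quality improvements. All scoring keywords revolve around large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agents, and so on), while this paper mentions no language-model-based technique and does not involve AI applications in science. All keywords are therefore unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Seedance 2.0 is a new native multi-modal audio-video generation model with a unified, efficient architecture that supports text, image, audio, and video inputs. It improves across the key dimensions of video and audio generation, with performance on par with the leading levels in the field.

Abstract translation (Chinese)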

Seedance 2.0 是一款全新的原生多模态音视频生成模型，于2026年2月上旬在中国正式发布。相较于其前代版本 Seedance 1.0 和 1.5 Pro，Seedance 2.0 采用了一种统一、高效且规模化的多模态音视频联合生成架构。这使其能够支持文本、图像、音频和视频四种输入模态，并整合了业界迄今最全面的多模态内容参考与编辑功能套件之一。该模型在视频和音频生成的所有关键子维度上均实现了全面且显著的提升。在专家评估和公开用户测试中，该模型均展现出与领域领先水平相当的性能。Seedance 2.0 支持直接生成时长为4至15秒的音视频内容，原生输出分辨率为480p和720p。对于作为参考的多模态输入，其当前开放平台最多支持3个视频片段、9张图像和3个音频片段。此外，我们还提供了 Seedance 2.0 Fast 版本，这是 Seedance 2.0 的加速变体，旨在为低延迟场景提升生成速度。Seedance 2.0 在其基础生成能力和多模态生成性能上均实现了重大改进，为终端用户带来了更优质的创作体验。

摘要 (Abstract)

Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.

Keywords: video generation, audio-video generation, multimodal model, unified architecture, content reference, editing capabilities, generation performance, low-latency scenarios

156. ❌ One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Authors: Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

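Scoring rationale: The paper's core topic is extreme compression for long video understanding, using token-level compression (LP-Comp) and frame-level compression (QC-Comp) to address the LLM context-length limit. It is highly relevant to LLMs (10) because the method performs learnable compression within LLM layers; to Supervised Fine-tuning (10) because it uses a supervised compression tuning stage to improve performance; and to Long Context LLMs (10) because it directly tackles the long-context challenge posed by long videos. It is partially relevant to Model Compression (5) because it compresses tokens rather than model weights in the traditional sense. Other keywords such as MoE, SLMs, and Scaling Laws have no direct connection to the paper.

!!! tip deepseek-chat TL;DR

To address the information loss in long video understanding caused by the limited context length of large language models, the paper proposes token-level and frame-level extreme compression, achieving a higher compression ratio and denser frame sampling. It improves accuracy on several long-video benchmarks, for example from 42.9% to 46.2% on LVBench.

Abstract translation (Chinese)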

长视频理解对视觉语言模型（VLMs）而言本质上面临挑战，因为其涉及海量视频帧。由于每帧视频通常扩展为数十至数百个标记，而大语言模型（LLMs）的上下文长度有限，迫使VLMs只能稀疏感知视频帧并丢失时序信息。为解决此问题，我们探索在最终LLM层实现极端的视频标记压缩，目标达到每帧仅对应一个标记。我们的核心见解是：先前方法广泛采用的基于启发式的压缩容易导致信息丢失，这需要将LLM层监督为可学习的、渐进式的标记级压缩模块（LP-Comp）。这种压缩使我们的VLM能够处理2倍至4倍更多的视频帧，同时提升性能。为进一步提高标记效率，我们研究了帧级压缩，该方法通过LLM层内部注意力分数选择与查询最相关的视频帧，称为问题条件压缩（QC-Comp）。与先前研究的一个显著区别在于，我们通过将长视频分割为短片段并采用局部注意力，缓解了LLM在长上下文中的位置偏差（即过度关注序列首尾的问题）。综合来看，我们结合标记级与帧级压缩，构建了一个用于长视频理解的极端压缩模型，命名为 \name，实现了显著更高的压缩比，并支持更密集的帧采样。我们的 \name 模型基于VideoChat-Flash进行微调，通过一个数据高效的监督压缩调优阶段（仅需2.5%的监督微调数据），在LVBench上将准确率从42.9%提升至46.2%，并在其他多个长视频基准测试中取得增强效果。

摘要 (Abstract)

Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards one token per frame at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into learnable and progressive modules for token-level compression (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate frame-level compression, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named question-conditioned compression (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, i.e., the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined token-level and frame-level compression leads to an extreme compression model for long video understanding, named \name, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient supervised compression tuning stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.
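
One plausible reading of the frame-level step is sketched below: given a per-frame relevance score for the question, keep the highest-scoring frames within each short segment so that selection is not dominated by any one part of the video. This is an assumption-laden illustration. The scores are plain numbers here, how the paper derives them from attention is not modeled, and `select_frames` and its parameters are hypothetical names.

```python
def select_frames(scores, segment_len, keep_per_segment):
    """Keep the top-scoring frame indices inside each fixed-length segment.

    scores: one question-relevance score per frame (assumed given).
    Returns the kept frame indices in temporal order.
    """
    selected = []
    for start in range(0, len(scores), segment_len):
        segment = range(start, min(start + segment_len, len(scores)))
        # Rank frames only against others in the same segment (local comparison).
        best = sorted(segment, key=lambda i: scores[i], reverse=True)[:keep_per_segment]
        selected.extend(sorted(best))
    return selected

# Eight frames split into two segments of four.
frame_scores = [0.1, 0.9, 0.2, 0.4, 0.8, 0.3, 0.7, 0.05]
one_per_segment = select_frames(frame_scores, segment_len=4, keep_per_segment=1)
two_per_segment = select_frames(frame_scores, segment_len=4, keep_per_segment=2)
```

Selecting per segment guarantees temporal coverage, whereas a single global top-k could pick all frames from one part of the video.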

Keywords: long video understanding, token compression, frame compression, large language models, context length, supervised fine-tuning, attention bias mitigation, video-language models

157. ❌ ROSE: Retrieval-Oriented Segmentation Enhancement

Authors: Song Tang, Guangquan Jie, Henghui Ding, Yu-Gang Jiang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14147v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

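Scoring rationale: The ROSE paper focuses on enhancing segmentation models based on multimodal large language models (MLLMs), using retrieval-augmented generation (RAG) to segment novel and emerging entities. It is therefore highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points), because MLLMs are the core foundation model, and highly relevant to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10 points), because the paper's core innovation is an Internet retrieval-augmented generation module. Other keywords, such as MoE, SLMs, Scaling Laws, training techniques, inference optimization, and agent systems, are not covered in the paper and score 0. The paper does not mention applications in specific scientific fields such as bioinformatics, so 'AI for Science' and related keywords also score 0.

!!! tip deepseek-chat TL;DR

The paper proposes the ROSE framework, which uses retrieval-augmented generation to address the knowledge-update problem that multimodal large language models face when segmenting novel and emerging entities, and significantly improves performance on the NEST benchmark.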

Abstract

Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model’s knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model’s perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs’ lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.

Keywords: Multimodal Large Language Models, Retrieval-Augmented Generation, Segmentation, Novel Emerging Segmentation Task, Internet Retrieval, Plug-and-play Framework, NEST Benchmark, Real-time Web Information

158. ❌ Geometric Context Transformer for Streaming 3D Reconstruction

Authors: Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14141v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
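
For reference, the pass mark used throughout this report follows the configured formula 5.0 + 0.8 × total weight. The sketch below reproduces that arithmetic for 27 keywords of weight 1.0. The per-keyword rule (score = weight × relevance, capped at 15) and all function names are assumptions inferred from the table columns and the configuration, not the report generator's actual code.

```python
def pass_threshold(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report configuration: base + factor * total weight."""
    return round(base + factor * total_weight, 1)


def keyword_score(weight: float, relevance: float, max_score: float = 15.0) -> float:
    """Assumed per-keyword rule: weight * relevance, capped at max_score."""
    return min(weight * relevance, max_score)


def total_score(rows: list[tuple[float, float]]) -> float:
    """Sum the assumed per-keyword scores over (weight, relevance) rows."""
    return sum(keyword_score(w, r) for w, r in rows)


threshold = pass_threshold(27 * 1.0)  # 27 keywords, weight 1.0 each
rows = [(0.0, 10.0), (0.0, 0.0)]      # a weight of 0.0 zeroes out any relevance
print(threshold, total_score(rows), total_score(rows) >= threshold)
```

Under this assumed rule, a weight of 0.0 yields a score of 0.0 for any relevance value, which is consistent with the 0.0 / 26.6 totals shown in the tables.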

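Scoring rationale: The paper focuses on streaming 3D reconstruction in computer vision and proposes LingBot-Map, a 3D foundation model built on a geometric context transformer (GCT). Although the paper mentions a '3D foundation model', this is a specialized model for 3D scene reconstruction rather than a large language model (LLM) from natural language processing. Its core techniques involve computer vision and robotics concepts such as SLAM, geometric attention mechanisms, coordinate grounding, and drift correction, which have no direct connection to any of the LLM-related techniques in the keyword list (LLMs, MoE, SFT, RLHF, RAG, CoT, Agents, and so on) or to AI for Science (bioinformatics, cheminformatics). The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes LingBot-Map, a foundation model for streaming 3D reconstruction that uses a geometric context transformer architecture to recover camera poses and point clouds from video streams efficiently and accurately, outperforming existing streaming and iterative optimization-based methods on multiple benchmarks.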

Abstract

Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.

Keywords: Streaming 3D Reconstruction, Geometric Context Transformer, LingBot-Map, SLAM, Camera Poses, Point Clouds, Attention Mechanism, Long-range Drift Correction

159. ❌ Training-Free Semantic Multi-Object Tracking with Vision-Language Models

Authors: Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14074v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

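Scoring rationale: The paper proposes TF-SMOT, a training-free semantic multi-object tracking pipeline that combines pretrained components (such as InternVideo2.5 and an LLM) for video-language generation and semantic retrieval. It is therefore related to 'Large Language Models' (used for disambiguation), 'Pre-training' (use of pretrained models), and 'Retrieval-Augmented Generation' (gloss-based semantic retrieval), but none of these is the paper's core focus. The remaining keywords are unrelated to its computer vision and video analysis subject matter.

!!! tip deepseek-chat TL;DR

The paper addresses the problem that semantic multi-object tracking (SMOT) systems require expensive supervised training and are slow to adapt to new models. It proposes TF-SMOT, a training-free pipeline that composes pretrained components, achieving state-of-the-art tracking performance on the BenSMOT benchmark and improving summary and caption quality.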

Abstract

Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.

Keywords: Semantic Multi-Object Tracking, Training-Free, Vision-Language Models, Video Summaries, Instance Captions, Interaction Recognition, Pretrained Components, LLM Disambiguation

160. ❌ Towards Unconstrained Human-Object Interaction

Authors: Francesco Tonini, Alessandro Conti, Lorenzo Vaquero, Cigdem Beyan, Elisa Ricci Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14069v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

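Scoring rationale: The core of the paper is applying multimodal large language models (MLLMs) to human-object interaction detection in computer vision, which makes it an application study of large models in a specific domain. The paper explicitly uses MLLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not address the specific techniques described by the other keywords (such as MoE, quantization, alignment, or inference acceleration) or specific scientific fields (such as bioinformatics), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper defines the unconstrained human-object interaction (U-HOI) detection task and uses multimodal large language models for open-ended interaction recognition without a predefined interaction vocabulary. Its findings highlight the limitations of current HOI detectors and the value of MLLMs for this task.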

Abstract

Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi

Keywords: Human-Object Interaction, HOI detection, Multimodal Large Language Models, MLLMs, Unconstrained HOI, in-the-wild detection, language-to-graph conversion, computer vision

161. ❌ OneHOI: Unifying Human-Object Interaction Generation and Editing

Authors: Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14062v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

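Scoring rationale: The paper focuses on human-object interaction (HOI) generation and editing in computer vision and proposes OneHOI, a unified framework based on a diffusion transformer. Although it is a deep learning application, its content is unrelated to all of the scoring keywords, which center on large language model techniques, training methods, inference optimization, alignment, agent systems, and so on. The paper does not involve language models, MoE, scaling laws, training techniques, reasoning methods, agent systems, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes OneHOI, a unified diffusion transformer framework that addresses condition integration and interaction decoupling in human-object interaction generation and editing. Trained jointly on HOI-Edit-44K together with HOI and object-centric datasets, it achieves state-of-the-art results across both HOI generation and editing.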

Abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.

Keywords: Human-Object Interaction, HOI generation, HOI editing, diffusion transformer, Relational Diffusion Transformer, structured interaction representations, layout-guided control, HOI-Edit-44K

162. ❌ Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself

Authors: Yuhang Dai, Xingyi Yang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14048v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

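Scoring rationale: The paper focuses on 3D reconstruction and proposes the Free Geometry framework, which improves foundation models through a self-supervised task and lightweight LoRA updates. It is unrelated to most of the large-model and deep learning keywords. It is highly relevant only to 'PEFT/LoRA/Parameter-efficient Fine-tuning' (the paper explicitly uses LoRA for fast recalibration) and has some relevance to 'Self-Correction/Self-Improvement/Self-Reflection' (the model self-evolves at test time). Other keywords such as LLMs, MoE, Scaling Laws, and Alignment are not addressed.

!!! tip deepseek-chat TL;DR

The paper proposes the Free Geometry framework, which uses a self-supervised task and lightweight LoRA updates to let feed-forward 3D reconstruction models self-evolve at test time without 3D ground truth. Across 4 benchmark datasets it reports average improvements of 3.73% in camera pose accuracy and 2.88% in point map prediction.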

Abstract

Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at https://github.com/hiteacherIamhumble/Free-Geometry .
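
The abstract mentions fast recalibration via lightweight LoRA updates. As a reminder of what such an update looks like in general, the sketch below applies the standard LoRA formulation W' = W + (alpha / r) · B·A to a toy weight matrix in plain Python. It is not the Free Geometry implementation: the matrix sizes, values, and helper names are illustrative assumptions, and the self-supervised consistency loss that would train A and B is omitted.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank (rows of A).

    w: frozen base weight, shape (d_out x d_in)
    b: trainable factor,   shape (d_out x r)
    a: trainable factor,   shape (r x d_in)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example (hypothetical values): d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B_init = [[0.0], [0.0]]   # B starts at zero, so the adapted weight equals W
B_tuned = [[1.0], [0.0]]  # pretend a few adaptation steps changed B

unchanged = lora_effective_weight(W, A, B_init, alpha=2.0)
adapted = lora_effective_weight(W, A, B_tuned, alpha=2.0)
print(unchanged)
print(adapted)
```

Only A and B would be updated during adaptation while W stays frozen, which is what keeps this kind of update cheap.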

Keywords: 3D reconstruction, self-supervised learning, LoRA, feed-forward models, test-time adaptation, cross-view consistency, foundation models, camera pose accuracy

163. ❌ Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

Authors: Xiaomin Li, Tala Wang, Zichen Zhong, Ying Zhang, Zirui Zheng, Takashi Isobe, Dezhuang Li, Huchuan Lu, You He, Xu Jia Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14041v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

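Scoring rationale: The paper focuses on evaluating the visual clue-driven reasoning ability of multimodal large language models (MLLMs) in daily scenarios. It centers on the reasoning ability of large models (Chain of Thought / System 2 Thinking) and on agent (LLM Agents) applications, and is highly relevant to these keywords (10 points). Other keywords such as MoE, quantization, and alignment are not mentioned in the abstract and are entirely unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study addresses the lack of evaluation of visual clue-driven reasoning for multimodal large language models in daily scenarios. It proposes the DailyClue benchmark and finds through evaluation that accurately identifying visual clues is essential for robust reasoning.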

Abstract

Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs’ pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.

Keywords: Multimodal Large Language Models, Visual Clue-Driven Reasoning, Daily Scenarios, Benchmark, MLLMs, Reasoning Capability, DailyClue, Visual Clues

164. ❌ POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

Authors: Yikun Liu, Yuan Liu, Le Tian, Xiao Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14029v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

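Scoring rationale: The paper's core subject is a multimodal agentic search model, which is highly relevant to 'LLM Agents/Autonomous Agents/Agentic Workflow' (15 points). 'Retrieval-Augmented Generation/RAG/Retrieval-Generation' and 'Tool Use/Function Calling/API Tool Use' correspond directly to the use of search tools (10 points each). The paper involves long-context interaction ('Context Window Extension/Long Context LLMs', 8 points), multi-step reasoning ('Chain of Thought/CoT Reasoning/Multi-step Reasoning' and 'System 2 Thinking/Slow Thinking/In-depth Reasoning', 8 points each), and large-model foundations ('Large Language Models/LLMs/Foundation Models', 8 points). Pre-training, fine-tuning, instruction tuning, hallucination mitigation, and in-context learning have some relevance (5 points each). The remaining keywords, such as MoE, quantization, and AI for Science, are not addressed (0 points).

!!! tip deepseek-chat TL;DR

The paper studies how to train a multimodal agentic search model from scratch. By introducing an Agentic Seeding phase and the V-Fold history compression scheme, it addresses the performance bottleneck in long-horizon interactions. The resulting POINTS-Seeker-8B model outperforms existing models on six benchmarks.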

Abstract

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model’s ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
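
The abstract describes V-Fold as keeping recent dialogue turns in high fidelity while folding older history into the visual space via rendering. The sketch below illustrates only the split between "recent" and "folded" history, under loud assumptions: the fixed `keep_recent` cutoff, the `FoldedHistory` type, and the use of a plain text page as a stand-in for a rendered image are all hypothetical, and the paper's adaptive policy and actual rendering step are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FoldedHistory:
    folded_page: str         # stand-in for older turns rendered into an image
    recent_turns: list[str]  # most recent turns, kept verbatim


def fold_history(turns: list[str], keep_recent: int = 2) -> FoldedHistory:
    """Keep the last `keep_recent` turns verbatim and fold the rest into one page.

    A fixed cutoff is an assumption made for illustration; the abstract
    describes the real scheme as adaptive and history-aware.
    """
    if keep_recent <= 0:
        older, recent = list(turns), []
    else:
        older, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Number the older turns so their order survives inside the folded page.
    page = "\n".join(f"[{i}] {turn}" for i, turn in enumerate(older, start=1))
    return FoldedHistory(folded_page=page, recent_turns=list(recent))


history = ["q1", "a1", "q2", "a2", "q3"]
folded = fold_history(history, keep_recent=2)
print(folded.recent_turns)
print(folded.folded_page)
```

In a real system the folded page would be rendered to an image and passed to the model's visual input, so the long tail of history occupies image tokens rather than a growing text context.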

Keywords: multimodal agentic search model, Agentic Seeding, V-Fold compression, long-horizon interactions, visual reasoning, evidence retrieval, POINTS-Seeker-8B, interaction history

165. ❌ Depth-Aware Image and Video Orientation Estimation

Authors: Muhammad Z. Alam, Larry Stetsiuk, M. Umair Mukati, Zeeshan Kaleem | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13995v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

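Scoring rationale: The paper addresses image and video orientation estimation in computer vision, using depth distribution and depth gradient consistency. It is a conventional computer vision task and involves none of the keyword areas, such as large language models, innovations in deep-learning methodology, or AI for Science.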
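The 26.6 pass mark that every score in this report is compared against follows from the formula in the report's configuration section (5.0 + 0.8 × total weight, with a total weight of 27.0). A small sketch of that arithmetic is below; the function names are hypothetical, and treating a score equal to the pass mark as a pass is an assumption, since the report does not say whether the comparison is strict.

```python
def passing_score(total_weight: float, base: float = 5.0, per_weight: float = 0.8) -> float:
    """Pass mark = base + per_weight * total keyword weight (from the report config)."""
    return base + per_weight * total_weight


def is_pass(score: float, total_weight: float) -> bool:
    # Assumption: a score equal to the pass mark counts as passing.
    return score >= passing_score(total_weight)


threshold = passing_score(27.0)
print(round(threshold, 1))
```

With 27 keywords of weight 1.0 each, a paper scoring 0.0 falls well below the mark, which is why it is flagged with ❌.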

!!! tip deepseek-chat TL;DR

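The paper proposes a new method for image and video orientation estimation based on depth distribution and depth gradient consistency, and reports better robustness and accuracy than existing techniques, with applications such as VR/AR and autonomous navigation in mind.

Abstract (Chinese translation)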

本文提出了一种利用自然图像深度分布进行图像与视频方向估计的新方法。该方法通过分析图像不同象限的深度分布来估计方向，为虚拟现实（VR）、增强现实（AR）、自主导航和交互式监控系统等应用提供了一个鲁棒的方向估计框架。为进一步增强细粒度感知对齐，我们引入了深度梯度一致性（Depth Gradient Consistency, DGC）与水平对称性分析（Horizontal Symmetry Analysis, HSA），以实现精确的方向校正。这一混合策略有效利用深度线索，保障了沉浸式视觉内容的空间连贯性与感知稳定性。定性与定量评估表明，所提方法在不同场景下均展现出优于现有技术的鲁棒性与准确性。

Abstract

This paper introduces a novel approach for image and video orientation estimation by leveraging depth distribution in natural images. The proposed method estimates the orientation based on the depth distribution across different quadrants of the image, providing a robust framework for orientation estimation suited for applications such as virtual reality (VR), augmented reality (AR), autonomous navigation, and interactive surveillance systems. To further enhance fine-scale perceptual alignment, we incorporate depth gradient consistency (DGC) and horizontal symmetry analysis (HSA), enabling precise orientation correction. This hybrid strategy effectively exploits depth cues to support spatial coherence and perceptual stability in immersive visual content. Qualitative and quantitative evaluations demonstrate the robustness and accuracy of the proposed approach, outperforming existing techniques across diverse scenarios.
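To make the idea of reading orientation from per-quadrant depth concrete, here is a minimal sketch. It rests on an assumption that the abstract does not state: in an upright natural scene the lower half (ground) tends to be nearer than the upper half (sky, background). The helper names are hypothetical, and the DGC and HSA refinement steps are not modelled.

```python
from statistics import mean


def rotate90(grid):
    """Rotate a 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def quadrant_means(depth):
    """Mean depth of the four image quadrants."""
    h, w = len(depth), len(depth[0])
    top, bottom = depth[: h // 2], depth[h // 2:]

    def block(rows, lo, hi):
        return mean(v for row in rows for v in row[lo:hi])

    return {
        "top_left": block(top, 0, w // 2),
        "top_right": block(top, w // 2, w),
        "bottom_left": block(bottom, 0, w // 2),
        "bottom_right": block(bottom, w // 2, w),
    }


def upright_score(depth):
    """Higher when the top quadrants are farther away than the bottom ones."""
    q = quadrant_means(depth)
    return (q["top_left"] + q["top_right"]) - (q["bottom_left"] + q["bottom_right"])


def estimate_rotation(depth):
    """Clockwise rotation (0/90/180/270 degrees) that makes the map look most upright."""
    best_angle, best_score, current = 0, upright_score(depth), depth
    for angle in (90, 180, 270):
        current = rotate90(current)
        score = upright_score(current)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle
```

For example, a depth map whose rows have been flipped top to bottom yields 180, meaning a half turn restores the upright view under the stated assumption.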

Keywords: orientation estimation, depth distribution, depth gradient consistency, horizontal symmetry analysis, virtual reality, augmented reality, autonomous navigation, surveillance systems

166. ❌ Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework

Authors: Enzhuo Zhang, Sijie Zhao, Dilxat Muhtar, Zhenshi Li, Xueliang Zhang, Pengfeng Xiao | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13994v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

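Scoring rationale: The paper focuses on remote sensing image super-resolution and proposes a texture-aware diffusion framework (TexADiff); its core contribution is handling imbalanced texture distributions in remote sensing imagery. The scoring keywords all concern large language models (LLMs), innovations in deep-learning methodology, or AI applied to science, whereas this paper studies image super-resolution, a computer vision task, using a generative diffusion model. It does not touch the techniques or application areas described by keywords such as LLMs, MoE, scaling laws, alignment, reasoning, agents, or quantization. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

To address imbalanced texture distributions in remote sensing images, the paper proposes a texture-aware diffusion framework (TexADiff) that estimates a Relative Texture Density Map to guide the diffusion process. In super-resolution this produces more faithful high-frequency details and suppresses texture hallucinations, improving reconstruction quality and downstream task performance.

Abstract (Chinese translation)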

生成式扩散先验模型近期在自然图像超分辨率领域取得了最先进的性能，展现出合成逼真细节的强大能力。然而，将其直接应用于遥感图像超分辨率（RSISR）时，却暴露出显著缺陷。与自然图像不同，遥感图像呈现出独特的纹理分布：地物在全局上具有随机性，而在局部则呈现聚集性，导致纹理高度不平衡。这种不平衡严重阻碍了模型的空间感知能力。为解决这一问题，我们提出了TexADiff，一种新颖的框架。该框架首先通过估算相对纹理密度图（Relative Texture Density Map, RTDM）来表征纹理分布。随后，TexADiff以三种协同方式利用此RTDM：作为显式的空间条件来引导扩散过程，作为损失调制项以优先处理纹理丰富区域，以及作为采样策略的动态适配器。这些改进旨在赋予模型显式的纹理感知能力。实验表明，TexADiff取得了优越或具有竞争力的量化指标。此外，定性结果显示，我们的模型能够生成忠实的高频细节，同时有效抑制纹理幻觉。这种重建质量的提升也带来了下游任务性能的显著增益。本方法的源代码可在 https://github.com/ZezFuture/TexAdiff 获取。

Abstract

Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model’s spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at https://github.com/ZezFuture/TexAdiff.
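One of the three uses of the RTDM named in the abstract is a loss modulation term that prioritises texture-rich regions. A minimal sketch of such a modulation is shown below as a per-pixel weighted squared error. The weighting form `1 + strength * density`, the function name, and the normalisation by the weight sum are assumptions for illustration; the abstract does not give the paper's actual formula.

```python
def texture_weighted_mse(pred, target, density, strength=1.0):
    """Mean squared error with per-pixel weights 1 + strength * density.

    pred, target, density: 2-D lists of equal shape; density values in [0, 1]
    play the role of a relative texture density map (hypothetical weighting).
    """
    total, weight_sum = 0.0, 0.0
    for p_row, t_row, d_row in zip(pred, target, density):
        for p, t, d in zip(p_row, t_row, d_row):
            w = 1.0 + strength * d
            total += w * (p - t) ** 2
            weight_sum += w
    return total / weight_sum
```

With `strength=0` this reduces to the ordinary mean squared error, so the parameter controls how strongly errors in dense-texture regions dominate the objective.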

Keywords: remote sensing image super-resolution, texture imbalance, diffusion model, texture-aware framework, relative texture density map, high-frequency details, texture hallucination suppression, downstream task performance

167. ❌ HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions

Authors: Jianlin Xiang, Linhui Dai, Xue Yang, Chaolei Yang, Yanshan Li | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13981v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

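Scoring rationale: The paper addresses interpretable object detection in computer vision. It proposes HiProto, a method based on hierarchical prototype learning that uses a prototype contrastive loss, a prototype regularization loss, and a scale-aware pseudo-label generation strategy to improve detection performance and interpretability under low-quality conditions. The scoring keywords target large language models, deep-learning methodology, or AI for Science, whereas this paper studies object detection, a conventional computer vision task, and involves none of these. The only weak link is to "Mechanistic Interpretability OR Explainable AI": the paper discusses interpretability, but of a vision detector rather than of large language models, hence a relevance of 5 (some connection). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes HiProto, an interpretable object detection method based on hierarchical prototype learning. It targets detectors that lack interpretability and semantic discrimination under low-quality imaging conditions, achieves competitive results on ExDark, RTTS, and VOC2012-FOG, and offers clear interpretability through prototype responses.

Abstract (Chinese translation)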

可解释性对于在关键应用中部署目标检测系统至关重要，尤其是在低质量成像条件下，这些条件会削弱视觉信息并增加预测不确定性。现有方法要么提升图像质量，要么设计复杂架构，但往往缺乏可解释性，且未能改善语义区分能力。相比之下，原型学习通过将特征与以类别为中心的语义关联起来，实现了可解释建模，能够在图像退化条件下提供更稳定且可解释的表征。受此启发，我们提出了HiProto，一种基于分层原型学习的可解释目标检测新范式。通过跨多个特征层级构建结构化的原型表征，HiProto有效建模了类别特定的语义，从而同时增强了语义区分能力和可解释性。基于原型建模，我们首先提出了区域到原型对比损失（Region-to-Prototype Contrastive Loss, RPC-Loss），以增强原型对目标区域的语义聚焦。接着，我们提出了原型正则化损失（Prototype Regularization Loss, PR-Loss），以提高不同类别原型之间的区分度。最后，我们提出了尺度感知伪标签生成策略（Scale-aware Pseudo Label Generation Strategy, SPLGS），以抑制RPC-Loss中不匹配的监督信号，从而保持低层级原型表征的鲁棒性。在ExDark、RTTS和VOC2012-FOG数据集上的实验表明，HiProto取得了具有竞争力的结果，同时通过原型响应提供了清晰的可解释性，且无需依赖图像增强或复杂架构。我们的代码将在https://github.com/xjlDestiny/HiProto.git 公开。

Abstract

Interpretability is essential for deploying object detection systems in critical applications, especially under low-quality imaging conditions that degrade visual information and increase prediction uncertainty. Existing methods either enhance image quality or design complex architectures, but often lack interpretability and fail to improve semantic discrimination. In contrast, prototype learning enables interpretable modeling by associating features with class-centered semantics, which can provide more stable and interpretable representations under degradation. Motivated by this, we propose HiProto, a new paradigm for interpretable object detection based on hierarchical prototype learning. By constructing structured prototype representations across multiple feature levels, HiProto effectively models class-specific semantics, thereby enhancing both semantic discrimination and interpretability. Building upon prototype modeling, we first propose a Region-to-Prototype Contrastive Loss (RPC-Loss) to enhance the semantic focus of prototypes on target regions. Then, we propose a Prototype Regularization Loss (PR-Loss) to improve the distinctiveness among class prototypes. Finally, we propose a Scale-aware Pseudo Label Generation Strategy (SPLGS) to suppress mismatched supervision for RPC-Loss, thereby preserving the robustness of low-level prototype representations. Experiments on ExDark, RTTS, and VOC2012-FOG demonstrate that HiProto achieves competitive results while offering clear interpretability through prototype responses, without relying on image enhancement or complex architectures. Our code will be available at https://github.com/xjlDestiny/HiProto.git.
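Prototype learning, as described above, ties a feature to class-centred prototypes and reads the prediction off the prototype responses. The sketch below shows a generic InfoNCE-style contrastive loss between one region feature and a set of class prototypes. It is a stand-in chosen for illustration, not the paper's RPC-Loss, whose exact form the abstract does not give; all names and the temperature value are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def prototype_contrastive_loss(region, prototypes, label, temperature=0.1):
    """Cross-entropy over prototype similarities (InfoNCE-style, an assumption)."""
    logits = [cosine(region, p) / temperature for p in prototypes]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


def nearest_prototype(region, prototypes):
    """Index of the prototype with the highest response; doubles as the explanation."""
    sims = [cosine(region, p) for p in prototypes]
    return max(range(len(sims)), key=sims.__getitem__)
```

The loss is small when the region feature is closest to its own class prototype and large otherwise, which is the behaviour a prototype-based detector needs for both discrimination and interpretability.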

Keywords: interpretable object detection, prototype learning, low-quality conditions, hierarchical prototype learning, semantic discrimination, region-to-prototype contrastive loss, prototype regularization loss, scale-aware pseudo label generation

168. ❌ MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images

Authors: Felicia Bader, Philipp Seeböck, Anastasia Bartashova, Ulrike Attenberger, Georg Langs | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13970v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

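Scoring rationale: MApLe focuses on fine-grained alignment between diagnostic reports and medical images, an application of AI in biomedicine. It proposes a multi-task, multi-instance vision-language alignment method that combines medical image analysis with natural language processing for diagnosis. The paper does not address general-purpose techniques such as foundation-model principles, training methods, inference optimization, or agent systems, so every keyword except "AI for Science OR Bioinformatics OR Cheminformatics" (highly relevant, 10 points) is unrelated (0 points).

!!! tip deepseek-chat TL;DR

The study tackles the difficulty of precisely aligning text in diagnostic reports with small lesion regions in medical images. It proposes MApLe, a multi-task, multi-instance vision-language alignment method that improves alignment performance over state-of-the-art baselines on several downstream tasks.

Abstract (Chinese translation)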

在诊断报告中，专家将复杂的影像数据编码为具有临床指导意义的信息。他们描述在解剖学背景下具有意义的细微病理发现。报告遵循相对一致的结构，用少量文字表达诊断信息，这些文字常与微小但关键的影像观察结果相关联。标准的视觉语言模型难以识别这些信息丰富的文本成分与图像中微小区域之间的关联。本文提出“MApLe”，一种多任务、多实例的视觉语言对齐方法，以克服这些限制。该方法解耦了解剖区域与诊断发现的概念，并通过分块处理的方式将局部图像信息与句子相关联。我们的方法包括：一个经过训练以捕捉句子中解剖和诊断概念的文本嵌入模型，一个基于解剖结构进行条件编码的分块图像编码器，以及对这些表示进行多实例对齐的机制。我们证明，MApLe能够成功对齐自由文本报告中不同的图像区域和多个诊断发现。实验表明，在多项下游任务评估中，我们的模型相较于现有先进基线模型提升了对齐性能。代码发布于https://github.com/cirmuw/MApLe。

Abstract

In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose “MApLe”, a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at https://github.com/cirmuw/MApLe.
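The patch-wise, multi-instance linking of sentences to image regions can be sketched with a common multiple-instance pooling rule: a sentence is scored against every patch and the best-matching patch stands in for the image. Max pooling, the function names, and the toy embeddings are assumptions for illustration; the abstract does not specify how MApLe aggregates patch scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def sentence_image_score(sentence_vec, patch_vecs):
    """Return (best similarity, index of best patch) under max pooling (an assumption)."""
    sims = [cosine(sentence_vec, p) for p in patch_vecs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return sims[best], best


def align_report(sentence_vecs, patch_vecs):
    """Map each report sentence to its best-matching patch index."""
    return [sentence_image_score(s, patch_vecs)[1] for s in sentence_vecs]
```

Because each sentence picks its own patch, several findings in one report can point at different small regions of the same image.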

Keywords: medical imaging, diagnostic reports, vision-language alignment, multi-instance learning, patch-wise approach, anatomical region, diagnostic finding, downstream tasks

169. ❌ Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection

Authors: Hamed Ouattara, Pierre Duthon, Pascal Houssam Salmane, Frédéric Bernardin, Omar Ait Aider | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13947v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

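Scoring rationale: The paper addresses weather attribute detection in computer vision using lightweight CNN architectures (a truncated ResNet-50 and PatchGAN-style networks) and style-transfer techniques (Gram matrices); it is image classification and multi-task learning research. The scoring keywords concern large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on) or adjacent application areas such as AI for Science. The paper mentions no language model, Transformer architecture, or LLM-related method, so it is unrelated to every keyword.

!!! tip deepseek-chat TL;DR

The paper proposes lightweight, style-transfer-inspired multi-task architectures that detect weather type and attributes from RGB images in real time. They reach F1 scores above 96% on the internal test set and above 78% in zero-shot evaluation on several external datasets.

Abstract (Chinese translation)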

本文提出了一种轻量高效的架构，用于从RGB图像中检测天气状况，可预测天气类型（晴天、雨、雪、雾）及强度、能见度、地面状况等11种互补属性，任务总计涵盖53个类别。本研究探讨了天气条件在何种程度上表现为视觉风格的变化。我们在采用注意力机制的多任务框架内，研究了包括格拉姆矩阵（Gram matrices）、针对中低层特征的截断式ResNet-50（truncated ResNet-50）以及PatchGAN风格架构在内的多种风格启发技术。我们引入了两个系列模型：RTM（ResNet50截断多任务）和PMG（PatchGAN多任务格拉姆）及其变体。主要贡献包括：实现了格拉姆矩阵计算的自动化，将PatchGAN整合到有监督的多任务学习中，以及通过局部格拉姆矩阵（local Gram）捕捉局部风格以提升空间一致性。同时，我们以知识共享署名许可（Creative Commons Attribution, CC-BY）发布了一个包含503,875张图像的数据集，每张图像标注了12种天气属性。模型在内部测试集上取得了超过96%的F1分数，并在多个外部数据集上的零样本评估中达到78%以上，证实了其泛化能力。PMG架构参数量不足500万，可实时运行且内存占用小，适用于嵌入式系统。模型的模块化设计也允许根据需要添加或移除与风格或天气相关的任务。

Abstract

We present lightweight and efficient architectures to detect weather conditions from RGB images, predicting the weather type (sunny, rain, snow, fog) and 11 complementary attributes such as intensity, visibility, and ground condition, for a total of 53 classes across the tasks. This work examines to what extent weather conditions manifest as variations in visual style. We investigate style-inspired techniques, including Gram matrices, a truncated ResNet-50 targeting lower and intermediate layers, and PatchGAN-style architectures, within a multi-task framework with attention mechanisms. Two families are introduced: RTM (ResNet50-Truncated-MultiTasks) and PMG (PatchGAN-MultiTasks-Gram), together with their variants. Our contributions include automation of Gram-matrix computation, integration of PatchGAN into supervised multi-task learning, and local style capture through local Gram for improved spatial coherence. We also release a dataset of 503,875 images annotated with 12 weather attributes under a Creative Commons Attribution (CC-BY) license. The models achieve F1 scores above 96 percent on our internal test set and above 78 percent in zero-shot evaluation on several external datasets, confirming their generalization ability. The PMG architecture, with fewer than 5 million parameters, runs in real time with a small memory footprint, making it suitable for embedded systems. The modular design of the models also allows style-related or weather-related tasks to be added or removed as needed.
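The Gram matrix used as a style descriptor is the matrix of inner products between feature channels, and a local Gram restricts the same computation to a spatial window. The sketch below shows both on plain Python lists. Dividing by the number of positions is one common normalisation and is an assumption here, as are the function names; the paper's own layers and normalisation are not given in the abstract.

```python
def gram_matrix(features):
    """features: C channels, each a flat list of N activations. Returns a C x C matrix."""
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def local_gram(feature_maps, top, left, size):
    """Gram matrix of one size x size window taken from C feature maps (each H x W)."""
    flat = [[v for row in fmap[top:top + size] for v in row[left:left + size]]
            for fmap in feature_maps]
    return gram_matrix(flat)
```

Since the Gram matrix discards where activations occur and keeps only how channels co-activate, computing it per window is one way to recover some spatial information, which matches the abstract's motivation for local Gram.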

Keywords: weather attribute detection, style transfer, Gram matrices, multi-task learning, real-time processing, lightweight architecture, PatchGAN, zero-shot evaluation

170. ❌ SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation

Authors: Songlin Du, Xiaoyong Lu, Yaping Yan, Guobao Xiao, Xiaobo Lu, Takeshi Ikenaga | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13941v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

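Scoring rationale: "SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation" addresses local feature matching in computer vision. It proposes a scene-aware Transformer framework that combines implicit parallel attention with explicit cross-view visibility estimation. Its core techniques are the application of the Transformer architecture to visual feature matching and a training procedure that needs no scene-level annotation. The scoring keywords concern large language models (LLMs), innovations in deep-learning methodology, or AI applied to science, whereas this paper studies conventional computer vision tasks (image matching, pose estimation, and so on) and touches none of the listed areas, including LLMs, MoE, scaling laws, alignment, RAG, inference acceleration, and AI for Science. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SceneGlue, a scene-aware Transformer framework that combines implicit parallel attention with explicit visibility estimation to overcome the locality limits of feature descriptors in cross-view matching. It improves matching accuracy and robustness without scene-level annotation.

Abstract (Chinese translation)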

局部特征匹配在理解跨视角图像对应关系中起着关键作用。然而，传统方法受限于特征描述符固有的局部性，难以捕捉对准确跨视角匹配至关重要的非局部场景信息。本文提出SceneGlue，一种旨在克服这些局限性的场景感知特征匹配框架。SceneGlue采用一种可混合的匹配范式，整合了隐式并行注意力与显式跨视角可见性估计。并行注意力机制同时在图像内部及图像间的局部描述符之间交换信息，从而增强场景的全局上下文。为进一步丰富场景感知能力，我们提出可见性变换器（Visibility Transformer），其显式地将特征分类为可见与不可见区域，以理解跨视角的场景可见性。通过结合显式与隐式的场景级感知，SceneGlue有效弥补了局部描述符的局限性。值得注意的是，SceneGlue仅使用局部特征匹配进行训练，无需场景级真实标注。与传统方法相比，这种场景感知方法不仅提升了准确性与鲁棒性，还增强了可解释性。在单应性估计、姿态估计、图像匹配和视觉定位等应用上的大量实验验证了SceneGlue的优越性能。源代码发布于https://github.com/songlin-du/SceneGlue。

Abstract

Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene’s global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue’s superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.
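To show how an explicit visible/invisible label per feature can change the matching outcome, here is a minimal sketch of mutual nearest-neighbour matching restricted to features flagged as visible. This is a classical matching rule used purely as a stand-in: it is not SceneGlue's attention-based matcher, and the visibility flags are supplied by hand rather than predicted by a Visibility Transformer. All names are hypothetical.

```python
def mutual_matches(similarity, visible_a, visible_b):
    """Mutual nearest neighbours among visible features.

    similarity[i][j] is the score between feature i of image A and feature j of
    image B; visible_a / visible_b are per-feature booleans (given, not learned).
    """
    rows_a = [i for i in range(len(similarity)) if visible_a[i]]
    matches = []
    for i in rows_a:
        candidates = [j for j in range(len(similarity[i])) if visible_b[j]]
        if not candidates:
            continue
        j = max(candidates, key=lambda c: similarity[i][c])
        # Keep the pair only if i is also the best visible partner for j.
        back = max(rows_a, key=lambda k: similarity[k][j])
        if back == i:
            matches.append((i, j))
    return matches
```

Masking a feature that is occluded in the other view removes a competitor that would otherwise win the match, which is the intuition behind combining visibility estimation with descriptor matching.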

Keywords: Scene-Aware Transformer, Feature Matching, Cross-view Images, Parallel Attention, Visibility Transformer, Local Feature Descriptors, Homography Estimation, Visual Localization

171. ❌ A Multi-Stage Optimization Pipeline for Bethesda Cell Detection in Pap Smear Cytology

Authors: Martin Amster, Camila María Polotto | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13939v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

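Scoring rationale: The paper addresses medical image analysis (cell detection in Pap smears) with established computer vision models such as YOLO and U-Net. It involves no large language model, no innovation in deep-learning methodology, and none of the LLM-related techniques in the keyword list. It is only weakly linked to "AI for Science OR Bioinformatics OR Cheminformatics" (as a biomedical AI application): the paper does not use these terms, and its novelty lies in model ensembling rather than foundation-model techniques. That keyword therefore receives 5 (some connection) and all others 0 (unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes a multi-stage framework that ensembles YOLO and U-Net for Bethesda cell detection in Pap smear cytology. It placed second in Track B of the Riva Cytology Challenge, held in association with ISBI, with an mAP50-95 of 0.5909.

Abstract (Chinese translation)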

近年来，计算机视觉技术取得了显著进展，在医学领域得到了多样且具有影响力的应用。本文提出了一种用于检测巴氏涂片图像中贝塞斯达细胞（Bethesda cells）的新框架，该框架是为与国际生物医学成像研讨会（ISBI）联合举办的Riva细胞学挑战赛（Riva Cytology Challenge）的B赛道所开发。本研究的重点在于提升用于细胞检测的计算机视觉模型，其性能采用mAP50-95指标进行评估。我们提出了一种基于YOLO与U-Net架构集成，并辅以利用重叠去除技术和二元分类器的细化阶段的解决方案。我们的框架在竞赛中以0.5909的mAP50-95分数获得了第二名。实现方法与源代码可在以下代码库获取：github.com/martinamster/riva-trackb。

Abstract

Computer vision techniques have advanced significantly in recent years, finding diverse and impactful applications within the medical field. In this paper, we introduce a new framework for the detection of Bethesda cells in Pap smear images, developed for Track B of the Riva Cytology Challenge held in association with the International Symposium on Biomedical Imaging (ISBI). This work focuses on enhancing computer vision models for cell detection, with performance evaluated using the mAP50-95 metric. We propose a solution based on an ensemble of YOLO and U-Net architectures, followed by a refinement stage utilizing overlap removal techniques and a binary classifier. Our framework achieved second place with a mAP50-95 score of 0.5909 in the competition. The implementation and source code are available at the following repository: github.com/martinamster/riva-trackb
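The refinement stage mentions overlap removal, and the metric mAP50-95 averages precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below shows intersection-over-union, a greedy score-ordered overlap filter, and that threshold list. Greedy suppression is a standard technique chosen here as an assumption; the abstract does not say which overlap-removal rule the authors use, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def remove_overlaps(detections, iou_threshold=0.5):
    """detections: list of (box, score). Keep the best-scoring box of each overlap group."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


# IoU thresholds over which mAP50-95 is averaged: 0.50, 0.55, ..., 0.95.
MAP_IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]
```

When several detectors are ensembled, the same cell is often reported more than once, so a filter of this kind is a natural step before a second-stage classifier.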

Keywords: Bethesda cell detection, Pap smear cytology, computer vision, YOLO, U-Net, ensemble model, mAP50-95, medical imaging

172. ❌ ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

Authors: Tianze Xia, Zijian Ning, Zonglin Zhao, Mingjia Wang | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13938v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

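Scoring rationale: The paper focuses on multi-subject image generation in computer vision; its core contributions are retrieval-augmented pose guidance and a disentangled position embedding. It is entirely unrelated to the great majority of the large language model (LLM) keywords (such as LLMs, MoE, Scaling Laws, Instruction Tuning, RLHF, PEFT, Context Window, KV Cache, CoT, Agents, and Quantization). The only relevant keyword is 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation', because the paper explicitly uses a 'Retrieval-Augmented Pose (RAG-Pose)' pipeline. This is a retrieval-augmentation technique, although it is applied to image generation rather than text generation, so it is given 8 points (some relevance, but not a core LLM application). Other keywords such as 'AI for Science' are not directly related to the paper's computer vision application.

!!! tip deepseek-chat TL;DR

The paper addresses the conflict between identity preservation and precise pose control in multi-subject image generation. It proposes the ASTRA framework, which combines retrieval-augmented pose guidance with a disentangled position embedding, and reports state-of-the-art performance that maintains both high identity fidelity and pose accuracy under complex poses.

Abstract translation (Chinese)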

主题驱动的图像生成在创建个性化内容方面已展现出巨大成功，但其能力主要局限于常见姿态下的单一主体。现有方法在处理具有复杂、差异化动作的多个主体时面临一个根本性冲突：在保持个体身份的同时强制执行精确的姿态结构。由于外观与结构信号在模型架构中相互纠缠，这一挑战常导致身份融合与姿态失真。为解决此冲突，我们提出了ASTRA（基于定向检索增强的自适应合成框架），这是一种在统一扩散Transformer架构内将主体外观与姿态结构解耦的新型框架。ASTRA通过双管齐下的策略实现这一目标：首先采用检索增强姿态（RAG-Pose）流程，从精选数据库中提供清晰、明确的结构先验；随后，其核心生成模型通过我们提出的增强通用旋转位置编码（EURoPE）——一种非对称编码机制——学习处理这些双重视觉条件，该机制将身份标记与空间位置解耦，同时将姿态标记绑定至画布。与此同时，解耦语义调制（DSM）适配器将身份保持任务分流至文本条件流中。大量实验表明，我们的集成方法实现了卓越的解耦效果。在我们设计的基于COCO的复杂姿态基准测试中，ASTRA在姿态遵循度上达到了新的最优水平，同时在DreamBench中保持了高身份保真度与文本对齐度。

Abstract

Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model's architecture. To resolve this conflict, we introduce ASTRA (Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state-of-the-art in pose adherence, while maintaining high identity fidelity and text alignment in DreamBench.

Keywords: subject-driven image generation, multi-subject generation, retrieval-augmented pose guidance, disentangled position embedding, Diffusion Transformer, identity preservation, pose adherence, ASTRA framework

173. ❌ PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction

Authors: Xianggang Yu, Lingteng Qiu, Xiaohang Ren, Guanying Chen, Shuguang Cui, Xiaoguang Han, Baoyuan Wang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

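Scoring rationale: The paper studies facial avatar reconstruction and animation based on neural radiance fields (NeRF), which belongs to computer vision and graphics. It does not involve large language models (LLMs), innovations in the principles of deep learning techniques, or applications of large models in other domains. All scoring keywords concern LLMs, deep learning principles, or AI applications in science, whereas this paper focuses on a traditional computer vision task (3D reconstruction and animation), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes PartNerFace, a part-based neural radiance field method for reconstructing animatable facial avatars from monocular RGB video. It uses inverse skinning and a part-based deformation field to generalize to unseen facial expressions and capture fine-scale motion details, and reports better quantitative and qualitative results than existing methods.

Abstract translation (Chinese)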

本文提出PartNerFace，一种基于局部的神经辐射场方法，用于从单目RGB视频中重建可动画化的面部化身。现有解决方案要么仅通过形变模型参数对隐式网络进行条件约束，要么学习一个假想的规范辐射场，导致其难以泛化到未见过的面部表情并捕捉精细的运动细节。为解决这些挑战，我们首先基于参数化头部模型应用逆向蒙皮，将观测点映射至规范空间，随后通过基于局部的形变场对精细运动进行建模。我们的核心见解在于：不同面部区域的形变应当采用不同的建模方式。具体而言，我们的基于局部的形变场由多个局部多层感知机（MLPs）组成，以自适应地将规范空间划分为不同区域，其中三维点的形变通过软加权机制聚合所有局部MLPs的预测结果来计算。大量实验表明，我们的方法能够很好地泛化到未见过的表情，并能有效建模精细的面部运动，在定量与定性评估上均优于现有先进方法。

Abstract

We present PartNerFace, a part-based neural radiance fields approach for reconstructing animatable facial avatars from monocular RGB videos. Existing solutions either simply condition the implicit network with the morphable model parameters or learn an imaginary canonical radiance field, making them fail to generalize to unseen facial expressions and capture fine-scale motion details. To address these challenges, we first apply inverse skinning based on a parametric head model to map an observed point to the canonical space, and then model fine-scale motions with a part-based deformation field. Our key insight is that the deformation of different facial parts should be modeled differently. Specifically, our part-based deformation field consists of multiple local MLPs to adaptively partition the canonical space into different parts, where the deformation of a 3D point is computed by aggregating the prediction of all local MLPs by a soft-weighting mechanism. Extensive experiments demonstrate that our method generalizes well to unseen expressions and is capable of modeling fine-scale facial motions, outperforming state-of-the-art methods both quantitatively and qualitatively.

Keywords: neural radiance fields, facial avatar reconstruction, animatable avatar, part-based deformation, inverse skinning, monocular RGB video, fine-scale motion, parametric head model
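The soft-weighted aggregation described in the PartNerFace abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the weights are computed, so the softmax weighting here is an assumption, and the names `softmax` and `aggregate_deformation` and the example values are hypothetical. The per-part scores and offsets stand in for the outputs of the local MLPs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_deformation(part_scores, part_offsets):
    """Blend per-part 3D offsets into one offset using softmax weights.

    part_scores:  one affinity score per local predictor (assumed input).
    part_offsets: one (dx, dy, dz) prediction per local predictor.
    """
    weights = softmax(part_scores)
    return tuple(
        sum(w * offset[axis] for w, offset in zip(weights, part_offsets))
        for axis in range(3)
    )


# Hypothetical example: three local predictors with equal scores,
# so the blended offset is the plain average of their predictions.
offsets = [(0.3, 0.0, 0.0), (0.0, 0.3, 0.0), (0.0, 0.0, 0.3)]
blended = aggregate_deformation([0.0, 0.0, 0.0], offsets)
```

Because the weights are soft rather than one-hot, a point near a boundary between facial parts receives a smooth mixture of neighboring predictions instead of switching abruptly from one predictor to another.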

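174. ❌ Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model

Authors: Shuyun Wang, Hu Zhang, Xin Shen, Dadong Wang, Xin Yu Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13906v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score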
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

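Scoring rationale: The paper "Blind Bitstream-corrupted Video Recovery via Metadata-guided Diffusion Model" focuses on video recovery, a computer vision task, and proposes a metadata-guided diffusion method for blind recovery of bitstream-corrupted video. Its core techniques are the application of diffusion models to video processing and the use of video metadata such as motion vectors and frame types as corruption indicators. All scoring keywords relate directly to large language models (LLMs), innovations in deep learning principles, or AI applications in science. This paper studies a specific computer vision problem and involves no LLM techniques, no innovation in deep learning principles, and no AI-for-science application (such as bioinformatics), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies blind recovery of bitstream-corrupted video and proposes a Metadata-Guided Diffusion Model (M-GDM). By using video metadata as corruption indicators and designing a pseudo-mask predictor, it restores corrupted video content and reports superior performance on blind video recovery.

Abstract translation (Chinese)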

比特流受损视频恢复旨在修复在视频存储或传输过程中受损的真实内容。现有方法通常假设已提供受损区域的预定义掩码，但在实际场景中手动标注这些掩码既费力又不切实际。为解决这一局限，我们引入了一种新的盲视频恢复设置，消除了对预定义掩码的依赖。该设置面临两大挑战：准确识别受损区域以及从广泛且不规则的退化中恢复内容。我们提出了一种元数据引导的扩散模型（Metadata-Guided Diffusion Model, M-GDM）来应对这些挑战。具体而言，通过双流元数据编码器将视频固有元数据作为损坏指示器，该编码器分别嵌入运动向量和帧类型，再将其融合为统一表示。该表示在扩散过程的每一步中通过交叉注意力与受损的潜在特征交互。为保护完整区域，我们设计了一种先验驱动的掩码预测器，利用元数据和扩散先验生成伪掩码，通过硬掩码实现完整区域与恢复区域的分离与重组。为减轻因掩码不完美导致的边界伪影，后优化模块增强了完整区域与恢复区域之间的一致性。大量实验证明了我们方法的有效性及其在盲视频恢复中的优越性。代码发布于：https://github.com/Shuyun-Wang/M-GDM。

Abstract

Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corrupted regions and recovering content from extensive and irregular degradations. We propose a Metadata-Guided Diffusion Model (M-GDM) to tackle these challenges. Specifically, intrinsic video metadata are leveraged as corruption indicators through a dual-stream metadata encoder that separately embeds motion vectors and frame types before fusing them into a unified representation. This representation interacts with corrupted latent features via cross-attention at each diffusion step. To preserve intact regions, we design a prior-driven mask predictor that generates pseudo masks using both metadata and diffusion priors, enabling the separation and recombination of intact and recovered regions through hard masking. To mitigate boundary artifacts caused by imperfect masks, a post-refinement module enhances consistency between intact and recovered regions. Extensive experiments demonstrate the effectiveness of our method and its superiority in blind video recovery. Code is available at: https://github.com/Shuyun-Wang/M-GDM.

Keywords: blind video recovery, bitstream-corrupted video, diffusion model, metadata-guided, motion vectors, frame types, mask predictor, post-refinement
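The "separation and recombination of intact and recovered regions through hard masking" in the M-GDM abstract amounts to a per-pixel selection between two frames. The sketch below is a minimal illustration under stated assumptions, not the authors' code: the function name `recombine`, the nested-list frame format, and the mask polarity (1 marks a corrupted pixel) are all hypothetical.

```python
def recombine(corrupted, recovered, mask):
    """Hard-mask recombination of two frames of equal size.

    Where mask is 1 (assumed to mean "corrupted"), take the recovered pixel;
    where mask is 0, keep the original pixel so intact regions are untouched.
    """
    return [
        [rec if m else cor for cor, rec, m in zip(cor_row, rec_row, mask_row)]
        for cor_row, rec_row, mask_row in zip(corrupted, recovered, mask)
    ]


# Hypothetical 2x2 single-channel frame with one corrupted pixel.
corrupted_frame = [[1, 2], [3, 4]]
recovered_frame = [[9, 9], [9, 9]]
pseudo_mask = [[0, 1], [0, 0]]
result = recombine(corrupted_frame, recovered_frame, pseudo_mask)
```

A hard (binary) mask guarantees that intact pixels pass through unchanged, at the cost of possible seams where the predicted mask is imperfect, which is the boundary artifact the abstract's post-refinement module is said to address.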

175. ❌ Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias

Authors: Zhiyuan Xu, Jiuming Liu, Yuxin Chen, Masayoshi Tomizuka, Chenfeng Xu, Chensheng Peng Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13905v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

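Scoring rationale: The paper proposes SparseGen, a new sparse-query framework for image-to-3D generation, and mainly concerns computer vision and 3D generation. It is unrelated to most keywords, which target large language models and related techniques. The only connection is to "Mixture of Experts OR MoE OR Sparse Models": the paper's core contribution is using sparse queries and sparse set-latent expansion to model 3D scenes efficiently, which relates to the notion of "sparse models" in a broad sense but does not specifically address MoE architectures. That keyword therefore receives 5 points (some relevance), and all other keywords receive 0 (entirely unrelated).

!!! tip deepseek-chat TL;DR

The paper proposes SparseGen, a new sparse-query framework for efficient, low-bias image-to-3D generation. Sparse 3D anchor queries and an expansion operator significantly reduce memory use and inference time while preserving multi-view fidelity.

Abstract translation (Chinese)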

本文提出SparseGen——一种高效图像到三维生成的新型框架，该框架在显著提升生成速度的同时展现出较低输入视角偏差。与传统依赖密集体素网格、三平面或像素对齐基元的方法不同，我们采用紧凑的稀疏学习三维锚点查询集合与学习扩展算子对场景进行建模，该算子可将每个变换后的查询解码为局部小型三维高斯基元集合。在无三维监督的修正流重建目标训练下，我们的模型学会在几何与外观关键区域分配表征容量，在保持多视角保真度的同时显著降低内存消耗与推理时间。我们引入输入视角偏差与利用率的量化指标，证明稀疏查询能够减少对条件视角的过拟合，同时保持表征高效性。实验结果表明，稀疏集合潜在扩展是高效三维生成建模中一种具有理论依据且实用的替代方案。

Abstract

We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.

Keywords: image-to-3D generation, sparse queries, 3D Gaussian primitives, efficiency, input-view bias, rectified-flow reconstruction, SparseGen, representation capacity

176. ❌ Context Sensitivity Improves Human-Machine Visual Alignment

Authors: Frieda Born, Tom Neuhäuser, Lukas Muttenthaler, Brett D. Roads, Bernhard Spitzer, Andrew K. Lampinen, Matt Jones, Klaus-Robert Müller, Michael C. Mozer Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13883v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

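Scoring rationale: The paper studies context-sensitive similarity computation in computer vision, aiming to improve human-machine visual alignment. Its core topics are vision foundation models and human alignment, so it has some connection to the keyword 'Instruction Tuning OR Alignment OR Value Alignment' (5 points) because it involves the concept of human alignment. The other keywords mainly target large language models (LLMs), training techniques, reasoning methods, agent systems, and similar topics. This paper focuses on vision models and alignment with human cognition and has no direct connection to those areas, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes a method for computing context-sensitive similarity from neural network embeddings and uses it to model a triplet odd-one-out task. Modeling context yields up to a 15% improvement in accuracy over a context-insensitive model, and the improvement is consistent across original and human-aligned vision foundation models.

Abstract translation (Chinese)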

现代机器学习模型通常将输入表示为高维嵌入空间中的固定点。尽管这种方法已被证明在广泛的下游任务中具有强大性能，但其本质上与人类处理信息的方式存在差异。由于人类持续适应环境，他们以高度上下文敏感的方式表征对象及其关系。为弥合这一差距，我们提出一种基于神经网络嵌入的上下文敏感相似度计算方法，并将其应用于建模以锚定图像作为同步上下文的三元组异常项识别任务。通过引入上下文建模，我们在异常项识别准确率上相比非上下文敏感模型实现了高达15%的提升。研究发现，这一改进在原始视觉基础模型与“人类对齐”视觉基础模型中均保持一致。

Abstract

Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has been proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans are constantly adapting to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for context-sensitive similarity computation from neural network embeddings, applied to modeling a triplet odd-one-out task with an anchor image serving as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and “human-aligned” vision foundation models.

Keywords: context-sensitive similarity, neural network embeddings, human-machine alignment, vision foundation models, odd-one-out task, context modeling, human-aligned models
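For orientation, a context-insensitive odd-one-out decision of the kind the abstract uses as its point of comparison can be sketched from fixed embeddings: find the most similar pair among the three items and return the item left out. This is a common baseline formulation, not the paper's context-sensitive method, which the abstract does not specify. The names `cosine` and `odd_one_out` and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(embeddings):
    """Return the index of the item excluded from the most similar pair.

    embeddings: exactly three vectors, one per item in the triplet.
    """
    # Each entry is (i, j, left_out): a candidate pair and the remaining item.
    candidates = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]
    best = max(candidates, key=lambda c: cosine(embeddings[c[0]], embeddings[c[1]]))
    return best[2]


# Hypothetical 2-D embeddings: items 0 and 1 point in nearly the same direction.
triplet = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
choice = odd_one_out(triplet)
```

Because the embeddings are fixed points, this baseline gives the same answer regardless of what else is shown alongside the triplet, which is the limitation the paper's context modeling is meant to address.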

177. ❌ PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

Authors: Zebei Tong, Hongchang Chen, Yujie Lei, Gang Chen, Yushi Liu, Zhi Zheng, Hao Chen, Jieming Zhang, Ying Li, Dongpu Cao Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

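Scoring rationale: The paper focuses on image generation for industrial scenarios, specifically anomaly image generation that accounts for assembly relationships, to improve anomaly detection models. Its core techniques are diffusion models and conditional generation, involving feature decoupling, temporal modulation, and geometric priors. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection, because the paper applies AI in industrial science (industrial scenarios can be regarded as a subfield of scientific applications). However, the paper involves neither bioinformatics nor cheminformatics and does not explicitly mention large models or innovations in deep learning principles, so this keyword is scored 5 (some relevance). The other 26 keywords are entirely unrelated: the paper does not involve large language models (LLMs), model architectures (such as MoE), training methods (such as pre-training, fine-tuning, or alignment), inference optimization (such as quantization or acceleration), agent systems, interpretability, or other large-model techniques, nor any biology- or chemistry-specific application.

!!! tip deepseek-chat TL;DR

The paper proposes PostureObjectStitch, an image generation method that uses condition decoupling, feature temporal modulation, and a geometric prior to generate anomaly images that respect assembly relationships in industrial scenarios, supplementing real data and improving downstream anomaly detection models.

Abstract translation (Chinese)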

图像生成技术能够合成特定条件下的图像，以补充现实工业异常数据并提升异常检测模型性能。现有生成技术很少考虑工业组件在装配中的姿态与方向，导致生成图像难以应用于下游任务。为此，我们提出一种名为PostureObjectStitch的新型图像合成方法，通过精确生成满足工业装配需求的图像来解决该问题。我们引入一种条件解耦方法，将输入的多视角图像分解为高频特征、纹理特征和RGB特征。特征时序调制机制使这些特征在扩散模型的时间步中自适应调整，实现从粗粒度到细粒度的渐进式生成，同时保持一致性。为确保语义准确性，我们提出一种增强关键工业元素的约束损失函数，以及一种引导组件定位以形成正确装配关系的几何先验。在MureCom数据集、我们新构建的DreamAssembly数据集及下游应用上的综合实验结果验证了本方法的优异性能。

Abstract

Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.

Keywords: image generation, industrial anomaly detection, assembly relationships, diffusion model, condition decoupling, feature temporal modulation, geometric prior, PostureObjectStitch

178. ❌ DiffMagicFace: Identity Consistent Facial Editing of Real Videos

Authors: Huanghao Yin, Shenkun Xu, Kanle Shi, Junhai Yong, Bin Wang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13841v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

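Scoring rationale: The paper focuses on diffusion-based facial video editing, involving diffusion model fine-tuning, dataset construction, and optimization algorithms, but does not involve large language models (LLMs), innovations in deep learning principles, or AI applications in science. All keywords relate to LLMs, deep learning principles, or AI for science, whereas this paper studies an application of diffusion models in computer vision, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes the DiffMagicFace framework, which integrates two fine-tuned models to address identity consistency and cross-frame consistency in facial video editing, achieving high-quality facial video edits.

Abstract translation (Chinese)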

文本条件驱动的图像编辑技术已从图像扩散模型的发展中显著获益。然而，将这些技术扩展到面部视频编辑领域时，面临着如何在源视频中保持面部身份一致性以及确保编辑对象在帧间连贯性的挑战。本文提出DiffMagicFace，一种独特的视频编辑框架，它整合了两个分别针对文本和图像控制进行微调的模型。这些模型在推理过程中并行运作，以生成既能保持身份特征，又能与编辑语义无缝契合的视频帧。为确保编辑视频的连贯性，我们构建了一个数据集，其中包含每个编辑对象展现不同面部视角的图像。该数据集的创建通过渲染技术及后续的优化算法实现。值得注意的是，我们的方法不依赖于视频数据集，却能在一致性与内容质量上均取得优异效果。即使对于如说话头部视频和区分高度相似类别等复杂任务，该方法依然表现卓越。使用本框架编辑的视频与采用传统渲染软件制作的视频效果相当。通过与当前先进方法的对比分析，我们的框架在视觉吸引力和量化指标上均展现出更优性能。

Abstract

Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The creation of a data set is achieved through rendering techniques and the subsequent application of optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The excellent effect holds even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework exhibit parity with videos that are made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.

Keywords: Diffusion Models, Facial Video Editing, Identity Consistency, Text-conditioned Editing, Video Frame Generation, Dataset Construction, Optimization Algorithms, Talking Head Videos

179. ❌ Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image

Authors: Yujie Gao, Yao Xiao, Xiangnan Zhu, Ya Li, Yiyi Zhang, Liqing Zhang, Jianfu Zhang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13856v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

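Scoring rationale: The paper focuses on 3D avatar reconstruction in computer vision, using a 3D Gaussian representation and conditional denoising, and has no direct connection to any scoring keyword (all of which concern large language models, deep learning principles, or AI applications in science). It does not mention language models, model training, inference optimization, alignment techniques, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes Any3DAvatar, a fast, high-quality method for reconstructing a full 3D Gaussian head avatar from a single portrait image. It addresses the speed-quality trade-off of existing methods, completing reconstruction in under one second while preserving high-fidelity geometry and texture.

Abstract translation (Chinese)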

从单张肖像重建完整三维头部模型仍具挑战性，因为现有方法始终面临质量与速度间的显著权衡：高保真流程通常依赖多阶段处理与逐对象优化，而快速前馈模型则难以兼顾完整几何结构与精细外观细节。为弥合这一差距，我们提出Any3DAvatar——一种快速、高质量的单图像三维高斯头部数字人生成方法，其最快配置可在1秒内重建完整头部，同时保持高保真几何与纹理。首先，我们构建了AnyHead统一数据集，融合了身份多样性、密集多视角监督与真实配饰元素，填补了现有头部数据在覆盖范围、全头几何与复杂外观方面的主要空白。其次，我们摒弃非结构化噪声采样，转而从具备普吕克坐标感知（Plücker-aware）的结构化三维高斯骨架初始化，并执行一步条件去噪，将全头重建转化为单次前向传播过程，同时维持高保真度。第三，我们在三维高斯重建基础上，对相同隐空间标记引入辅助视角条件外观监督，在不增加推理开销的前提下提升新视角纹理细节。实验表明，Any3DAvatar在渲染保真度上超越现有单图像全头重建方法，同时保持显著的速度优势。

Abstract

Reconstructing a complete 3D head from a single portrait remains challenging because existing methods still face a sharp quality-speed trade-off: high-fidelity pipelines often rely on multi-stage processing and per-subject optimization, while fast feed-forward models struggle with complete geometry and fine appearance details. To bridge this gap, we propose Any3DAvatar, a fast and high-quality method for single-image 3D Gaussian head avatar generation, whose fastest setting reconstructs a full head in under one second while preserving high-fidelity geometry and texture. First, we build AnyHead, a unified data suite that combines identity diversity, dense multi-view supervision, and realistic accessories, filling the main gaps of existing head data in coverage, full-head geometry, and complex appearance. Second, rather than sampling unstructured noise, we initialize from a Plücker-aware structured 3D Gaussian scaffold and perform one-step conditional denoising, formulating full-head reconstruction into a single forward pass while retaining high fidelity. Third, we introduce auxiliary view-conditioned appearance supervision on the same latent tokens alongside 3D Gaussian reconstruction, improving novel-view texture details at zero extra inference cost. Experiments show that Any3DAvatar outperforms prior single-image full-head reconstruction methods in rendering fidelity while remaining substantially faster.

Keywords: 3D avatar reconstruction, single portrait image, 3D Gaussian head avatar, full-head reconstruction, fast reconstruction, high-fidelity geometry, novel-view texture, AnyHead data suite
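
The abstract mentions a Plücker-aware 3D Gaussian scaffold without spelling out the encoding. As a hedged illustration, the sketch below computes the standard Plücker coordinates of a camera ray, (d, o × d), for a ray with origin o and unit direction d. The function names (`cross`, `plucker_ray`) are hypothetical and are not taken from the Any3DAvatar code; how the paper feeds these coordinates into its scaffold is not stated in the abstract.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Right-handed cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin: Vec3, direction: Vec3) -> Tuple[float, ...]:
    """Return the 6-D Plücker coordinates (d, o x d) of a ray.

    The direction is normalised first, so the result does not depend on
    the length of the direction vector passed in.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)  # o x d, perpendicular to d by construction
    return d + moment
```

A useful sanity check is that the direction and moment parts are always orthogonal, because o × d is perpendicular to d.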

180. ❌ A Resource-Efficient Hybrid CNN-LSTM network for image-based bean leaf disease classification

Authors: Hye Jin Rhee, Joseph Damilola Akinyemi Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13835v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0
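
The report header gives the pass mark as 5.0 + 0.8 × total weight, which is 26.6 for 27 keywords at weight 1.0. The sketch below reproduces only that threshold check. The names `pass_mark` and `is_pass` are hypothetical, and treating a score equal to the threshold as a pass is an assumption; the report does not say how per-keyword scores are derived, so that step is left out.

```python
BASE_SCORE = 5.0   # constant term of the pass-mark formula in the report header
PER_WEIGHT = 0.8   # multiplier applied to the total keyword weight


def pass_mark(total_weight: float) -> float:
    """Pass mark = 5.0 + 0.8 * total keyword weight."""
    return BASE_SCORE + PER_WEIGHT * total_weight


def is_pass(total_score: float, total_weight: float) -> bool:
    """Assumption: a score equal to the pass mark counts as a pass."""
    return total_score >= pass_mark(total_weight)
```

With a total weight of 27.0, a paper scoring 0.0 falls below the threshold, which matches the ❌ shown for this entry.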

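Scoring rationale: The paper addresses an image classification task in agriculture, using a hybrid CNN-LSTM architecture to classify bean leaf diseases; it is deep learning applied to a specific scientific domain (agriculture). It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, agents, and so on), which target large language models (LLMs) and related techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to agricultural science, which counts as a scientific application in the broad sense, but it is not core bioinformatics or cheminformatics, hence a relevance of 5 out of 10 (some connection).

!!! tip deepseek-chat TL;DR

This paper proposes a lightweight hybrid CNN-LSTM network for image-based bean leaf disease classification. It reaches 94.38% accuracy with a model size of only 1.86 MB, supporting real-time agricultural decision-making in resource-constrained environments.

Abstract (Chinese translation)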

精准且资源高效的自动化诊断是现代农业专家系统的基石。尽管卷积神经网络（CNNs）已在植物病理学领域确立了性能基准，但其捕捉长距离空间依赖性的能力常受限于标准池化层，且其高内存占用阻碍了在便携设备上的部署。本文提出了一种用于豆类叶片病害分类的轻量级混合CNN-LSTM系统。通过集成长短期记忆网络（LSTM）层来建模特征图内的空间序列关系，我们的混合架构实现了94.38%的准确率，同时保持了仅1.86 MB的极小模型体积；相较于传统基于CNN的系统，体积减少了70%。此外，我们对图像增强策略进行了系统评估，证明为保持诊断模式的完整性，定制化的图像变换优于通用组合。在ibean数据集上的结果证实，所提出的系统结合EfficientNet-B7与LSTM实现了99.22%的最新最优F1分数，为资源受限环境下的实时农业决策支持提供了一个鲁棒且可扩展的框架。本研究中使用的代码与增强数据集已在GitHub仓库（https://github.com/HJin-R/bean_disease）中公开。

Abstract

Accurate and resource-efficient automated diagnosis is a cornerstone of modern agricultural expert systems. While Convolutional Neural Networks (CNNs) have established benchmarks in plant pathology, their ability to capture long-range spatial dependencies is often limited by standard pooling layers, and their high memory footprint hinders deployment on portable devices. This paper proposes a lightweight hybrid CNN-LSTM system for bean leaf disease classification. By integrating an LSTM layer to model the spatial-sequential relationships within feature maps, our hybrid architecture achieves 94.38% accuracy while maintaining an exceptionally small footprint of 1.86 MB, a 70% reduction in size compared to traditional CNN-based systems. Furthermore, we provide a systematic evaluation of image augmentation strategies, demonstrating that tailored transformations are superior to generic combinations for maintaining the integrity of diagnostic patterns. Results on the ibean dataset confirm that the proposed system achieves a state-of-the-art F1 score of 99.22% with EfficientNet-B7+LSTM, providing a robust and scalable framework for real-time agricultural decision support in resource-constrained environments. The code and augmented datasets used in this study are publicly available in the authors' GitHub repository (https://github.com/HJin-R/bean_disease).

Keywords: CNN-LSTM hybrid network, bean leaf disease classification, resource-efficient, lightweight model, image augmentation, agricultural expert systems, real-time decision support, model compression
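
The abstract describes an LSTM that models spatial-sequential relationships within CNN feature maps. A minimal sketch of the step that makes this possible, turning a feature map into a sequence, is shown below. The row-major scan order and the function names are assumptions; the abstract does not say how the authors order spatial positions. The second helper only restates the reported arithmetic: a 1.86 MB model that is 70% smaller implies a baseline of about 6.2 MB.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as fmap[channel][row][col]


def feature_map_to_sequence(fmap: FeatureMap) -> List[List[float]]:
    """Flatten a C x H x W feature map into H*W time steps of length C.

    Positions are visited row by row (assumed order), so each step holds
    the channel vector at one spatial location, ready for a recurrent layer.
    """
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][h][w] for c in range(channels)]
        for h in range(height)
        for w in range(width)
    ]


def implied_baseline_size(reduced_mb: float, reduction: float) -> float:
    """Size before a fractional reduction, e.g. reduction=0.70 for 70%."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return reduced_mb / (1.0 - reduction)
```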

181. ❌ DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

Authors: Rejoy Chakraborty, Prasun Roy, Saumik Bhattacharya, Umapada Pal Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13797v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

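Scoring rationale: The paper addresses few-shot font generation, a computer vision task, and proposes a contrastive-learning method that disentangles style from content. Although it is an AI application, every scoring keyword targets large language models (LLMs) and related techniques (training methods, inference optimization, alignment, agent systems, and so on). The paper involves no language-model technology and no scientific AI application such as bioinformatics or cheminformatics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes DRG-Font, a few-shot font generation method that disentangles style and content through contrastive learning and adds a dynamic reference selection mechanism. The authors report significant improvements over existing methods on multiple benchmarks.

Abstract (Chinese translation)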

少样本字体生成旨在通过少量参考字形生成风格一致的字符。然而，从少量样本中捕捉复杂字体风格仍具挑战性，现有方法往往难以在生成样本中保留可辨识的局部特征。本文提出DRG-Font，这是一种通过解耦风格与内容嵌入空间来学习复杂字形属性的对比字体生成策略。为实现最优风格监督，所提架构引入参考选择模块（Reference Selection Module, RS Module），动态从候选池中选取最佳风格参考。网络通过多尺度风格头模块（Multi-scale Style Head Block, MSHB）与多尺度内容头模块（Multi-scale Content Head Block, MCHB）学习将字形属性分解为风格先验和形状先验。在风格适配阶段，多重融合上采样模块（Multi-Fusion Upsampling Block, MFUB）通过融合参考风格先验与目标内容先验生成目标字形。实验表明，该方法在多项视觉与分析基准测试中均显著优于现有先进方法。

Abstract

Few-shot Font Generation aims to generate stylistically consistent glyphs from a few reference glyphs. However, capturing complex font styles from a few exemplars remains challenging, and the existing methods often struggle to retain discernible local characteristics in generated samples. This paper introduces DRG-Font, a contrastive font generation strategy that learns complex glyph attributes by decomposing style and content embedding spaces. For optimal style supervision, the proposed architecture incorporates a Reference Selection (RS) Module to dynamically select the best style reference from an available pool of candidates. The network learns to decompose glyph attributes into style and shape priors through a Multi-scale Style Head Block (MSHB) and a Multi-scale Content Head Block (MCHB). For style adaptation, a Multi-Fusion Upsampling Block (MFUB) produces the target glyph by combining the reference style prior and target content prior. The proposed method demonstrates significant improvements over state-of-the-art approaches across multiple visual and analytical benchmarks.

Keywords: Few-shot Font Generation, Style-Content Disentanglement, Contrastive Learning, Dynamic Reference Selection, Multi-scale Head Blocks, Glyph Generation, Visual Benchmark, Style Adaptation
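
The Reference Selection (RS) Module is described as picking the best style reference from a pool of candidates. The abstract does not give its scoring rule, so the sketch below substitutes a plain cosine-similarity ranking over embeddings as a stand-in. The embeddings, the `select_reference` name, and the similarity criterion are all assumptions for illustration, not the authors' learned module.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_reference(
    target: Sequence[float], candidates: List[Sequence[float]]
) -> Tuple[int, float]:
    """Return (index, similarity) of the candidate closest to the target."""
    if not candidates:
        raise ValueError("candidate pool is empty")
    scores = [cosine(target, cand) for cand in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Selection happens per target glyph, so different characters can draw on different references from the same pool, which is the sense of "dynamic" used here.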

182. ❌ Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training

Authors: Nghia, Nguyen, Amer Wahed, Andy Quesada, Yasir Ali, Hanadi El Achi, Y. Helen Zhang, Jocelyn Ursua, Alex Banerjee, Sahib Kalra, L. Jeffrey Medeiros, Jie Xu Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13795v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

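Scoring rationale: The paper uses a Vision Transformer for lymphoma diagnosis from medical images, an application of computer vision in biomedicine. It is unrelated to the vast majority of keywords (large language model techniques, training methods, inference optimization, and so on), which target text/language models rather than vision models. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies deep learning to biomedical diagnosis (lymphoma classification), an application of AI in science/bioinformatics, hence a relevance of 10 out of 10 (highly relevant).

!!! tip deepseek-chat TL;DR

This study develops a Vision Transformer model trained with weak supervision to distinguish anaplastic large cell lymphoma from classic Hodgkin lymphoma. Trained on 100,000 image patches, it reaches 91.85% accuracy and an AUC of 0.98, which the authors take as evidence that the approach is practical for developing clinical deep learning models.

Abstract (Chinese translation)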

视觉变换器（Vision Transformers，ViT）已被证明能够实现更灵活的特征检测，并在充足数据预训练条件下可超越卷积神经网络（Convolutional Neural Network，CNN）的性能。鉴于其出色的特征检测能力，我们采用ViT对间变性大细胞淋巴瘤（Anaplastic Large Cell Lymphoma，ALCL）与经典霍奇金淋巴瘤（Classic Hodgkin Lymphoma，cHL）进行形态学分类。我们先前设计了一个ViT模型，该模型在完全监督训练模式下使用1,200个图像块的小型数据集进行训练，并在独立测试集上取得了100%的诊断准确率和1.0的F1分数。由于完全监督训练在训练和测试阶段均需大量专业资源，在实际应用中并不可行，因此我们近期研究了一种改进的训练数据方法（弱监督训练），并证明在全切片图像（Whole-Slide-Image）的玻片级别自动标注训练图像块，是视觉变换器临床应用中更为可行的解决方案。我们的ViT模型在100,000个图像块的更大数据集上训练后，评估指标显示出显著性能：准确率、F1分数和曲线下面积（Area Under the Curve，AUC）分别达到91.85%、0.92和0.98。这些指标表现优异，表明该采用弱监督训练的ViT模型可作为临床模型开发中深度学习模块的适用工具，配合自动化的图像块提取技术。

Abstract

Vision transformers (ViTs) have been shown to allow for more flexible feature detection and can outperform convolutional neural networks (CNNs) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model that was trained on a small dataset of 1,200 image patches with fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not practical, owing to the lack of expert resources in both the training and testing phases, we recently studied a modified approach to training data (weakly supervised training) and show that automatically labeling training image patches at the slide level of each whole-slide image is a more practical solution for clinical use of Vision Transformers. Our ViT model, trained on a larger dataset of 100,000 image patches, yields an accuracy, F1 score, and area under the curve (AUC) of 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.

Keywords: Vision Transformer, lymphoma diagnosis, weakly supervised training, anaplastic large cell lymphoma, classic Hodgkin lymphoma, whole-slide-image, deep learning, clinical model development
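
Weak supervision here means each patch inherits the diagnosis of the whole-slide image it was cut from, with no per-patch annotation. The sketch below shows that label propagation in isolation. The `Slide` type, its field names, and the example identifiers are hypothetical; patch extraction and the ViT itself are out of scope.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Slide:
    slide_id: str
    label: str                  # slide-level diagnosis, e.g. "ALCL" or "cHL"
    patch_ids: Tuple[str, ...]  # patches extracted automatically from the slide


def weakly_label_patches(slides: List[Slide]) -> List[Tuple[str, str]]:
    """Assign every patch the label of its parent slide.

    No pathologist reviews individual patches, so some patches may carry
    a label that does not match their local content (label noise).
    """
    return [(pid, slide.label) for slide in slides for pid in slide.patch_ids]
```

The trade-off is the one the abstract points to: labels are cheap enough to scale from 1,200 to 100,000 patches, at the cost of noisier supervision.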

183. ❌ From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

Authors: Mohammad Mahdi, Nedko Savov, Danda Pani Paudel, Luc Van Gool Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13793v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

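Scoring rationale: The paper addresses video generation in computer vision (Exo-to-Ego video synthesis), using diffusion models and Transformer architectures to handle spatio-temporal discontinuities in cross-view video generation. The scoring keywords all concern large language models (LLMs) and related techniques (MoE, RLHF, RAG, and so on), model optimization methods (quantization, inference acceleration), or specific application areas (AI for Science). The paper involves no language models, natural language processing, or large-model principles, and it is not applied to a scientific field such as bioinformatics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes Syn2Seq-Forcing, which recasts Exo-to-Ego video generation as continuous sequence modeling and uses video interpolation with diffusion transformers to address the spatio-temporal discontinuities of cross-view video synthesis.

Abstract (Chinese translation)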

外视角至内视角视频生成旨在从同步的第三人称视角及相应相机位姿合成第一人称视频。虽然存在配对监督数据，同步的外视角-内视角数据本质上引入了显著的时空与几何不连续性，这违背了标准视频生成基准所依赖的平滑运动假设。我们将这种由同步性引发的跳跃问题识别为核心挑战，并提出Syn2Seq-Forcing——一种通过在源视频与目标视频之间进行插值以形成单一连续信号的序列化建模框架。通过将Exo2Ego任务重新定义为序列信号建模而非传统的条件-输出任务，我们的方法使基于扩散的序列模型（例如扩散强制变换器，DFoT）能够更有效地捕捉跨帧的连贯过渡。实验表明，仅对视频进行插值（无需执行位姿插值）已能带来显著性能提升，这强调了主要困难源于时空不连续性。除了直接的性能改进，该框架建立了一个通用且灵活的架构，能够将Exo2Ego与Ego2Exo生成任务统一在单一的连续序列模型中，为未来跨视角视频合成研究提供了原则性基础。

Abstract

Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g., Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation, already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.

Keywords: Exo-to-Ego video generation, video synthesis, diffusion models, sequence modeling, spatio-temporal discontinuities, interpolation, cross-view generation, Diffusion Forcing Transformers
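
The core idea is to bridge the jump between the exo and ego clips with interpolated frames so that the model sees one continuous sequence. The sketch below illustrates this with a linear cross-fade in pixel space between the last exo frame and the first ego frame. That blending rule, the flat-list frame representation, and the function names are assumptions; the abstract does not specify the interpolation the authors use.

```python
from typing import List

Frame = List[float]  # a frame flattened to a list of pixel intensities


def lerp_frame(a: Frame, b: Frame, t: float) -> Frame:
    """Blend two frames: t=0 gives a, t=1 gives b."""
    if len(a) != len(b):
        raise ValueError("frames must have the same size")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def build_sequence(exo: List[Frame], ego: List[Frame], n_bridge: int) -> List[Frame]:
    """Concatenate exo frames, n_bridge interpolated frames, and ego frames."""
    if not exo or not ego:
        raise ValueError("both clips need at least one frame")
    if n_bridge < 0:
        raise ValueError("n_bridge must be non-negative")
    bridge = [
        lerp_frame(exo[-1], ego[0], (i + 1) / (n_bridge + 1))
        for i in range(n_bridge)
    ]
    return exo + bridge + ego
```

Because the bridge runs from source to target, reversing the argument order gives the Ego2Exo direction with the same code, which mirrors the unification claimed in the abstract.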

184. ❌ PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation

Authors: Chen Wang, Yixin Zhu, Yongbin Zhu, Fengyuan Shi, Qi Li, Jun Wang, Zuozhu Liu, Keli Hu Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13791v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

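Scoring rationale: The paper addresses medical image segmentation and proposes an improved U-Net architecture (PBE-UNet) for lesion segmentation in ultrasound images. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on), which all target large language models (LLMs) and related techniques. The only marginally relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies AI to biomedicine (ultrasound image analysis), which falls under AI for Science, but its core focus is not large models or innovation in the principles of deep learning, hence a relevance of 5 out of 10 (some connection).

!!! tip deepseek-chat TL;DR

This paper proposes PBE-UNet, a lightweight progressive boundary-enhanced U-Net. Using a scale-aware aggregation module and a boundary-guided feature enhancement module, it tackles the low contrast, blurry boundaries, and scale variation that make lesion segmentation in ultrasound images difficult, and it outperforms existing methods on several benchmark datasets.

Abstract (Chinese translation)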

超声图像中病灶的精确分割对于预防性筛查与临床诊断至关重要，但由于图像对比度低、边界模糊以及尺度差异显著，这仍是一项具有挑战性的任务。尽管现有的基于深度学习的方法已取得显著成效，但这些方法在处理尺度变化和模糊肿瘤边界时仍存在困难。为解决这些挑战，我们提出了一种渐进式边界增强U-Net（PBE-UNet）。具体而言，我们首先引入了一个尺度感知聚合模块（Scale-Aware Aggregation Module, SAAM），该模块能动态调整其感受野以捕获鲁棒的多尺度上下文信息。随后，我们提出了边界引导特征增强（Boundary-Guided Feature Enhancement, BGFE）模块来强化特征表示。我们发现狭窄的边界区域与较宽的分割误差区域之间存在较大差异。与现有方法将边界视为静态掩码不同，BGFE模块将狭窄的边界预测逐步扩展为更宽的空间注意力图。因此，更宽的空间注意力图能够有效覆盖更广泛的分割误差区域，并增强模型对这些困难区域的关注。我们在四个超声基准数据集（BUSI、Dataset B、TN3K和BP）上进行了大量实验。实验结果表明，我们提出的PBE-UNet在性能上超越了当前最先进的超声图像分割方法。代码公开于https://github.com/cruelMouth/PBE-UNet。

Abstract

Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and significant scale variations. Although existing deep learning-based methods have achieved remarkable performance, these methods still struggle with scale variations and indistinct tumor boundaries. To address these challenges, we propose a progressive boundary-enhanced U-Net (PBE-UNet). Specifically, we first introduce a scale-aware aggregation module (SAAM) that dynamically adjusts its receptive field to capture robust multi-scale contextual information. Then, we propose a boundary-guided feature enhancement (BGFE) module to enhance the feature representations. We find that there are large gaps between the narrow boundary and the wide segmentation error areas. Unlike existing methods that treat boundaries as static masks, the BGFE module progressively expands the narrow boundary prediction into broader spatial attention maps. Thus, broader spatial attention maps can effectively cover the wider segmentation error regions and enhance the model's focus on these challenging areas. We conduct extensive experiments on four benchmark ultrasound datasets: BUSI, Dataset B, TN3K, and BP. The experimental results show that our proposed PBE-UNet outperforms state-of-the-art ultrasound image segmentation methods. The code is at https://github.com/cruelMouth/PBE-UNet.

Keywords: ultrasound image segmentation, PBE-UNet, scale-aware aggregation, boundary enhancement, medical image analysis, deep learning, U-Net, lesion segmentation
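
The BGFE module is described as progressively widening a narrow boundary prediction into broader spatial attention maps. One simple way to picture that widening is repeated morphological dilation of a binary boundary mask, sketched below. Dilation with a 4-neighbourhood is an assumed stand-in; the abstract does not say which expansion operator the module actually uses, and the function names are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask, 1 marks a boundary pixel


def dilate(mask: Mask) -> Mask:
    """Grow the mask by one pixel in the four axis-aligned directions."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        out[ny][nx] = 1
    return out


def progressive_maps(boundary: Mask, steps: int) -> List[Mask]:
    """Return the original mask followed by `steps` successively wider ones."""
    maps = [boundary]
    for _ in range(steps):
        maps.append(dilate(maps[-1]))
    return maps
```

Each later map covers a wider band around the boundary, which is how a thin boundary prediction can end up covering the broader regions where segmentation errors tend to fall.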

185. ❌ Temporally Consistent Long-Term Memory for 3D Single Object Tracking

Authors: Jaejoon Yoo, SuBeen Lee, Yerim Jeon, Miso Lee, Jae-Pil Heo Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13789v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

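Scoring rationale: The paper addresses 3D single object tracking (3D-SOT), a computer vision task, and proposes ChronoTrack, a long-term memory framework for tracking targets in LiDAR point cloud sequences. Its core techniques are a temporal consistency loss, memory cycle consistency, memory tokens, and real-time performance optimization. The given keywords all relate to large language models (LLMs), the principles of deep learning techniques, or applications of AI in science (such as bioinformatics), whereas this paper is about 3D object tracking in computer vision, an entirely different topic, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

Addressing the restriction of existing 3D single object tracking methods to short-term context, this paper proposes ChronoTrack, which maintains long-term memory through a temporal consistency loss and a memory cycle consistency loss. It achieves state-of-the-art performance on multiple benchmarks and runs in real time at 42 FPS on a single RTX 4090 GPU.

Abstract (Chinese translation)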

三维单目标跟踪（3D Single Object Tracking，3D-SOT）旨在给定目标物体在第一帧中的三维边界框，在连续的激光雷达点云序列中对其进行定位。现有方法多采用基于记忆的框架以利用先前观测到的目标特征，但仍局限于仅使用最近少数帧的信息。本文揭示，由于严重的时间特征不一致性和过高的内存开销，这些方法的时间容量本质上受限于短期上下文。为此，我们提出了一种鲁棒的长时三维单目标跟踪框架 ChronoTrack，该框架在通过长时记忆高效聚合多样化目标特征的同时，保持了时间特征的一致性。基于一组紧凑的可学习记忆令牌，ChronoTrack 通过两个互补的目标利用长时信息：时间一致性损失和记忆循环一致性损失。前者强制实现跨帧的特征对齐，缓解时间漂移并提升所提出的长时记忆的可靠性；同时，后者通过“记忆-点-记忆”循环游走机制，促使每个令牌编码在整个序列中观测到的多样化且具有判别性的目标表征。实验结果表明，ChronoTrack 在多个三维单目标跟踪基准测试中取得了新的最优性能，证明了其在紧凑内存下进行长时目标建模的有效性，并在单块 RTX 4090 GPU 上实现了 42 FPS 的实时运行速度。代码发布于 https://github.com/ujaejoon/ChronoTrack。

Abstract

3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves the temporal feature consistency while efficiently aggregating the diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of the proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at a real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at https://github.com/ujaejoon/ChronoTrack.

Keywords: 3D Single Object Tracking, LiDAR point clouds, long-term memory, temporal consistency, memory tokens, real-time tracking, ChronoTrack, state-of-the-art performance
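
The memory cycle consistency loss is described through memory-point-memory cyclic walks. A common way to formalise such a walk, used here as an assumption rather than the authors' exact objective, is to build soft transition matrices from memory tokens to point features and back, and penalise walks that do not return to the token they started from. The temperature value and all names below are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(row: List[float], temperature: float) -> List[float]:
    """Numerically stable softmax with a temperature."""
    peak = max(row)
    exps = [math.exp((v - peak) / temperature) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transition(src: List[Vector], dst: List[Vector], temperature: float) -> List[List[float]]:
    """Row-stochastic matrix of walk probabilities from each src to each dst."""
    return [
        softmax([sum(a * b for a, b in zip(s, d)) for d in dst], temperature)
        for s in src
    ]


def cycle_loss(memory: List[Vector], points: List[Vector], temperature: float = 0.1) -> float:
    """Mean negative log-probability of returning to the starting token."""
    m2p = transition(memory, points, temperature)   # memory -> point
    p2m = transition(points, memory, temperature)   # point -> memory
    total = 0.0
    for i in range(len(memory)):
        # Probability of the round trip i -> any point -> i.
        back = sum(m2p[i][n] * p2m[n][i] for n in range(len(points)))
        total -= math.log(back)
    return total / len(memory)
```

Distinct tokens that each own a share of the points give a loss near zero, while duplicated tokens cannot be told apart on the way back and are penalised, which is one reading of "diverse and discriminative".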

186. ❌ Failure Identification in Imitation Learning Via Statistical and Semantic Filtering

Authors: Quentin Rolland, Fabrice Mayran de Chamisso, Jean-Baptiste Mouret Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13788v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

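Scoring rationale: The paper studies failure detection in robot imitation learning and uses a Vision-Language Model (VLM) for semantic filtering, an application of AI in robotics. The keywords all relate directly to large-model principles, training methods, inference optimization, alignment, agent systems, and so on, but the paper only makes simple use of a VLM (semantic filtering) and does not examine large-model technology itself. Every keyword therefore has a relevance of 0 except "AI for Science" (scientific application in the broad sense), which is given 5.

!!! tip deepseek-chat TL;DR

This paper proposes FIDeL, a policy-independent failure detection module that identifies genuine failures in robot imitation learning through anomaly detection, optimal transport matching, and semantic filtering with a vision-language model. It substantially outperforms existing methods on the BotFails dataset.

Abstract (Chinese translation)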

机器人领域的模仿学习策略在受控环境中表现出色，但在实际部署中仍显脆弱：诸如硬件故障、零件缺陷、意外人为动作或任何超出训练分布的状态等罕见事件，均可能导致执行失败。基于视觉的异常检测方法已成为检测此类异常故障状态的适用方案，但无法区分故障与良性偏差。本文提出FIDeL（演示学习中的故障识别），一种独立于策略的故障检测模块。该方法借助前沿异常检测技术，构建示范数据的紧凑表征，并通过最优传输匹配对齐实时观测数据，以生成异常分数与热力图。我们通过扩展共形预测推导时空阈值，并利用视觉-语言模型进行语义过滤，以区分良性异常与真实故障。同时，我们引入BotFails——一个用于机器人故障检测的多模态真实任务数据集。实验表明，FIDeL在各项基准测试中持续优于现有方法，在BotFails数据集上相比现有技术实现了异常检测AUROC指标提升5.30%，故障检测准确率提升17.38%。

Abstract

Imitation learning (IL) policies in robotics deliver strong performance in controlled settings but remain brittle in real-world deployments: rare events such as hardware faults, defective parts, unexpected human actions, or any state that lies outside the training distribution can lead to failed executions. Vision-based Anomaly Detection (AD) methods have emerged as an appropriate solution to detect these anomalous failure states but do not distinguish failures from benign deviations. We introduce FIDeL (Failure Identification in Demonstration Learning), a policy-independent failure detection module. Leveraging recent AD methods, FIDeL builds a compact representation of nominal demonstrations and aligns incoming observations via optimal transport matching to produce anomaly scores and heatmaps. Spatio-temporal thresholds are derived with an extension of conformal prediction, and a Vision-Language Model (VLM) performs semantic filtering to discriminate benign anomalies from genuine failures. We also introduce BotFails, a multimodal dataset of real-world tasks for failure detection in robotics. FIDeL consistently outperforms state-of-the-art baselines, yielding +5.30% AUROC in anomaly detection and +17.38% failure-detection accuracy on BotFails compared to existing methods.

Keywords: Imitation Learning, Failure Detection, Anomaly Detection, Vision-Language Model, Optimal Transport, Conformal Prediction, Robotics, BotFails Dataset
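
FIDeL derives its thresholds with an extension of conformal prediction. The sketch below shows only the plain split-conformal baseline that such extensions start from: given anomaly scores from held-out nominal demonstrations, pick the threshold as a finite-sample-corrected quantile. The spatio-temporal extension in the paper is not reproduced, and the function names are hypothetical.

```python
import math
from typing import List


def conformal_threshold(calibration_scores: List[float], alpha: float) -> float:
    """Split-conformal threshold at miscoverage level alpha.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If that rank exceeds n, no finite threshold gives the guarantee,
    so infinity is returned.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def is_anomalous(score: float, threshold: float) -> bool:
    """Flag an observation whose anomaly score exceeds the threshold."""
    return score > threshold
```

Under exchangeability, a new nominal observation exceeds this threshold with probability at most alpha, which bounds the false-alarm rate before any semantic filtering is applied.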

187. ❌ Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation

Authors: Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

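Scoring rationale: The paper's core subject is the use of sparse mixture-of-experts (MoE) layers in CNN-based semantic segmentation, so it is highly relevant to the keyword "Mixture of Experts OR MoE OR Sparse Models" (10 points): the title and abstract focus explicitly on the design, behavior, and integration of sparse MoE layers in CNNs. All other keywords are unrelated (0 points), because the paper concentrates on CNN architectures and semantic segmentation in computer vision and does not touch on large language models, training techniques, reasoning methods, alignment, agents, compression, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

This paper studies the design and behavior of sparse mixture-of-experts (MoE) layers in CNN-based semantic segmentation. Experiments on the Cityscapes and BDD100K datasets show architecture-dependent gains (up to +3.9 mIoU) with little computational overhead, while also revealing strong design sensitivity.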

Abstract

Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.

Keywords: Sparse Mixture-of-Experts, MoE layers, Convolutional Neural Networks, Semantic Segmentation, CNN-based Dense Prediction, Routing Dynamics, Expert Specialization, Patch-wise Formulation
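
To make the patch-wise routing idea in the abstract more concrete, here is a minimal NumPy sketch of top-k routing over image patches. It is not the authors' implementation: the mean-pooled linear gate, the use of 1x1 convolutions (plain channel-mixing matrices) as experts, and every name and shape below are assumptions chosen to keep the example short.

```python
import numpy as np

def patchwise_moe(x, gate_w, expert_ws, patch=4, top_k=1):
    """Route each patch of x (C, H, W) to its top-k experts.

    gate_w: (E, C) linear gate applied to the mean-pooled patch.
    expert_ws: (E, C, C) channel-mixing matrices standing in for conv experts.
    """
    C, H, W = x.shape
    out = np.zeros_like(x)
    assignments = {}
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            region = x[:, i:i + patch, j:j + patch]           # (C, p, p)
            logits = gate_w @ region.mean(axis=(1, 2))        # (E,)
            chosen = np.argsort(logits)[-top_k:]              # top-k expert indices
            weights = np.exp(logits[chosen] - logits[chosen].max())
            weights /= weights.sum()                          # softmax over chosen experts
            mixed = np.zeros_like(region)
            for w, e in zip(weights, chosen):
                mixed += w * np.einsum("oc,chw->ohw", expert_ws[e], region)
            out[:, i:i + patch, j:j + patch] = mixed
            assignments[(i, j)] = chosen.tolist()
    return out, assignments

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))           # hypothetical feature map
gate_w = rng.normal(size=(4, 8))           # 4 experts, 8 channels
expert_ws = rng.normal(size=(4, 8, 8))
out, routes = patchwise_moe(x, gate_w, expert_ws, patch=4, top_k=1)
```

Only `top_k` of the experts run per patch, which is why capacity can grow with the number of experts while per-patch compute stays roughly constant.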

188. ❌ ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

Authors: Jie Liang, Jiahao Wu, Chao Wang, Jiayu Yang, Xiaoyun Zheng, Kaiqiang Xiong, Zhanke Wang, Jinbo Yan, Feng Gao, Ronggang Wang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13746v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

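Scoring rationale: This paper addresses dynamic 3D scene reconstruction in computer vision and proposes ClipGStream, a hybrid Gaussian-splatting-based reconstruction framework for long multi-view sequences with large-scale motion. It covers computer vision techniques such as 3D reconstruction, Gaussian splatting, spatio-temporal modeling, and memory optimization, but has nothing to do with the keyword areas of large language models, the technical principles of deep learning, or AI for Science. All keywords concern large models, deep learning techniques, or scientific applications of AI, whereas this is purely a 3D reconstruction study, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper proposes the ClipGStream framework, which uses a hybrid Clip-Stream design and spatio-temporal field modeling to address scalability, temporal stability, and memory efficiency in reconstructing long dynamic multi-view videos, achieving high-quality, flicker-free reconstruction.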

Abstract

Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency. The project page is available at: https://liangjie1999.github.io/ClipGStreamWeb/

Keywords: dynamic 3D scene reconstruction, Gaussian splatting, multi-view sequences, temporal stability, memory efficiency, clip-stream optimization, spatio-temporal fields, anchor compensation

189. ❌ ReConText3D: Replay-based Continual Text-to-3D Generation

Authors: Muhammad Ahmed Ullah Khan, Muhammad Haris Bin Amir, Didier Stricker, Muhammad Zeshan Afzal Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13730v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

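Scoring rationale: The paper focuses on continual learning for text-to-3D generation and is unrelated to most keywords. It is only partially related to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5 points): it involves continual learning, which is treated here as a form of domain adaptation, but it does not explicitly mention pre-training. The other keywords mainly target large language models (LLMs) and related techniques (fine-tuning, reasoning, alignment, agents, and so on), whereas this work studies text-to-3D generative models, which belong to computer vision and graphics, and does not involve LLMs or related techniques.

!!! tip deepseek-chat TL;DR

This paper presents ReConText3D, the first continual learning framework for text-to-3D generation. It addresses catastrophic forgetting in existing models under incremental training and, using a replay method based on text embeddings, achieves high-quality generation for both old and new classes on the Toys4K-CL benchmark.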

Abstract

Continual learning enables models to acquire new knowledge over time while retaining previously learned capabilities. However, its application to text-to-3D generation remains unexplored. We present ReConText3D, the first framework for continual text-to-3D generation. We first demonstrate that existing text-to-3D models suffer from catastrophic forgetting under incremental training. ReConText3D enables generative models to incrementally learn new 3D categories from textual descriptions while preserving the ability to synthesize previously seen assets. Our method constructs a compact and diverse replay memory through text-embedding k-Center selection, allowing representative rehearsal of prior knowledge without modifying the underlying architecture. To systematically evaluate continual text-to-3D learning, we introduce Toys4K-CL, a benchmark derived from the Toys4K dataset that provides balanced and semantically diverse class-incremental splits. Extensive experiments on the Toys4K-CL benchmark show that ReConText3D consistently outperforms all baselines across different generative backbones, maintaining high-quality generation for both old and new classes. To the best of our knowledge, this work establishes the first continual learning framework and benchmark for text-to-3D generation, opening a new direction for incremental 3D generative modeling. Project page is available at: https://mauk95.github.io/ReConText3D/.

Keywords: Continual Learning, Text-to-3D Generation, Catastrophic Forgetting, Replay Memory, Incremental Training, 3D Generative Modeling, Benchmark, Toys4K-CL
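
The abstract describes building the replay memory through k-Center selection over text embeddings. The sketch below shows the classic greedy (farthest-first) k-Center heuristic on a toy set of 2-D points. Whether ReConText3D uses exactly this greedy variant, Euclidean distance, or this initialization is not stated in the abstract, so treat those choices, along with the function name and the toy data, as assumptions.

```python
import numpy as np

def k_center_greedy(embeddings, k, first=0):
    """Greedy k-Center: repeatedly add the point farthest from the selected set."""
    emb = np.asarray(embeddings, dtype=float)
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(emb - emb[first], axis=1)
    while len(selected) < min(k, len(emb)):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

# Toy stand-ins for text embeddings: two tight clusters and one outlier.
points = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.0, 0.1], [5.0, 5.0]]
memory = k_center_greedy(points, k=3)   # indices to keep in the replay memory
```

With this data the selection picks one point from each cluster plus the outlier, which is the kind of compact but diverse coverage a replay memory is meant to provide.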

190. ❌ Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests

Authors: Pankaj Deoli, Atef Tej, Anmol Ashri, Anandatirtha JS, Karsten Berns Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

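Scoring rationale: This paper focuses on synthetic-to-real transfer learning in computer vision, specifically tree instance segmentation in forest scenes, using techniques such as knowledge distillation. Its content is entirely unrelated to the vast majority of keywords (which concern the technical principles of large models, training methods, inference optimization, agent systems, and so on). The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the study applies AI to environmental science and forestry; however, the paper does not emphasize bioinformatics or cheminformatics, so it is given only 5 points (partial relevance).

!!! tip deepseek-chat TL;DR

This paper studies how, when only coarse-grained real labels and fine-grained synthetic labels are available, granularity-aware knowledge distillation can achieve effective synthetic-to-real transfer for tree instance segmentation in forest scenes; experiments show improved mask AP.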

Abstract

We address the challenge of synthetic-to-real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine-grained trunk/crown annotations. We introduce MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity-aware distillation, which transfers structural priors from fine-grained synthetic teachers to a coarse-label student via logit-space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim-Real transfer under label granularity constraints.

Keywords: synthetic-to-real transfer, tree instance segmentation, granularity-aware distillation, mixed-granularity dataset, forestry perception, domain shift, mask AP, Sim-Real transfer

191. ❌ SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

Authors: Haoran Lou, Ziyan Liu, Chunxiao Fan, Yuexin Wu, Yue Ming Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13710v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

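Scoring rationale: The paper's core subject is adapting multimodal large language models (MLLMs) for retrieval. It proposes the SLQ framework, which adapts a frozen MLLM into a retriever through shared latent queries. Highly relevant keywords are: 1) "Large Language Models" (the paper focuses on MLLMs, an extension of LLMs; weight 1.0, relevance 10); 2) "PEFT" (the paper explicitly compares against LoRA and proposes a non-invasive alternative to such parameter updates; weight 1.0, relevance 10); 3) "Retrieval-Augmented Generation" (the paper's topic is retrieval adaptation, which involves retrieval-generation techniques; weight 1.0, relevance 10). Other keywords such as MoE, SLMs, and Scaling Laws have no direct connection to the paper and receive a relevance of 0.

!!! tip deepseek-chat TL;DR

This paper studies how to adapt multimodal large language models (MLLMs) into retrievers efficiently. It proposes the SLQ framework, which performs cross-modal retrieval through shared latent queries while keeping the model frozen, and outperforms full fine-tuning and LoRA on several benchmarks.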

Abstract

Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model’s native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.

Keywords: Multimodal Large Language Models, Retrieval Adaptation, Frozen MLLMs, Shared Latent Queries, Parameter-efficient Fine-tuning, Knowledge-aware Reasoning, Cross-modal Retrieval, Benchmark Evaluation

192. ❌ From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage

Authors: Cihan Ruan, Lebin Zhou, Bingqing Zhao, Rongduo Han, Qiming Yuan, Chenchen Zhu, Linyi Han, Liang Yang, Wei Wang, Wei Jiang, Nam Ling Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13667v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

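Scoring rationale: This paper studies the joint optimization of video compression and DNA storage, an application of AI in science (specifically bioinformatics and storage technology), so it is only partially related to the keyword "AI for Science OR Bioinformatics OR Cheminformatics" (5 points). It does not cover core topics such as large models, the technical principles of deep learning, training methods, inference optimization, alignment, or agent systems, and is entirely unrelated to the other keywords (0 points).

!!! tip deepseek-chat TL;DR

This paper tackles the separation of compression and encoding in DNA video storage and proposes HELIX, the first end-to-end neural network for the task. By aligning token representations with DNA's quaternary alphabet, it achieves 1.91 bits per nucleotide.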

Abstract

DNA-based storage has emerged as a promising approach to the global data crisis, offering molecular-scale density and millennial-scale stability at low maintenance cost. Over the past decade, substantial progress has been made in storing text, images, and files in DNA – yet video remains an open challenge. The difficulty is not merely technical: effective video DNA storage requires co-designing compression and molecular encoding from the ground up, a challenge that sits at the intersection of two fields that have largely evolved independently. In this work, we present HELIX, the first end-to-end neural network jointly optimizing video compression and DNA encoding – prior approaches treat the two stages independently, leaving biochemical constraints and compression objectives fundamentally misaligned. Our key insight: token-based representations naturally align with DNA’s quaternary alphabet – discrete semantic units map directly to ATCG bases. We introduce TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding), which achieves 1.91 bits per nucleotide through Kronecker-structured mixing that breaks spatial correlations and FSM-based mapping that guarantees biochemical constraints. Unlike two-stage approaches, HELIX learns token distributions simultaneously optimized for visual quality, prediction under masking, and DNA synthesis efficiency. This work demonstrates for the first time that learned compression and molecular storage converge naturally at token representations – suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.

Keywords: DNA storage, video compression, neural network, end-to-end optimization, token-based representation, molecular encoding, HELIX, TK-SCONE
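
For a sense of what a constraint-enforcing finite-state mapping onto ATCG bases can look like, the sketch below implements a well-known simple scheme: each ternary digit selects one of the three bases that differ from the previous base, so no base ever repeats. This is not TK-SCONE's FSM, whose details the abstract does not give; the function names, the start state, and the input digits are assumptions for illustration.

```python
import math

BASES = "ACGT"

def encode_trits(trits, prev="A"):
    """Map digits in {0, 1, 2} to bases so that no base equals its predecessor."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]   # the three legal next states
        prev = choices[t]
        out.append(prev)
    return "".join(out)

def decode_bases(seq, prev="A"):
    """Invert encode_trits by replaying the same state transitions."""
    trits = []
    for b in seq:
        choices = [c for c in BASES if c != prev]
        trits.append(choices.index(b))
        prev = b
    return trits

trits = [0, 2, 1, 1, 0, 2, 2]          # made-up payload
strand = encode_trits(trits)
density = math.log2(3)                 # about 1.585 bits per nucleotide
upper_bound = math.log2(len(BASES))    # 2 bits per nucleotide, unconstrained
```

The no-repeat rule caps density at log2(3) bits per nucleotide, against an unconstrained ceiling of 2. The 1.91 bits per nucleotide reported in the abstract sits close to that ceiling, so the paper's mapping presumably enforces less restrictive constraints than this toy rule.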

193. ❌ Automatic Charge State Tuning of 300 mm FDSOI Quantum Dots Using Neural Network Segmentation of Charge Stability Diagram

Authors: Peter Samaha, Amine Torki, Ysaline Renaud, Sam Fiette, Emmanuel Chanrion, Pierre-Andre Mortemousque, Yann Beilliard Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13662v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

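Scoring rationale: This paper uses a convolutional neural network (CNN) for semantic segmentation of quantum-dot charge stability diagrams in order to automate charge tuning. At its core it applies deep learning to experimental quantum-computing physics, specifically a U-Net-style CNN with a MobileNetV2 encoder. All keywords relate to large language models (LLMs) or general large-model techniques, whereas this work uses a conventional CNN for image segmentation and does not involve LLMs, MoE, scaling laws, pre-training, alignment, inference acceleration, agents, or other large-model techniques. The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics": the paper applies deep learning to a scientific field (quantum computing) and so falls within AI for Science, but this is not its core content, so it is given 5 points (partial relevance). All other keywords are entirely unrelated and receive 0 points.

!!! tip deepseek-chat TL;DR

This study addresses the automation bottleneck in charge tuning of silicon quantum dots. Using a deep-learning-driven semantic segmentation pipeline to analyze charge stability diagrams, it reaches an 80% offline tuning success rate, a practical step toward high-throughput automated qubit tuning.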

Abstract

Tuning of gate-defined semiconductor quantum dots (QDs) is a major bottleneck for scaling spin qubit technologies. We present a deep learning (DL) driven, semantic-segmentation pipeline that performs charge auto-tuning by locating transition lines in full charge stability diagrams (CSDs) and returns gate voltage targets for the single charge regime. We assemble and manually annotate a large, heterogeneous dataset of 1015 experimental CSDs measured from silicon QD devices, spanning nine design geometries, multiple wafers, and fabrication runs. A U-Net style convolutional neural network (CNN) with a MobileNetV2 encoder is trained and validated through five-fold group cross validation. Our model achieves an overall offline tuning success of 80.0% in locating the single-charge regime, with peak performance exceeding 88% for some designs. We analyze dominant failure modes and propose targeted mitigations. Finally, wide-range diagram segmentation also naturally enables scalable physics-based feature extraction that can feed back to fabrication and design workflows, and we outline a roadmap for real-time integration in a cryogenic wafer prober. Overall, our results show that neural network (NN) based wide-diagram segmentation is a practical step toward automated, high-throughput charge tuning for silicon QD qubits.

Keywords: quantum dots, charge stability diagrams, semantic segmentation, deep learning, convolutional neural network, automated tuning, silicon qubits, U-Net

194. ❌ VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

Authors: Hui Han, Shunli Wang, Yandan Zhao, Taiping Yao, Shouhong Ding Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

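Scoring rationale: The paper's core subject is the use of MLLMs (multimodal large language models) for deepfake detection, an innovative application of large models in a scientific field (AI for Science). Highly relevant keywords are: 1) "Large Language Models" (the paper uses an MLLM); 2) "Retrieval-Augmented Generation" (RAG is the core method); 3) "Chain of Thought" (it constructs the Forensic Chain-of-Thought dataset); 4) "Post-training" and "Instruction Tuning" (it adopts a three-stage training method: Alignment->SFT->GRPO). Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are not covered in the paper and are therefore scored 0.

!!! tip deepseek-chat TL;DR

To address MLLMs' lack of specialized forgery knowledge in deepfake detection, this paper proposes the VRAG-DFD framework, which combines retrieval-augmented generation (RAG) with reinforcement learning to provide dynamic forgery knowledge retrieval and critical reasoning, and achieves SOTA performance on DFD generalization tests.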

摘要 (Abstract)

In Deepfake Detection (DFD) tasks, researchers proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge injection. The lack of professional forgery knowledge hinders the performance of these DFD-MLLMs. To solve this, we deeply considered two insightful issues: How to provide high-quality associated forgery knowledge for MLLMs? And how to endow MLLMs with critical reasoning abilities given noisy reference information? Notably, we attempted to address the above two questions with preliminary answers by leveraging the combination of Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL). Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning capabilities. Specifically, in terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT) for critical CoT construction. In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the MLLM. In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.

关键词: Deepfake Detection, MLLM, Retrieval-Augmented Generation, Chain of Thought, Supervised Fine-tuning, Alignment, AI for Science, Forensic Knowledge
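
The abstract describes dynamic forgery knowledge retrieval but does not say how the Forensic Knowledge Database is queried. As a rough illustration only, the sketch below ranks knowledge entries by cosine similarity to a query embedding. The entry texts, the two-dimensional embeddings, and the names `cosine` and `retrieve_top_k` are all hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, knowledge_base, k=2):
    """Return the notes of the k entries most similar to the query embedding."""
    ranked = sorted(
        knowledge_base,
        key=lambda entry: cosine(query_vec, entry["embedding"]),
        reverse=True,
    )
    return [entry["note"] for entry in ranked[:k]]


# Hypothetical knowledge entries with toy 2-D embeddings (assumed, not from the paper).
knowledge_base = [
    {"note": "blending boundary artifacts", "embedding": [1.0, 0.0]},
    {"note": "inconsistent eye reflections", "embedding": [0.0, 1.0]},
    {"note": "texture over-smoothing", "embedding": [1.0, 1.0]},
]

top_notes = retrieve_top_k([1.0, 0.0], knowledge_base, k=2)
```

In a pipeline of this kind the retrieved notes would be placed in the MLLM prompt as reference information, which may be noisy; that is why the abstract pairs retrieval with training for critical reasoning.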

195. ❌ ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation

作者: Jingjing Qian, Zeyuan He, Chen Shi, Lei Xiao, Li Jiang 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13633v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models	0.0	0.0/10	0.0
Mixture of Experts	0.0	0.0/10	0.0
Small Language Models	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

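Scoring rationale: The paper addresses mobile manipulation in embodied AI and proposes a system named ESCAPE for long-horizon tasks in complex indoor environments. Its core techniques are Spatio-Temporal Fusion Mapping, Memory-Driven Target Grounding, and an Adaptive Execution Policy. This content mainly concerns robotics, computer vision, reinforcement learning, and task planning. The paper does not mention or use any large language model (LLM), any innovation in deep learning principles (such as MoE, Scaling Laws, training/fine-tuning/alignment techniques, inference optimization, or model compression), or any concrete AI for Science application (such as bioinformatics). All keywords are therefore unrelated to the paper and score 0.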

!!! tip deepseek-chat TL;DR

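The paper proposes ESCAPE, an embodied AI system that uses spatio-temporal fusion mapping and an adaptive execution policy to address catastrophic forgetting and spatial inconsistency in long-horizon mobile manipulation tasks, achieving state-of-the-art performance on the ALFRED benchmark.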

摘要翻译

在复杂室内环境中实现具有鲁棒性能的导航与操作协调，对于具身人工智能至关重要。然而，当任务时间跨度较长时，现有方法常因灾难性遗忘、空间不一致性及执行僵化等问题而难以应对。为解决这些挑战，我们提出了ESCAPE（结合自适应执行策略的情景式空间记忆系统），其通过一个紧密耦合的感知-定位-执行工作流程运行。为实现鲁棒感知，ESCAPE配备了一个时空融合建图模块，能够以自回归方式构建无需深度信息的持久性3D空间记忆，同时结合一个记忆驱动的目标定位模块以生成精确的交互掩码。为实现灵活行动，我们的自适应执行策略动态协调主动的全局导航与反应式的局部操作，以捕捉时机敏感的目标。在ALFRED基准测试中，ESCAPE取得了最先进的性能，在逐步指令引导下，于测试可见环境与不可见环境中分别达到65.09%和60.79%的成功率。通过减少冗余探索，我们的方法在路径长度加权指标上获得显著提升，并在长跨度任务中即使缺乏详细指导仍保持鲁棒性能（61.24% / 56.04%）。

摘要 (Abstract)

Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robust performance (61.24% / 56.04%) even without detailed guidance for long-horizon tasks.

关键词: Embodied AI, Mobile Manipulation, Long-Horizon Tasks, Spatio-Temporal Fusion Mapping, Adaptive Execution Policy, ALFRED Benchmark, Spatial Memory, Navigation and Manipulation Coordination
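
The abstract reports gains in path-length-weighted metrics without stating the formula. A minimal sketch, assuming the common ALFRED-style definition in which a success score s is scaled by L* / max(L*, L̂), where L* is the expert path length and L̂ is the agent's path length. The episode values below are hypothetical.

```python
def path_length_weighted(success: float, expert_len: int, agent_len: int) -> float:
    """Scale a success score by path efficiency: s * L* / max(L*, L_hat)."""
    if expert_len <= 0 or agent_len <= 0:
        raise ValueError("path lengths must be positive")
    return success * expert_len / max(expert_len, agent_len)


# Hypothetical episodes: (success, expert path length, agent path length).
episodes = [(1.0, 40, 40), (1.0, 40, 80), (0.0, 30, 25)]
scores = [path_length_weighted(s, e, a) for s, e, a in episodes]
mean_plw = sum(scores) / len(scores)
```

Under this definition, an agent that succeeds but takes twice the expert's path length earns half credit, so reducing redundant exploration raises the weighted score even when the raw success rate is unchanged.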

196. ❌ What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering

作者: Amir Hossein Saleknia, Mohammad Sabokrou 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13610v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

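Scoring rationale: The paper studies how dataset bias is evaluated in computer vision, focusing on flaws in classification-based bias quantification for large-scale natural image collections. Its core contributions are: 1) showing that conventional supervised classification overestimates semantic differences between datasets because of non-semantic cues such as resolution artifacts; 2) proposing a new evaluation framework based on unsupervised clustering to measure true semantic separability. All scoring keywords relate directly to large models, deep learning principles, or AI applications in science, whereas this paper focuses on dataset evaluation methodology in computer vision and involves none of these, so every keyword receives a relevance of 0.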

!!! tip deepseek-chat TL;DR

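The paper shows that, when evaluating bias in large-scale natural image datasets, conventional supervised classification severely overestimates semantic differences because of non-semantic cues such as resolution artifacts, and it proposes an unsupervised clustering framework that measures true semantic separability more accurately.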

摘要翻译

在计算机视觉领域，量化数据集偏差的一种主流方法是训练模型以区分不同数据集。高分类准确率随后被解释为存在有意义语义差异的证据。该方法假设标准图像增强技术能有效抑制低层次的非语义线索，因此任何残留的性能必然反映真实的语义差异。我们证明，这一基本假设在大规模自然图像集合的领域中存在缺陷。高分类准确率通常由基于分辨率的伪影驱动，这些伪影源于原始图像分辨率分布及调整大小时产生的插值效应所形成的结构性指纹。尽管经过常规的图像破坏处理，这些伪影仍能形成鲁棒的、数据集特定的特征模式。通过受控实验，我们证明模型即使在非语义的程序生成图像上也能实现强大的数据集分类，这证实了模型对表层线索的依赖。为解决此问题，我们重新审视了数据集可分离性这一已有数十年历史的概念，但并非采用监督分类方法。相反，我们引入了一种无监督方法来衡量真实的语义可分离性。我们的框架通过聚类来自基础视觉模型的富含语义的特征，直接评估语义相似性，刻意避免对数据集标签进行监督分类。当应用于本研究主要关注的大型网络规模数据集时，监督方法所报告的高可分离性基本消失，聚类准确率降至接近随机水平。这表明传统的基于分类的评估方法系统性地、且以压倒性幅度夸大了语义偏差。

摘要 (Abstract)

In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.

关键词: dataset bias, natural image collections, unsupervised semantic clustering, resolution artifacts, semantic separability, supervised classification, foundational vision models, web-scale datasets

197. ❌ VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

作者: Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13596v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

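Scoring rationale: The paper focuses on cross-view instance segmentation in computer vision and proposes VGGT-Segmentor, a framework that combines geometric modeling with semantic segmentation. Although the research background mentions interest in large models and deep learning applications in science, the paper does not involve any large language model (LLM) technique. All scoring keywords concern LLM-related topics such as LLM techniques, training methods, inference optimization, alignment, and agent systems, whereas this paper studies a pure computer vision problem using visual geometry and segmentation techniques and has no overlap with the LLM field. Every keyword therefore receives a relevance of 0.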

!!! tip deepseek-chat TL;DR

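The paper tackles the difficulty of pixel-level matching in cross-view (egocentric and exocentric) instance segmentation caused by changes in scale, perspective, and occlusion. It proposes the VGGT-Segmentor framework, which uses geometric enhancement and a Union Segmentation Head to achieve state-of-the-art performance, significantly outperforming prior methods on the Ego-Exo4D benchmark.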

摘要翻译

跨不同第一人称与第三人称视角的实例级物体分割是视觉理解中的基础性挑战，对具身人工智能和远程协作应用至关重要。由于尺度、视角和遮挡的剧烈变化会破坏直接的像素级匹配，该任务异常困难。尽管近期如VGGT等几何感知模型为特征对齐提供了坚实基础，但我们发现即使其内部物体级注意力保持稳定，显著的像素级投影漂移仍常导致其在密集预测任务中失效。为弥合这一差距，我们提出了VGGT-Segmentor（VGGT-S）框架，将鲁棒的几何建模与像素级精确的语义分割相统一。VGGT-S利用VGGT强大的跨视角特征表示能力，并引入了一种新颖的联合分割头。该分割头通过三个阶段运作：掩码提示融合、点引导预测和迭代掩码优化，从而将高层次的特征对齐有效转化为精确的分割掩码。此外，我们提出了一种单图像自监督训练策略，无需配对标注即可实现强大的泛化能力。在Ego-Exo4D基准测试中，VGGT-S取得了新的最优性能，在第一人称到第三人称及第三人称到第一人称任务中分别实现了67.7%和68.0%的平均交并比，显著超越了现有方法。值得注意的是，我们无需对应关系预训练的模型超越了大多数全监督基线，证明了该方法的有效性和可扩展性。

摘要 (Abstract)

Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT’s powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.

关键词: cross-view segmentation, instance segmentation, geometry-aware models, VGGT, self-supervised training, Ego-Exo4D benchmark, feature alignment, mask refinement
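
The results above are reported as average IoU. For reference, the sketch below computes the standard intersection-over-union between two binary masks and averages it over instances. The masks are hypothetical toy data, and the benchmark's exact averaging protocol is not specified in the abstract.

```python
def mask_iou(pred, target):
    """Intersection-over-union of two equally sized binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat that case as IoU = 1.0.
    return inter / union if union else 1.0


# Hypothetical 2x3 masks for two object instances.
pairs = [
    ([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]),  # IoU = 2 / 4
    ([[0, 1, 1], [0, 0, 0]], [[0, 1, 1], [0, 0, 0]]),  # IoU = 2 / 2
]
ious = [mask_iou(p, t) for p, t in pairs]
average_iou = sum(ious) / len(ious)
```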

198. ❌ Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis

作者: Yuchao Chen, Hanqing Wang 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13589v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

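Scoring rationale: The paper "Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis" focuses on computer vision and 3D reconstruction, specifically multi-view smoke removal and novel view synthesis. It uses generative dehazing (Nano Banana Pro) and 3D Gaussian Splatting (3DGS), combined with physics-informed auxiliary losses (such as depth supervision and dark channel prior regularization). The paper's core is image processing, 3D reconstruction, and dehazing; it does not involve large models, deep learning principles, AI for Science, or other topics in the scoring keywords. All keywords relate to large models, deep learning techniques, or AI applications in science, whereas this paper is a pure computer vision application, so every keyword receives a relevance of 0.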

!!! tip deepseek-chat TL;DR

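The paper proposes a two-stage pipeline (Dehaze-then-Splat) for multi-view smoke removal and novel view synthesis. It combines generative dehazing with physics-informed 3D Gaussian Splatting to mitigate the cross-view inconsistency that per-frame dehazing introduces into 3D reconstruction, reaching 20.98 dB PSNR and 0.683 SSIM on the Akikaze validation scene, a 1.50 dB PSNR gain over the unregularized baseline.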

摘要翻译

本文提出Dehaze-then-Splat，一种用于多视角烟雾去除与新视角合成的两阶段流程，专为NTIRE 2026三维修复与重建挑战赛第二赛道设计。第一阶段，我们通过基于Nano Banana Pro的单帧生成式去雾方法生成伪清晰训练图像，并进行亮度归一化处理。第二阶段，我们采用融合物理先验的辅助损失函数训练三维高斯泼溅（3DGS）——包括通过伪深度图进行皮尔逊相关性约束的深度监督、暗通道先验正则化以及双源梯度匹配——这些损失函数能够补偿逐帧生成处理中固有的跨视角不一致性问题。我们揭示了“先去雾后重建”流程中的根本矛盾：单幅图像恢复质量无法保证多视角一致性，这种不一致性会表现为下游三维重建中的渲染模糊与结构失稳。分析表明，基于MCMC的提前终止点云致密化策略，结合深度与烟雾抑制先验，能有效缓解这些伪影。在Akikaze验证场景中，本流程在新视角合成任务上取得了20.98分贝峰值信噪比（PSNR）与0.683结构相似性指数（SSIM），较未正则化基线提升1.50分贝。

摘要 (Abstract)

We present Dehaze-then-Splat, a two-stage pipeline for multi-view smoke removal and novel view synthesis developed for Track 2 of the NTIRE 2026 3D Restoration and Reconstruction Challenge. In the first stage, we produce pseudo-clean training images via per-frame generative dehazing using Nano Banana Pro, followed by brightness normalization. In the second stage, we train 3D Gaussian Splatting (3DGS) with physics-informed auxiliary losses – depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching – that compensate for cross-view inconsistencies inherent in frame-wise generative processing. We identify a fundamental tension in dehaze-then-reconstruct pipelines: per-image restoration quality does not guarantee multi-view consistency, and such inconsistency manifests as blurred renders and structural instability in downstream 3D reconstruction. Our analysis shows that MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates these artifacts. On the Akikaze validation scene, our pipeline achieves 20.98 dB PSNR and 0.683 SSIM for novel view synthesis, a +1.50 dB improvement over the unregularized baseline.

关键词: generative dehazing, 3D Gaussian Splatting, novel view synthesis, smoke removal, multi-view consistency, physics-informed losses, depth supervision, MCMC-based densification
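
To put the reported PSNR figures in context, the sketch below implements the standard PSNR definition, 10 · log10(MAX² / MSE), and converts a dB gain into the implied relative reduction in mean squared error. The pixel values are hypothetical, and images are flattened to plain lists to keep the example dependency-free.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_reduction_from_gain(gain_db):
    """Fraction by which MSE shrinks for a given PSNR gain (same peak value)."""
    return 1.0 - 10.0 ** (-gain_db / 10.0)


# Hypothetical 4-pixel "images" with intensities in [0, 1].
toy_psnr = psnr([0.0, 0.5, 1.0, 0.5], [0.1, 0.5, 0.9, 0.5])

# Figures from the abstract: 20.98 dB after a +1.50 dB improvement.
baseline_psnr = 20.98 - 1.50
mse_reduction = mse_reduction_from_gain(1.50)
```

By this arithmetic the unregularized baseline sits at about 19.48 dB, and a 1.50 dB gain corresponds to roughly 29% lower mean squared error.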

199. ❌ Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

作者: Danish Nazir, Antoine Hanna-Asaad, Lucas Görnhardt, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13586v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

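Scoring rationale: The paper focuses on efficient multi-view 3D object detection in computer vision, with a Vision Transformer (ViT) backbone. Its core innovations are: 1) a dynamic layer-wise token selection mechanism to accelerate inference; 2) a parameter-efficient fine-tuning strategy that reduces the number of fine-tuned parameters from more than 300 million to 1.6 million. These techniques are highly relevant to the keyword 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (relevance 10), because the paper explicitly proposes a parameter-efficient fine-tuning method and sharply reduces the number of trained parameters. The paper does not, however, involve large language models (LLMs), natural language processing, reasoning, alignment, scientific AI applications, or the other keyword areas, so all other keywords score 0.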

!!! tip deepseek-chat TL;DR

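Targeting the computational complexity of Vision Transformers in multi-view 3D object detection, the paper proposes dynamic token selection and parameter-efficient fine-tuning. On the NuScenes dataset it lowers computational complexity by 48% to 55% and inference latency by 9% to 25%, while improving mean average precision by 1.0% to 2.8% absolute.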

摘要翻译

现有多视图三维物体检测方法普遍采用基于大规模预训练视觉Transformer的基础模型作为骨干网络，计算复杂度较高。针对该问题，当前最高效的基于多视图ViT的三维物体检测方法ToC3D采用了基于自运动的相关令牌选择机制。然而，该方法存在两个关键局限：（1）固定的逐层令牌选择比例限制了训练和推理过程中的计算效率；（2）该方法需要对ViT骨干网络进行完整的端到端重新训练。本研究提出一种结合令牌选择机制的图像令牌补偿器，用于加速基于ViT骨干网络的多视图三维物体检测。与ToC3D不同，我们的方法能够在ViT骨干网络内部实现动态的逐层令牌选择。此外，我们引入了一种参数高效的微调策略，仅训练所提出的模块，从而将微调参数量从超过3亿个减少至仅160万个。在NuScenes大规模数据集上对三种多视图三维物体检测方法的实验表明：与当前最优的ToC3D方法相比，我们提出的方法将计算复杂度（GFLOPs）降低了48%至55%，在NVIDIA-GV100 GPU上的推理延迟降低了9%至25%，同时将平均精度均值绝对值提升1.0%至2.8%，并将NuScenes检测分数绝对值提升0.4%至1.2%。

摘要 (Abstract)

Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, which are computationally complex. To address this problem, current state-of-the-art (SOTA) ToC3D for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, there are two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike ToC3D, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than 300 million (M) to only 1.6 M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by 48% to 55% and inference latency (on an NVIDIA-GV100 GPU) by 9% to 25%, while still improving mean average precision by 1.0% to 2.8% absolute and NuScenes detection score by 0.4% to 1.2% absolute compared to so-far SOTA ToC3D.

关键词: Multi-view 3D object detection, Vision Transformer, Dynamic token selection, Parameter-efficient fine-tuning, Computational efficiency, NuScenes dataset, Inference acceleration, Token compensator
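
The fine-tuning claim is easier to judge as a ratio. The sketch below works through the arithmetic using only the two parameter counts from the abstract. Because the abstract says "more than 300 million", the values computed here are bounds: the true trainable fraction is at most this value and the true reduction factor is at least this value.

```python
def trainable_fraction(trainable_params: float, full_params: float) -> float:
    """Share of parameters that are updated during fine-tuning."""
    if full_params <= 0:
        raise ValueError("full_params must be positive")
    return trainable_params / full_params


FULL_FINE_TUNING_PARAMS = 300e6  # lower bound stated in the abstract ("more than 300 M")
PROPOSED_TRAINABLE_PARAMS = 1.6e6  # stated in the abstract

fraction = trainable_fraction(PROPOSED_TRAINABLE_PARAMS, FULL_FINE_TUNING_PARAMS)
reduction_factor = FULL_FINE_TUNING_PARAMS / PROPOSED_TRAINABLE_PARAMS
```

So the proposed strategy updates at most about 0.53% of the parameters that full retraining would touch, a reduction of at least roughly 187x.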

200. ❌ SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

作者: Qi Xia, Peishan Cong, Ziyi Wang, Yujing Sun, Qin Sun, Xinge Zhu, Mao Ye, Ruigang Yang, Yuexin Ma 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13581v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

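Scoring rationale: The paper "SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance" focuses on computer vision and graphics, specifically reconstructing 3D human interaction behaviors from monocular videos. Although it uses a diffusion-based framework and high-level interaction descriptions generated by a vision-language model (VLM), its core content concerns deep learning applied to vision tasks rather than large model (LLM) principles, training methods, inference optimization, alignment, agent systems, or scientific AI applications. All keywords concern large-model techniques or applications and have no direct connection to the paper's topic, so every keyword receives a relevance of 0.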

!!! tip deepseek-chat TL;DR

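The paper proposes SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to address the challenge of reconstructing 3D human behavior in close-interaction scenarios from monocular videos, achieving state-of-the-art performance and showing good generalization.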

摘要翻译

在近距离交互场景中精确重建人体行为，对于实现增强现实中逼真的虚拟交互、体育领域的精准运动分析以及人机协作任务中的自然协作行为至关重要。在这些场景中进行可靠的重建，能显著提升人工智能驱动交互应用的真实感与有效性。然而，基于单目视频进行近距离交互场景下的人体重建仍面临严峻挑战，主要源于严重的相互遮挡所导致的局部运动模糊性、时序连续性破坏以及空间关系误差。本文提出SocialMirror，一个基于扩散模型的框架，通过整合语义与几何线索来有效应对上述问题。具体而言，我们首先利用视觉语言模型生成的高层交互描述来指导一个语义引导的运动补全模块，以推测被遮挡的身体部位并解决局部姿态模糊性问题。接着，我们提出一个序列级时序优化器，用于强制生成平滑、无抖动的运动序列，同时在采样过程中融入几何约束，以确保合理的接触关系与空间布局。在多个交互基准测试上的评估表明，SocialMirror在重建交互人体网格方面达到了最先进的性能，并在未见过的数据集和真实场景中展现出强大的泛化能力。相关代码将在论文发表后开源。

摘要 (Abstract)

Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, leading to local motion ambiguity, disrupted temporal continuity, and spatial relationship errors. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.

关键词: 3D human reconstruction, monocular video, human interaction, diffusion model, semantic guidance, geometric constraints, occlusion handling, motion analysis

201. ❌ Radar-Informed 3D Multi-Object Tracking under Adverse Conditions

作者: Bingxue Xu, Emil Hedemalm, Ajinkya Khoche, Patric Jensfelt 期刊/来源: arxiv 发布日期: 2026-04-15 arXiv链接: http://arxiv.org/abs/2604.13571v1

评分: 0.0 / 26.6 ❌

评分详情

关键词	权重	相关度	得分
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

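Scoring rationale: The paper studies radar sensor fusion for 3D multi-object tracking (3D MOT), which belongs to computer vision and autonomous driving. All scoring keywords focus on large language models (LLMs) and related techniques (such as training methods, inference optimization, alignment, and agent systems), whereas this paper does not involve language models, deep learning model training, or large-model techniques. Its core is radar point cloud processing and sensor fusion algorithms, which have no direct connection to any topic in the scoring keyword list.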

!!! tip deepseek-chat TL;DR

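The paper proposes RadarMOT, a radar-informed 3D multi-object tracking framework that explicitly uses radar point cloud data to refine state estimation and recover detector misses at long range, significantly improving tracking accuracy at long range and in adverse weather on the MAN-TruckScenes dataset.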

摘要翻译

三维多目标跟踪（3D MOT）面临的挑战在于实现实际应用中的鲁棒性，例如在恶劣条件下以及随着距离增加保持跟踪一致性。为应对这些挑战，融合激光雷达（LiDAR）、相机和雷达的传感器融合方法应运而生。然而，现有的多模态融合方法通常仅将雷达作为网络内部另一种可学习的特征进行处理。当整体模型在困难环境条件下性能下降时，雷达本可提供的鲁棒性优势也随之减弱。我们提出RadarMOT，一种雷达信息增强的三维多目标跟踪框架，其显式利用雷达点云数据作为额外观测，以优化状态估计并恢复远距离下的检测漏报。在MAN-TruckScenes数据集上的评估表明，RadarMOT能持续提升平均多目标跟踪精度（AMOTA），在远距离场景下绝对提升12.7%，在恶劣天气下绝对提升10.3%。代码将在https://github.com/bingxue-xu/radarmot公开。

摘要 (Abstract)

The challenge of 3D multi-object tracking (3D MOT) is achieving robustness in real-world applications, for example under adverse conditions, and maintaining consistency as distance increases. To overcome these challenges, sensor fusion approaches that combine LiDAR, cameras, and radar have emerged. However, existing multi-modal fusion methods usually treat radar as another learned feature inside the network. When the overall model degrades in difficult environmental conditions, the robustness advantages that radar could provide are also reduced. We propose RadarMOT, a radar-informed 3D MOT framework that explicitly uses radar point cloud data as an additional observation to refine state estimation and recover detector misses at long ranges. Evaluations on the MAN-TruckScenes dataset show that RadarMOT consistently improves the Average Multi-Object Tracking Accuracy (AMOTA), by 12.7% absolute at long range and 10.3% absolute in adverse weather. The code will be available at https://github.com/bingxue-xu/radarmot

Keywords: 3D multi-object tracking, radar point cloud, sensor fusion, adverse conditions, long-range tracking, state estimation, detector misses, AMOTA improvement
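
The abstract says radar points serve as additional observations to refine state estimation, but it does not specify the filter. As a minimal sketch, assuming a scalar Kalman-style measurement update (an assumption, not necessarily RadarMOT's method), a radar range reading could refine a tracked object's position as follows. The function name and the numbers are hypothetical.

```python
def kalman_update(x_pred, p_pred, z, r):
    """Scalar Kalman measurement update (generic textbook form, not RadarMOT's code).

    x_pred, p_pred: predicted state and its variance
    z, r: measurement (e.g. a radar range reading) and its noise variance
    """
    k = p_pred / (p_pred + r)          # Kalman gain: how much to trust the measurement
    x_new = x_pred + k * (z - x_pred)  # move the estimate toward the measurement
    p_new = (1.0 - k) * p_pred         # variance shrinks after the update
    return x_new, p_new


# Hypothetical numbers: the track is predicted at 10 m, radar measures 12 m.
x, p = kalman_update(x_pred=10.0, p_pred=4.0, z=12.0, r=4.0)
print(x, p)
```

With equal prediction and measurement variances the gain is 0.5, so the estimate lands halfway between prediction and measurement.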

202. ❌ ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing

Authors: Zhentao Yang, Yixiang Luomei, Zhuoyang Liu, Zhenyu Liu, Feng Xu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13568v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

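Scoring rationale: The paper focuses on wideband spectrum sensing in wireless communications and proposes a physics-guided coarse-to-fine framework that combines signal processing priors with deep learning. Its core is signal processing, spectrum analysis, and the application of deep learning to a specific engineering domain, not large language models (LLMs) or general large-model techniques. Of all the keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is somewhat related, because the paper applies AI (deep learning) to a scientific/engineering problem (spectrum sensing); since it is not bioinformatics or cheminformatics, it receives 5 points. The other keywords concern LLMs, model training, alignment, inference optimization, agents, and similar topics unrelated to the paper, so they score 0.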

!!! tip deepseek-chat TL;DR

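To address the challenges of wideband spectrum sensing in low-altitude monitoring, the paper proposes ZoomSpec, a physics-guided coarse-to-fine framework that combines signal processing with deep learning and achieves a state-of-the-art 78.1 mAP@0.5:0.95 on a real-world dataset.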

Abstract

Wideband spectrum sensing for low-altitude monitoring is critical yet challenging due to heterogeneous protocols, large bandwidths, and non-stationary SNR. Existing data-driven approaches treat spectrograms as natural images, suffering from domain mismatch: they neglect time-frequency resolution constraints and spectral leakage, leading to poor narrowband visibility. This paper proposes ZoomSpec, a physics-guided coarse-to-fine framework integrating signal processing priors with deep learning. We introduce a Log-Space STFT (LS-STFT) to overcome the geometric bottleneck of linear spectrograms, sharpening narrowband structures while maintaining constant relative resolution. A lightweight Coarse Proposal Net (CPN) rapidly screens the full band. To bridge coarse detection and fine recognition, we design an Adaptive Heterodyne Low-Pass (AHLP) module that executes center-frequency aligning, bandwidth-matched filtering, and safe decimation, purifying signals of out-of-band interference. A Fine Recognition Net (FRN) fuses purified time-domain I/Q with spectral magnitude via dual-domain attention to jointly refine temporal boundaries and modulation classification. Evaluations on the SpaceNet real-world dataset demonstrate state-of-the-art 78.1 mAP@0.5:0.95, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.

Keywords: Wideband spectrum sensing, Physics-guided deep learning, Coarse-to-fine framework, Log-Space STFT, Adaptive Heterodyne Low-Pass, Dual-domain attention, Modulation classification, Low-altitude monitoring
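
The abstract lists three steps for the AHLP module: center-frequency alignment, bandwidth-matched filtering, and safe decimation. It does not give the filter design, so the sketch below is a generic mix-down, boxcar low-pass, and decimate chain in pure Python, not the paper's implementation. The function name, the boxcar filter, and the rule for choosing the decimation factor are all assumptions made for illustration.

```python
import cmath
import math


def heterodyne_lowpass_decimate(iq, fs, f_center, bandwidth):
    """Shift f_center to 0 Hz, low-pass filter, and decimate (generic DSP sketch)."""
    # 1) Center-frequency alignment: mix down with a complex exponential.
    mixed = [x * cmath.exp(-2j * math.pi * f_center * n / fs)
             for n, x in enumerate(iq)]
    # 2) Pick a decimation factor that keeps the new rate at or above 2 * bandwidth.
    d = max(1, int(fs // (2 * bandwidth)))
    # 3) Crude low-pass: average each block of d samples (boxcar filter),
    #    keeping one output per block, which decimates by d.
    out = [sum(mixed[i:i + d]) / d for i in range(0, len(mixed) - d + 1, d)]
    return out, fs / d


fs = 1000.0
in_band = [cmath.exp(2j * math.pi * 200.0 * n / fs) for n in range(1000)]
out_of_band = [cmath.exp(2j * math.pi * 450.0 * n / fs) for n in range(1000)]
kept, new_fs = heterodyne_lowpass_decimate(in_band, fs, f_center=200.0, bandwidth=50.0)
rejected, _ = heterodyne_lowpass_decimate(out_of_band, fs, f_center=200.0, bandwidth=50.0)
print(len(kept), new_fs, abs(kept[0]), max(abs(v) for v in rejected))
```

A tone at the chosen center frequency passes with near-unit magnitude, while the tone 250 Hz away is strongly attenuated by the averaging. A practical design would replace the boxcar with a properly specified FIR filter.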

203. ❌ AI Powered Image Analysis for Phishing Detection

Authors: K. Acharya, S. Ale, R. Kadel Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13555v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

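Scoring rationale: The paper focuses on image-based phishing website detection with deep learning vision models (ConvNeXt-Tiny and Vision Transformer), an applied computer vision topic. It covers deep learning, image analysis, phishing detection, model comparison, and threshold optimization, but does not involve large language models (LLMs), the principles of large-model techniques, AI-for-science applications, or any specific technique listed in the scoring keywords (such as MoE, scaling laws, RLHF, RAG, or agents). All keywords relate to large models or the principles of deep learning techniques, while this paper is a pure vision-model application, so the relevance of every keyword is 0.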

!!! tip deepseek-chat TL;DR

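The study proposes a deep learning image-analysis approach that detects phishing websites from webpage screenshots using ConvNeXt-Tiny and Vision Transformer models, finds that ConvNeXt-Tiny is better in both F1-score and computational efficiency, and highlights the importance of threshold tuning for real-world deployment.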

Abstract

Phishing websites now rely heavily on visual imitation (copied logos, similar layouts, and matching colours) to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.

Keywords: phishing detection, deep learning, image analysis, ConvNeXt-Tiny, Vision Transformer, threshold tuning, F1-score, computational efficiency
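
The threshold-aware evaluation described in the abstract can be illustrated by sweeping candidate decision thresholds and computing precision, recall, and F1 at each one. The sketch below is only an illustration of that idea: the scores, labels, and candidate thresholds are made-up toy values, and the function names are hypothetical rather than taken from the paper.

```python
def prf_at_threshold(scores, labels, threshold):
    """Precision, recall, and F1 when score >= threshold is predicted as phishing (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: prf_at_threshold(scores, labels, t)[2])


# Toy model outputs (probability of phishing) and ground-truth labels.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
chosen = best_threshold(scores, labels, candidates=[0.3, 0.5, 0.7])
print(chosen, prf_at_threshold(scores, labels, chosen))
```

On this toy data the lowest threshold catches every phishing page at the cost of false alarms, while the highest one removes the false alarms but misses one page. That is the trade-off an operating point has to balance.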

204. ❌ Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

Authors: Elton Cao, Hod Lipson Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13549v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

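Scoring rationale: The paper focuses on 3D reconstruction in computer vision, using generative depth estimation (a Latent Diffusion Model with ControlNet), and is deep learning research for a specific engineering application. It is unrelated to the vast majority of keywords (large-model principles, training methods, inference optimization, alignment, agents, and so on) and only somewhat related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because its application can be viewed as AI applied to a scientific/engineering field (computer vision, digital fabrication), though it is not core bioinformatics or cheminformatics.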

!!! tip deepseek-chat TL;DR

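The study addresses the challenge of reconstructing a 3D wireframe model from a single 2D line drawing, proposing a generative method based on conditional depth estimation (a latent diffusion model with a ControlNet-style framework) that robustly converts sparse 2D sketches into dense 3D representations.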

Abstract

The conversion of 2D freehand sketches into 3D models remains a pivotal challenge in computer vision, bridging the gap between human creativity and digital fabrication. Traditional line drawing reconstruction relies on brittle symbolic logic, while modern approaches are constrained by rigid parametric modeling, limiting users to predefined CAD primitives. We propose a generative approach by framing reconstruction as a conditional dense depth estimation task. To achieve this, we implement a Latent Diffusion Model (LDM) with a ControlNet-style conditioning framework to resolve the inherent ambiguities of orthographic projections. To support an iterative “sketch-reconstruct-sketch” workflow, we introduce a graph-based BFS masking strategy to simulate partial depth cues. We train and evaluate our approach using a massive dataset of over one million image-depth pairs derived from the ABC Dataset. Our framework demonstrates robust performance across varying shape complexities, providing a scalable pipeline for converting sparse 2D line drawings into dense 3D representations, effectively allowing users to “draw in 3D” without the rigid constraints of traditional CAD.

Keywords: 3D reconstruction, line drawing, depth estimation, generative model, latent diffusion model, ControlNet, computer vision, CAD
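
The abstract names a graph-based BFS masking strategy for simulating partial depth cues but gives no details. One possible reading, sketched below under that assumption, treats wireframe vertices as graph nodes and reveals depth only for the first few vertices reached by breadth-first search from a seed vertex. The function name, the adjacency format, and the reveal budget are hypothetical.

```python
from collections import deque


def bfs_revealed_vertices(adjacency, seed, num_revealed):
    """Return up to num_revealed (>= 1) vertices in BFS order starting from seed."""
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < num_revealed:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == num_revealed:
                    break
    return visited


# Toy wireframe: a square with vertices 0-1-2-3 connected in a loop.
square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
revealed = bfs_revealed_vertices(square, seed=0, num_revealed=3)
masked = [v for v in square if v not in revealed]
print(revealed, masked)
```

Because BFS expands outward from the seed, the revealed vertices form a connected patch, which resembles a user having already fixed the depth of one local region of the sketch.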

205. ❌ Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

Authors: Jianzong Wang, Botao Zhao, Yayun He, Junqing Peng, Xulong Zhang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13533v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

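Scoring rationale: The paper proposes the EEAgent framework, which uses large vision-language models (VLMs) for environmental interpretation and policy planning, an application of large models to robotics. Its core innovation is the LSTRO mechanism, which dynamically refines prompts by reflecting on past experiences and newly learned lessons, enabling continuous self-evolution. This is highly relevant to 'LLM Agents' and 'Self-Correction' (10 points each), because the paper studies the self-improvement ability of autonomous agents. It is relevant to 'Large Language Models' (8 points) because it uses VLMs (vision-language models being a kind of LLM). It is somewhat related to 'Chain of Thought', 'System 2 Thinking', and 'In-context Learning' (5 points each), because the reflection and optimization process involves multi-step reasoning and in-context learning. Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.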

!!! tip deepseek-chat TL;DR

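The paper proposes the evolvable embodied agent (EEAgent) framework, which uses large vision-language models for environmental interpretation and policy planning through a long short-term reflective optimization (LSTRO) mechanism, achieving state-of-the-art performance on VIMA-Bench tasks and markedly higher task success rates in complex scenarios.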

Abstract

Achieving general-purpose robotics requires empowering robots to adapt and evolve based on their environment and feedback. Traditional methods face limitations such as extensive training requirements, difficulties in cross-task generalization, and lack of interpretability. Prompt learning offers new opportunities for self-evolving robots without extensive training, simply by reflecting on past experiences. However, extracting meaningful insights from task successes and failures remains a challenge. To this end, we propose the evolvable embodied agent (EEAgent) framework, which leverages large vision-language models (VLMs) for better environmental interpretation and policy planning. To enhance reflection on past experiences, we propose a long short-term reflective optimization (LSTRO) mechanism that dynamically refines prompts based on both past experiences and newly learned lessons, facilitating continuous self-evolution, thereby enhancing overall task success rates. Evaluations on six VIMA-Bench tasks reveal that our approach sets a new state-of-the-art, notably outperforming baselines in complex scenarios.

Keywords: evolvable embodied agent, large vision-language models, long short-term reflective optimization, self-evolution, robotic manipulation, prompt learning, VIMA-Bench, policy planning

206. ❌ Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

Authors: Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, SeungHwan Kim, Honglak Lee, Bohyung Han Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13508v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

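Scoring rationale: The paper's core topic is an initialization method for Mixture-of-Experts (MoE) models, so it is highly relevant to 'Mixture of Experts OR MoE OR Sparse Models' (10 points). It initializes the MoE from pretrained dense weights, which gives some relation to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (5 points). It involves large-model techniques but does not explicitly use LLMs, so it is indirectly related to 'Large Language Models OR LLMs OR Foundation Models' (5 points). Other keywords such as SLMs, SFT, RLHF, and RAG are not covered in the paper and score 0.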

!!! tip deepseek-chat TL;DR

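The paper proposes Cluster-aware Upcycling, an MoE initialization method that breaks expert symmetry through semantic clustering and subspace representations, outperforming existing methods on CLIP ViT models and producing more diverse and disentangled expert representations.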

Abstract

Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model’s input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router’s initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.

Keywords: Mixture-of-Experts, Sparse Upcycling, Cluster-aware Upcycling, Expert Specialization, Semantic Clustering, Expert-ensemble Self-distillation, CLIP ViT, Zero-shot and Few-shot Benchmarks
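
The abstract describes setting the router's initial weights to the cluster centroids. The sketch below illustrates only that one step, assuming cluster assignments are already available (for example from k-means) and that routing picks the expert with the largest dot-product logit. The truncated-SVD expert initialization is omitted, and all names and toy values are illustrative rather than taken from the paper.

```python
def cluster_centroids(activations, assignments, num_clusters):
    """Mean activation vector per cluster; an empty cluster yields a zero vector."""
    dim = len(activations[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for vector, cluster in zip(activations, assignments):
        counts[cluster] += 1
        for i, value in enumerate(vector):
            sums[cluster][i] += value
    return [[s / max(counts[c], 1) for s in sums[c]] for c in range(num_clusters)]


def route(router_weights, x):
    """Pick the expert whose router row has the largest dot product with x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 2-D activations that fall into two clusters.
activations = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
assignments = [0, 0, 1, 1]
router = cluster_centroids(activations, assignments, num_clusters=2)
print(router, route(router, [1.0, 0.1]), route(router, [0.0, 1.0]))
```

With centroid-initialized rows, inputs resembling a cluster are routed to that cluster's expert from the first step, which is the symmetry-breaking effect the abstract attributes to the method.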

207. ❌ ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer’s Disease Progression

Authors: Juneyong Lee, Geonwoo Baek, Ikbeom Jang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13495v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

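Scoring rationale: The paper focuses on medical image generation, specifically brain MRI synthesis for Alzheimer's disease progression. It uses a transformer-based diffusion model (DiT) with text guidance, but its core is a medical image generation application rather than general large language model techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work belongs to bioinformatics/medical AI applications, so it receives 10 points. All other keywords concern specific LLM techniques, training methods, inference optimization, or agent systems, which are unrelated to the paper's focus on medical image generation, so they score 0.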

!!! tip deepseek-chat TL;DR

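The study proposes ADP-DiT, a text-guided diffusion transformer model that generates longitudinal brain MRI images of Alzheimer's disease progression and significantly improves image quality and anatomical fidelity by integrating clinical text conditioning.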

Abstract

Alzheimer’s disease (AD) progresses heterogeneously across individuals, motivating subject-specific synthesis of follow-up magnetic resonance imaging (MRI) to support progression assessment. While Diffusion Transformers (DiT), an emerging transformer-based diffusion model, offer a scalable backbone for image synthesis, longitudinal AD MRI generation with clinically interpretable control over follow-up time and participant metadata remains underexplored. We present ADP-DiT, an interval-aware, clinically text-conditioned diffusion transformer for longitudinal AD MRI synthesis. ADP-DiT encodes follow-up interval together with multi-domain demographic, diagnostic (CN/MCI/AD), and neuropsychological information as a natural-language prompt, enabling time-specific control beyond coarse diagnostic stages. To inject this conditioning effectively, we use dual text encoders: OpenCLIP for vision-language alignment and T5 for richer clinical-language understanding. Their embeddings are fused into DiT through cross-attention for fine-grained guidance and adaptive layer normalization for global modulation. We further enhance anatomical fidelity by applying rotary positional embeddings to image tokens and performing diffusion in a pre-trained SDXL-VAE latent space to enable efficient high-resolution reconstruction. On 3,321 longitudinal 3T T1-weighted scans from 712 participants (259,038 image slices), ADP-DiT achieves SSIM 0.8739 and PSNR 29.32 dB, improving over a DiT baseline by +0.1087 SSIM and +6.08 dB PSNR while capturing progression-related changes such as ventricular enlargement and shrinking hippocampus. These results suggest that integrating comprehensive, subject-specific clinical conditions with architectures can improve longitudinal AD MRI synthesis.

Keywords: Alzheimer’s disease, Diffusion Transformer, MRI generation, text-guided synthesis, longitudinal imaging, clinical text conditioning, brain image synthesis, AD progression

208. ❌ RadarSplat-RIO: Indoor Radar-Inertial Odometry with Gaussian Splatting-Based Radar Bundle Adjustment

Authors: Pou-Chun Kung, Yuan Tian, Zhengqin Li, Yue Liu, Eric Whitmire, Wolf Kienzle, Hrvoje Benko Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13492v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

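Scoring rationale: The paper focuses on radar-inertial odometry and SLAM, using Gaussian Splatting for radar bundle adjustment, which belongs to robot localization and mapping. All scoring keywords concern large language models, the principles of deep learning techniques, and their applications, none of which this paper covers, so the relevance of every keyword is 0.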

!!! tip deepseek-chat TL;DR

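The paper proposes the first Gaussian Splatting-based radar bundle adjustment framework for radar-inertial odometry, which significantly reduces pose drift and lowers average absolute translational and rotational errors by 90% and 80%, respectively, in indoor scenes.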

Abstract

Radar is more resilient to adverse weather and lighting conditions than visual and Lidar simultaneous localization and mapping (SLAM). However, most radar SLAM pipelines still rely heavily on frame-to-frame odometry, which leads to substantial drift. While loop closure can correct long-term errors, it requires revisiting places and relies on robust place recognition. In contrast, visual odometry methods typically leverage bundle adjustment (BA) to jointly optimize poses and map within a local window. However, an equivalent BA formulation for radar has remained largely unexplored. We present the first radar BA framework enabled by Gaussian Splatting (GS), a dense and differentiable scene representation. Our method jointly optimizes radar sensor poses and scene geometry using full range-azimuth-Doppler data, bringing the benefits of multi-frame BA to radar for the first time. When integrated with an existing radar-inertial odometry frontend, our approach significantly reduces pose drift and improves robustness. Across multiple indoor scenes, our radar BA achieves substantial gains over the prior radar-inertial odometry, reducing average absolute translational and rotational errors by 90% and 80%, respectively.

Keywords: Radar SLAM, Gaussian Splatting, Bundle Adjustment, Radar-Inertial Odometry, Indoor Localization, Pose Optimization, Scene Representation, Drift Reduction

209. ❌ Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

Authors: Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13491v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

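Scoring rationale: The paper focuses on applying multimodal large language models (MLLMs) to text-to-image generation and proposes the FiMR framework, which achieves fine-grained reasoning and self-improvement through decomposed visual question answering (VQA). Core relevant keywords: 'Large Language Models' (MLLMs fall under large models; weight 1.0, relevance 10), 'Chain of Thought' and 'System 2 Thinking' (multi-step and in-depth reasoning; weight 1.0 each, relevance 10), and 'Self-Correction' (emphasis on self-reflection and self-improvement; weight 1.0, relevance 10). Other keywords are partially related: 'Post-training' (may involve fine-tuning; weight 1.0, relevance 5), 'Hallucination Mitigation' (improves image-prompt alignment; weight 1.0, relevance 5), and 'Explainable AI' (feedback provided through VQA; weight 1.0, relevance 5). The remaining keywords are unrelated to the paper's topic and have relevance 0. The weighted total is: 10×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 5×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 10×1.0 + 10×1.0 + 0×1.0 + 10×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 5×1.0 + 5×1.0 + 0×1.0 + 0×1.0 + 0×1.0 + 0×1.0 = 55.0. No designated experts appear in the author list, so no bonus is added.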
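
The weighted total quoted in the rationale can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the pass-mark formula (5.0 + 0.8 × total weight) is taken from the report's configuration section, assuming 27 keywords of weight 1.0 each.

```python
def weighted_total(relevances, weights):
    """Sum of relevance x weight over all keywords."""
    return sum(r * w for r, w in zip(relevances, weights))


def pass_mark(total_weight, base=5.0, factor=0.8):
    """Pass mark as stated in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


# Relevance values in table order (27 keywords), as listed in the score table above.
relevances = [10, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 10, 10,
              0, 10, 0, 0, 0, 0, 0, 5, 5, 0, 0, 0, 0]

# Weights as stated in the rationale text (1.0 per keyword).
weights_in_rationale = [1.0] * len(relevances)
# Weights as printed in the score table (0.0 per keyword).
weights_in_table = [0.0] * len(relevances)

total_rationale = weighted_total(relevances, weights_in_rationale)
total_table = weighted_total(relevances, weights_in_table)
threshold = pass_mark(sum(weights_in_rationale))

print(total_rationale, total_table, round(threshold, 1))
```

With the weights of 1.0 quoted in the rationale the total is 55.0; with the weights of 0.0 printed in the table the same sum is 0.0, which is the score the report records for this entry.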

!!! tip deepseek-chat TL;DR

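To address the lack of fine-grained control when multimodal large language models are used for text-to-image generation, the paper proposes the FiMR framework, which performs fine-grained reasoning and self-improvement through decomposed visual question answering. Experiments show that it outperforms existing baselines in image-prompt alignment and generation quality.

Abstract translation

With the rapid development of multimodal large language models (MLLMs), unified MLLMs that perform both image understanding and generation have made significant progress. However, although unified MLLMs have inherent reasoning capabilities for self-reflection and self-refinement, their application to text-to-image generation remains largely underexplored. Meanwhile, most existing image generation methods based on multimodal reasoning rely on holistic image-text alignment judgments and lack fine-grained reflection on and refinement of detailed prompt attributes, which limits fine-grained control. To address this, we propose the Fine-grained Multimodal Reasoning (FiMR) framework, which uses decomposed visual question answering (VQA) to break the input prompt into minimal semantic units (such as entities and attributes) and verifies each unit via VQA to produce explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement lets MLLMs improve image-prompt alignment and overall generation quality more precisely at test time. Extensive experiments show that FiMR consistently outperforms existing image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.

Abstract

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units (such as entities and attributes) and verify each unit via VQA to generate explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement enable MLLMs to achieve more precise improvements in image-prompt alignment and overall generation quality at test time. Extensive experiments demonstrate that FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.

Keywords: Multimodal Large Language Models, text-to-image generation, fine-grained reasoning, visual question answering, self-reflection, self-refinement, image-prompt alignment, FiMR framework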
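
The decompose, verify, and refine cycle described in the abstract can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: `refine_until_aligned` and its callables are hypothetical names, the prompt is assumed to be already split into units, and an "image" is represented by a toy set of the units it depicts.

```python
def refine_until_aligned(units, generate, verify, refine, max_rounds=3):
    """Generate once, check every unit, and refine only the units that failed."""
    image = generate(units)
    for _ in range(max_rounds):
        # Per-unit verification gives explicit, fine-grained feedback.
        failed = [u for u in units if not verify(image, u)]
        if not failed:
            break
        # Targeted refinement: only the failed units are passed on.
        image = refine(image, failed)
    return image


# Toy stand-ins (assumptions): an "image" is the set of prompt units it depicts.
units = ["a red cube", "a blue sphere", "cube left of sphere"]
generate = lambda us: {us[0]}                 # first draft misses two units
verify = lambda image, unit: unit in image    # stands in for a VQA check
refine = lambda image, failed: image | set(failed)

final_image = refine_until_aligned(units, generate, verify, refine)
print(sorted(final_image))
```

In a real system each callable would wrap a model call; the loop structure is the only part this sketch is meant to illustrate.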

210. ❌ Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention

Authors: Lakmali Nadeesha Kumari, Sen-Ching Samson Cheung Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13479v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

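Scoring rationale: The paper addresses class imbalance in medical image segmentation and proposes a method based on dynamic focal attention; it belongs to computer vision and medical image analysis. Its content is unrelated to the vast majority of keywords (which cover large-model principles, training methods, inference optimization, alignment, agent systems, and so on), because those keywords mainly target natural language processing and large language models. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to biomedicine (specifically histopathology image analysis), which falls under AI for Science, but it is not a core large-model application, so it receives a relevance of 8 (some connection, but not core).

!!! tip deepseek-chat TL;DR

To address class imbalance in semantic segmentation of histopathology images, the paper proposes a dynamic focal attention mechanism that improves segmentation by learning class-specific difficulty, outperforming conventional loss reweighting on several benchmarks.

Abstract translation

Semantic segmentation of histopathology images under class imbalance is usually handled with frequency-based loss reweighting, which implicitly assumes that rare classes are hard to segment. However, real segmentation difficulty also stems from morphological variability, boundary ambiguity, and contextual similarity, factors that frequency cannot capture. We propose Dynamic Focal Attention, a simple and efficient mechanism that learns class-specific difficulty directly in the cross-attention of query-based mask decoders. It introduces a learnable per-class bias to the attention logits, enabling representation-level reweighting before prediction rather than gradient-level reweighting after prediction. The bias is initialized from a log-frequency prior to prevent gradient starvation and is optimized end-to-end so that the model can adaptively capture difficulty signals during training, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework. On three histopathology benchmark datasets, the method consistently improves Dice and IoU, matching or exceeding a difficulty-aware baseline without a separate estimator or additional training stage. These results indicate that encoding class difficulty at the representation level offers a principled alternative to conventional loss reweighting for imbalanced segmentation.

Abstract

Semantic segmentation of histopathology images under class imbalance is typically addressed through frequency-based loss reweighting, which implicitly assumes that rare classes are difficult. However, true difficulty also arises from morphological variability, boundary ambiguity, and contextual similarity, factors that frequency cannot capture. We propose Dynamic Focal Attention (DFA), a simple and efficient mechanism that learns class-specific difficulty directly within the cross-attention of query-based mask decoders. DFA introduces a learnable per-class bias to attention logits, enabling representation-level reweighting prior to prediction rather than gradient-level reweighting after prediction. Initialised from a log-frequency prior to prevent gradient starvation, the bias is optimised end-to-end, allowing the model to adaptively capture difficulty signals through training, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework. On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU, matching or exceeding a difficulty-aware baseline without a separate estimator or additional training stage. These results demonstrate that encoding class difficulty at the representation level provides a principled alternative to conventional loss reweighting for imbalanced segmentation.

Keywords: histopathology segmentation, class imbalance, dynamic focal attention, difficulty-aware learning, cross-attention, mask decoder, semantic segmentation, medical image analysis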
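
To make the idea of a per-class bias on attention logits concrete, here is a minimal sketch of adding a bias derived from class frequencies to logits before a softmax. It is not the paper's implementation: the function names are hypothetical, the sign convention `-log(frequency)` (rarer classes get a larger bias) is an assumption, and the bias here is fixed, whereas the abstract describes it as a learnable parameter that is only initialised from the prior.

```python
import math


def log_frequency_bias(class_counts):
    """Initial per-class bias from class frequencies (assumed sign: -log f)."""
    total = sum(class_counts)
    return [-math.log(c / total) for c in class_counts]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def biased_attention(logits, bias):
    """Add the per-class bias to the logits, then normalise."""
    return softmax([l + b for l, b in zip(logits, bias)])


# Hypothetical pixel counts for a common, a medium, and a rare class.
counts = [900, 90, 10]
bias = log_frequency_bias(counts)

# With equal raw logits, the bias alone decides how attention is distributed.
attn = biased_attention([0.0, 0.0, 0.0], bias)
print([round(a, 3) for a in attn])
```

Under this assumed sign convention the rarest class receives the largest share of attention at initialisation; training would then move the bias away from the pure frequency prior.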

211. ❌ RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

Authors: Jiahao Ma, Qiang Zhang, Peiran Liu, Zeran Su, Pihai Sun, Gang Han, Wen Zhao, Wei Cui, Zhang Zhang, Zhiyuan Xu, Renjing Xu, Jian Tang, Miaomiao Liu, Yijie Guo Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13476v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

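Scoring rationale: The paper focuses on a robotic vision system and proposes a 360-degree surround-view framework for real-time rendering, reconstruction, and streaming. It covers computer vision, robot perception, 3D reconstruction, and real-time systems, but does not involve large language models, the underlying principles of deep learning techniques, AI for Science, or any technique named in the scoring keywords. All keywords relate to large models, deep learning techniques, or scientific applications of AI, whereas this is purely a robotic vision and perception system study, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

To address the limitations of surround-view perception in robot navigation and manipulation, the paper proposes RobotPan, a real-time 360-degree vision system that predicts compact, metric-scaled 3D Gaussians to achieve high-quality reconstruction and view synthesis while reducing computational redundancy.

Abstract translation

Surround-view perception is increasingly important for robot navigation and mobile manipulation, especially in human-in-the-loop scenarios such as teleoperation, data collection, and emergency takeover. However, current robot vision interfaces are usually limited to a narrow forward-facing field of view or, when multiple on-board cameras are fitted, require cumbersome manual switching that interrupts the operator's workflow. Both configurations suffer from motion-induced jitter, which tends to cause simulator sickness for users of head-mounted displays. We propose a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360-degree visual coverage while meeting the geometric and real-time constraints of embodied deployment. We further present RobotPan, a feed-forward framework that predicts metric-scaled, compact 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. RobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii, which reduces computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360-degree novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion tasks on real platforms. Experiments show that RobotPan is competitive in quality with prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment. Project website: https://robotpan.github.io/

Abstract

Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views, or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator’s workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360° visual coverage, while meeting the geometric and real-time constraints of embodied deployment. We further present RobotPan, a feed-forward framework that predicts metric-scaled and compact 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. RobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360° novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion on real platforms. Experiments show that RobotPan achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment. Project website: https://robotpan.github.io/

Keywords: surround-view perception, robotic vision system, 360-degree visual coverage, 3D Gaussians, real-time rendering, metric 3D reconstruction, novel view synthesis, embodied deployment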

212. ❌ MyoVision: A Mobile Research Tool and NEATBoost-Attention Ensemble Framework for Real Time Chicken Breast Myopathy Detection

Authors: Chaitanya Pallerla, Siavash Mahmoudi, Dongyi Wang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13456v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

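Scoring rationale: The paper uses mobile imaging and machine learning (LightGBM, MLP, NEAT) to detect chicken breast myopathies, an application of AI in bioinformatics/science. It does not involve any core large-model technology such as large language models (LLMs), deep learning principles, model training or alignment methods, inference optimization, or agent systems. It has only a weak, application-level connection to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics'; all other keywords are unrelated.

!!! tip deepseek-chat TL;DR

The study develops MyoVision, a mobile imaging framework, and a NEATBoost-Attention ensemble model for low-cost, non-destructive, real-time detection of Woody Breast and Spaghetti Meat myopathies in chicken meat. On a dataset of 336 fillets it reaches 82.4% test accuracy, comparable to expensive hyperspectral imaging systems.

Abstract translation

Woody Breast (WB) and Spaghetti Meat (SM) myopathies significantly affect poultry meat quality, yet existing detection methods rely mainly on subjective manual assessment or expensive laboratory-grade imaging systems. This study aims to achieve low-cost, non-destructive multi-class myopathy classification with consumer smartphones. We present MyoVision, a mobile transillumination imaging framework that captures 14-bit RAW images and extracts structural texture features reflecting internal tissue abnormalities. To classify three categories (Normal, Woody Breast, and Spaghetti Meat), we propose a NEATBoost-Attention ensemble model, which uses neuroevolution to optimize a weighted fusion of LightGBM and an attention-based multilayer perceptron. The system uses NeuroEvolution of Augmenting Topologies (NEAT) to search hyperparameters automatically, avoiding manual tuning and providing architectural diversity for small tabular datasets. On a dataset of 336 chicken breast fillets collected from a commercial processing plant, the method achieves 82.4% test accuracy (F1 = 0.83), outperforming conventional machine learning and deep learning baselines and matching the performance reported for hyperspectral imaging systems that cost orders of magnitude more. Beyond classification performance, MyoVision establishes a reproducible mobile RGB-D acquisition pipeline to support multimodal meat quality research, showing that consumer-grade imaging can enable scalable internal tissue assessment.

Abstract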

Woody Breast (WB) and Spaghetti Meat (SM) myopathies significantly impact poultry meat quality, yet current detection methods rely either on subjective manual evaluation or costly laboratory-grade imaging systems. We address the problem of low-cost, non-destructive multi-class myopathy classification using consumer smartphones. MyoVision is introduced as a mobile transillumination imaging framework in which 14-bit RAW images are captured and structural texture descriptors indicative of internal tissue abnormalities are extracted. To classify three categories (Normal, Woody Breast, Spaghetti Meat), we propose a NEATBoost-Attention Ensemble model, which is a neuroevolution-optimized weighted fusion of LightGBM and attention-based MLP models. Hyperparameters are automatically discovered using NeuroEvolution of Augmenting Topologies (NEAT), eliminating manual tuning and enabling architecture diversity for small tabular datasets. On a dataset of 336 fillets collected from a commercial processing facility, our method achieves 82.4% test accuracy (F1 = 0.83), outperforming conventional machine learning and deep learning baselines and matching performance reported by hyperspectral imaging systems costing orders of magnitude more. Beyond classification performance, MyoVision establishes a reproducible mobile RGB-D acquisition pipeline for multimodal meat quality research, demonstrating that consumer-grade imaging can support scalable internal tissue assessment.

Keywords: Myopathy detection, Mobile imaging, Transillumination, NEATBoost-Attention Ensemble, NeuroEvolution of Augmenting Topologies, LightGBM, Poultry meat quality, Real-time classification
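
The "weighted fusion" of two classifiers mentioned in the abstract can be illustrated with a convex combination of their class-probability vectors. This is a minimal sketch, not the authors' pipeline: the function names, the example probabilities, and the fixed fusion weight are all hypothetical, and in the paper the weighting is described as being found by neuroevolution rather than set by hand.

```python
CLASSES = ["Normal", "Woody Breast", "Spaghetti Meat"]


def fuse(p_gbm, p_mlp, w_gbm):
    """Convex combination of two class-probability vectors (0 <= w_gbm <= 1)."""
    return [w_gbm * a + (1.0 - w_gbm) * b for a, b in zip(p_gbm, p_mlp)]


def predict(probs):
    """Return the class label with the highest fused probability."""
    return CLASSES[max(range(len(probs)), key=probs.__getitem__)]


# Hypothetical per-class probabilities from the two base models for one fillet.
p_gbm = [0.6, 0.3, 0.1]   # gradient-boosted trees output (made-up values)
p_mlp = [0.2, 0.5, 0.3]   # attention-based MLP output (made-up values)

fused = fuse(p_gbm, p_mlp, w_gbm=0.25)
label = predict(fused)
print(label, [round(p, 3) for p in fused])
```

Because the weights sum to one, the fused vector is still a valid probability distribution; the fusion weight decides which base model dominates when they disagree, as they do in this example.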

213. ❌ VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning

Authors: Yifan Li, Pei Cheng, Bin Fu, Shuai Yang, Jiaying Liu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13425v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

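Scoring rationale: VibeFlow addresses the computer vision task of video color and illumination editing and proposes a self-supervised learning framework that exploits the intrinsic physical understanding of pre-trained video generation models. Its core concerns are video processing, self-supervised learning, structure preservation, and temporal consistency, which are unrelated to the vast majority of keywords (especially large models, reasoning, alignment, and agents). The only slightly relevant keyword is 'Pre-training OR Continual Pre-training OR Domain Adaptation', because the paper mentions using 'pre-trained video generation models'; however, this is not the paper's main contribution, and the models serve only as a foundation, so it receives a relevance of 5 (some connection). No other keywords are involved.

!!! tip deepseek-chat TL;DR

The paper proposes VibeFlow, a self-supervised learning framework for video color and illumination editing. By using disentangled data perturbation and introducing residual velocity fields, it achieves high-quality video relighting, recoloring, and other editing tasks without paired supervision data.

Abstract translation

Video chroma-lux editing, which aims to modify illumination and color while preserving structural and temporal fidelity, remains a significant challenge. Existing methods usually rely on expensive supervised training with synthetic paired data. This paper proposes VibeFlow, a novel self-supervised framework that unlocks the intrinsic physical understanding of pre-trained video generation models. Rather than learning color and illumination transitions from scratch, we introduce a disentangled data perturbation pipeline that forces the model to adaptively recombine the structure of the source video with the color-illumination cues of a reference image, achieving robust disentanglement in a self-supervised manner. In addition, to correct the discretization errors inherent in flow-based models, we introduce residual velocity fields together with a structural distortion consistency regularization to ensure strict structural preservation and temporal coherence. Our framework requires no costly training resources and generalizes zero-shot to a range of applications, including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing. Extensive experiments show that VibeFlow achieves impressive visual quality at significantly lower computational cost. The project is publicly available at https://lyf1212.github.io/VibeFlow-webpage

Abstract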

Video chroma-lux editing, which aims to modify illumination and color while preserving structural and temporal fidelity, remains a significant challenge. Existing methods typically rely on expensive supervised training with synthetic paired data. This paper proposes VibeFlow, a novel self-supervised framework that unleashes the intrinsic physical understanding of pre-trained video generation models. Instead of learning color and light transitions from scratch, we introduce a disentangled data perturbation pipeline that enforces the model to adaptively recombine structure from source videos and color-illumination cues from reference images, enabling robust disentanglement in a self-supervised manner. Furthermore, to rectify discretization errors inherent in flow-based models, we introduce Residual Velocity Fields alongside a Structural Distortion Consistency Regularization, ensuring rigorous structural preservation and temporal coherence. Our framework eliminates the need for costly training resources and generalizes in a zero-shot manner to diverse applications, including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing. Extensive experiments demonstrate that VibeFlow achieves impressive visual quality with significantly reduced computational overhead. Our project is publicly available at https://lyf1212.github.io/VibeFlow-webpage.

Keywords: video chroma-lux editing, self-supervised learning, pre-trained video generation models, disentangled data perturbation, residual velocity fields, structural distortion consistency, temporal coherence, zero-shot generalization

214. ❌ Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens

Authors: Zhiwen Zheng, Yuheng Qiao, Xiaoshuai Zhang, Zhao Huang, Tao Zhang, Huiyu Zhou, Shaowei Jiang, Jin Liu, Wenwen Tang, Xingru Huang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13419v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

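Scoring rationale: The paper studies a non-contact side-channel attack technique in computer vision and physical optics, involving optical projection, the radiative transfer equation, and image reconstruction. It does not involve large language models, deep learning principles, or AI applications in science. All keywords relate to large models, deep learning, or scientific applications of AI, whereas this is purely computer vision / optical physics research, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes IR4Net, an optical-projection side-channel attack method. Using physically regularized irradiance approximation and irreversibility-constrained semantic reprojection, it addresses projection instability and loss of semantic information in non-contact theft of screen content, achieving high-fidelity image reconstruction across multiple scenarios.

Abstract translation

Non-contact leakage of electronic screen content poses a security challenge, with side-channel intrusion as the main attack vector. This paper proposes an optical-projection side-channel analysis paradigm that addresses two core instabilities: (i) the near-singular Jacobian spectrum of the projection mapping violates Hadamard stability, making the inverse mapping extremely sensitive to perturbations; and (ii) irreversible compression in light transport eliminates global semantic cues, increasing reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) combines a Physically Regularized Irradiance Approximation (PRIrr-Approximation) module, which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that suppresses noise propagation. In addition, an Irreversibility Constrained Semantic Reprojection (ICSR) module restores lost global structure through context-driven semantic mapping. Evaluation across four scene categories shows that IR4Net surpasses existing neural methods in reconstruction fidelity while remaining robust to illumination perturbations.

Abstract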

Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR4Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.

Keywords: side-channel attack, optical projection, non-contact exfiltration, irradiance robust radiometric inversion, physically regularized irradiance approximation, irreversibility constrained semantic reprojection, screen content reconstruction, diffuse reflection speckle patterns

215. ❌ CausalDisenSeg: A Causality-Guided Disentanglement Framework with Counterfactual Reasoning for Robust Brain Tumor Segmentation Under Missing Modalities

Authors: Bo Liu, Yulong Zou, Jin Hong Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13409v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

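Scoring rationale: The paper focuses on deep learning models for medical image segmentation, specifically brain tumor segmentation, and uses causal reasoning and counterfactual analysis to handle missing modalities. Among all keywords, only 'AI for Science OR Bioinformatics OR Cheminformatics' is relevant, because the paper belongs to bioinformatics / medical AI applications, though it is not a large-model technique. The other keywords concern large models, training techniques, inference optimization, agent systems, and so on, and have no direct connection to the paper's causal deep learning framework.

!!! tip deepseek-chat TL;DR

The paper proposes CausalDisenSeg, a deep learning framework based on causal reasoning and counterfactual analysis, to improve robustness when MRI data are missing in multimodal brain tumor segmentation; it achieves state-of-the-art performance on the BraTS datasets.

Abstract translation

In clinical practice, the robustness of deep learning models for multimodal brain tumor segmentation is often severely compromised by incomplete MRI data. This vulnerability stems mainly from modality bias, in which models exploit spurious correlations as shortcuts rather than learning true anatomical structures. Existing feature fusion methods fail to eliminate this dependency at its root. To address this, we propose CausalDisenSeg, a novel framework grounded in a Structural Causal Model (SCM) that achieves robust segmentation through causality-guided disentanglement and counterfactual reasoning. We reframe the problem as separating the anatomical Causal Factor from the stylistic Bias Factor. Our framework implements a three-stage causal intervention: (1) Explicit causal disentanglement: a Conditional Variational Autoencoder (CVAE) combined with a Hilbert-Schmidt Independence Criterion (HSIC) constraint mathematically enforces statistical orthogonality between anatomical and style features. (2) Causal representation reinforcement: a Region Causality Module (RCM) explicitly anchors causal features to physical tumor regions. (3) Counterfactual reasoning: a dual-adversarial strategy actively suppresses the residual Natural Direct Effect (NDE) of the bias, forcing its spatial attention to be mutually exclusive with the causal path. Extensive experiments on the BraTS 2020 dataset show that CausalDisenSeg significantly outperforms existing state-of-the-art methods in accuracy and consistency under severe missing-modality scenarios. Furthermore, cross-dataset evaluation on BraTS 2023 under the same protocol yields a macro-average Dice similarity coefficient (DSC) of 84.49%, reaching the state of the art.

Abstract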

In clinical practice, the robustness of deep learning models for multimodal brain tumor segmentation is severely compromised by incomplete MRI data. This vulnerability stems primarily from modality bias, where models exploit spurious correlations as shortcuts rather than learning true anatomical structures. Existing feature fusion methods fail to fundamentally eliminate this dependency. To address this, we propose CausalDisenSeg, a novel Structural Causal Model (SCM)-grounded framework that achieves robust segmentation via causality-guided disentanglement and counterfactual reasoning. We reframe the problem as isolating the anatomical Causal Factor from the stylistic Bias Factor. Our framework implements a three-stage causal intervention: (1) Explicit Causal Disentanglement: A Conditional Variational Autoencoder (CVAE) coupled with an HSIC constraint mathematically enforces statistical orthogonality between anatomical and style features. (2) Causal Representation Reinforcement: A Region Causality Module (RCM) explicitly grounds causal features in physical tumor regions. (3) Counterfactual Reasoning: A dual-adversarial strategy actively suppresses the residual Natural Direct Effect (NDE) of the bias, forcing its spatial attention to be mutually exclusive from the causal path. Extensive experiments on the BraTS 2020 dataset demonstrate that CausalDisenSeg significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios. Furthermore, cross-dataset evaluation on BraTS 2023 under the same protocol yields a state-of-the-art macro-average DSC of 84.49.

Keywords: brain tumor segmentation, missing modalities, causal disentanglement, counterfactual reasoning, structural causal model, multimodal MRI, robustness, deep learning
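
The HSIC constraint named in stage (1) is the Hilbert-Schmidt Independence Criterion, a kernel-based dependence measure that is close to zero when two sets of features are independent. The sketch below computes the standard biased empirical estimator, tr(KHLH) / (n - 1)^2, in plain Python. It is an illustration only: the Gaussian kernel, the bandwidth, and the use of scalar samples instead of feature vectors are assumptions, and nothing here is taken from the paper's code.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel over scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_gram(xs, sigma))
    L = gaussian_gram(ys, sigma)
    # tr(H K H L) equals the elementwise sum because L is symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
dependent = hsic(xs, xs)            # y is a copy of x: clearly dependent
constant = hsic(xs, [1.0] * 6)      # y carries no information about x

print(dependent > constant, abs(constant) < 1e-9)
```

Used as a training penalty, a term like this would be minimised so that the two feature groups carry as little shared information as possible; how the paper weights or schedules that term is not stated in the abstract.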

216. ❌ Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

Authors: Yu Wang, Sharon Li Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13403v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

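Scoring rationale: The paper's core contribution is a study of in-context learning mechanisms in multimodal large language models, so it is highly relevant to 'In-context Learning OR Many-shot Learning' (15 points): it systematically analyzes the internal mechanisms and bottlenecks of multimodal ICL. It also involves large language models (10 points), since the study is built on multimodal LLMs. Other keywords, such as MoE, quantization, and alignment, are not mentioned in the abstract and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper investigates why multimodal in-context learning underperforms in few-shot settings. By decomposing the task-mapping process, it shows that visual and textual representations lack reasoning-level alignment, and it proposes an inference-stage enhancement method that improves task-mapping transfer.

Abstract (Chinese translation)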

情境学习（In-context learning, ICL）使模型能够通过推理阶段的示例演示适应新任务。尽管该技术在大型语言模型中取得了成功，但将其扩展到多模态场景时，其内部机制以及与纯文本情境学习的差异仍缺乏深入理解。本研究对多模态大语言模型中的情境学习进行了系统性分析。通过在多种模态间采用相同的任务构建方式，我们发现多模态情境学习在零样本设置下表现与纯文本情境学习相当，但在少样本演示条件下性能显著下降。为探究这一差距，我们将多模态情境学习分解为任务映射构建与任务映射迁移两个阶段，并分析了模型如何建立跨模态任务映射，以及如何在网络层间将其迁移至查询样本。分析表明，当前模型的视觉与文本表征之间缺乏推理层面的对齐，且无法可靠地将习得的任务映射迁移至查询样本。基于这些发现，我们进一步提出了一种简单的推理阶段增强方法，以强化任务映射的迁移过程。研究结果为理解多模态情境学习的机制与局限提供了新视角，并为实现更有效的多模态适应指明了方向。代码已发布于 https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI 。

Abstract

In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available at https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI.

Keywords: In-context Learning, Multimodal Large Language Models, Task Mapping, Cross-modal Alignment, Few-shot Learning, Visual-Textual Representations, Inference Enhancement, Multimodal Adaptation

217. ❌ A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy

Authors: Caiwen Jiang, Yuzhen Ding, Mi Jia, Samir H. Patel, Terence T. Sio, Jonathan B. Ashman, Lisa A. McGee, Jean-Claude M. Rwigema, William G. Rule, Sameer R. Keole, Sujay A. Vora, William W. Wong, Nathan Y. Yu, Michele Y. Halyard, Steven E. Schild, Dinggang Shen, Wei Liu | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13397v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

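Scoring rationale: The paper focuses on medical image registration for proton therapy and proposes a deep learning framework that combines CNNs and a Transformer for deformable registration of longitudinal CT scans. Although it is an application of AI to science (medicine), every keyword other than 'AI for Science OR Bioinformatics OR Cheminformatics' concerns LLM-specific topics such as model techniques, training methods, inference optimization, and agents, whereas the core of this paper is a deep learning model for medical image processing (CNN + Transformer), not LLM research. Only the 'AI for Science' keyword is therefore highly relevant (10 points); all other keywords are unrelated (0 points).

!!! tip deepseek-chat TL;DR

The paper addresses the image registration challenges caused by anatomical changes in proton therapy. It proposes a coarse-to-fine deep learning framework that fuses multimodal clinical information and, on a large proton therapy dataset, achieves faster, more robust, and more clinically meaningful deformable image registration than existing methods.

Abstract (Chinese translation)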

质子治疗虽能显著降低危及器官受照风险，但对解剖结构变化高度敏感，因此在不同时间点的纵向CT扫描间实现精准的形变图像配准至关重要。传统形变配准方法在新兴的在线自适应工作流程中往往速度过慢，而现有基于深度学习的方法主要针对通用基准设计，未能充分利用图像之外的临床相关信息。为弥补这一不足，我们提出一种临床可扩展的从粗到精形变配准框架，该框架整合了质子放疗工作流程中的多模态信息，以适应多样化的临床场景。该模型采用双路基于CNN的编码器进行分层特征提取，并利用基于Transformer的解码器逐步优化形变场。除CT灰度信息外，模型通过解剖结构与风险引导的注意力机制、文本条件特征调制以及前景感知优化等方法，融合了包括靶区与危及器官轮廓、剂量分布和治疗计划文本在内的临床关键先验信息，从而实现聚焦解剖结构且符合临床认知的形变估计。我们在包含多个解剖部位与疾病类型、总计1,222组计划CT与重复CT扫描配对的大规模质子治疗形变配准数据集上评估了所提框架。大量实验表明，该方法相较现有先进技术取得了一致性提升，能够实现快速、稳健且具有临床意义的图像配准。

Abstract

Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.

Keywords: proton therapy, deformable image registration, longitudinal CT, multimodal integration, coarse-to-fine framework, CNN-Transformer architecture, clinical prior incorporation, anatomical change adaptation

218. ❌ UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization

Authors: Jiatao Dai, Wei Dong, Han Zhou, Chengzhou Tang, Jun Chen | Source: arXiv | Published: 2026-04-15 | arXiv link: http://arxiv.org/abs/2604.13383v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

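Scoring rationale: UniBlendNet is a computer vision paper on an image processing task (ambient lighting normalization) and proposes a unified framework that combines global, multi-scale, and region-adaptive modeling. Its core contribution is network architecture design (the UniConvNet-based module, the SAAM module, and mask-guided residual refinement), which belongs to conventional deep learning image restoration. All scoring keywords concern topics such as large-model techniques, training methods, inference optimization, AI agents, and AI for science, none of which the paper touches, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes UniBlendNet, a framework that jointly models global illumination, multi-scale structure, and region-adaptive refinement to address image restoration under complex lighting, and it outperforms the IFBlend baseline on the NTIRE benchmark.

Abstract (Chinese translation)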

环境光照归一化（Ambient Lighting Normalization, ALN）旨在恢复因复杂、空间变化的光照条件而退化的图像。现有方法（如IFBlend）利用频域先验来建模光照变化，但仍存在全局上下文建模能力有限和空间自适应性不足的问题，导致在挑战性区域中恢复效果欠佳。本文提出UniBlendNet，一个用于环境光照归一化的统一框架，能够联合建模全局光照、多尺度结构及区域自适应细化。具体而言，我们通过集成基于UniConvNet的模块来捕获长程依赖关系，从而增强对全局光照的理解。为更好地处理复杂的光照变化，我们引入了尺度感知聚合模块（Scale-Aware Aggregation Module, SAAM），该模块通过动态重加权执行基于金字塔的多尺度特征聚合。此外，我们设计了一种掩码引导的残差细化机制，以实现区域自适应校正，使模型能够选择性地增强退化区域，同时保留曝光良好的区域。这一设计有效提升了复杂光照条件下的光照一致性与结构保真度。在NTIRE环境光照归一化基准上的大量实验表明，UniBlendNet始终优于基线方法IFBlend，并实现了更高的恢复质量，同时产生视觉上更自然、更稳定的恢复结果。

Abstract

Ambient Lighting Normalization (ALN) aims to restore images degraded by complex, spatially varying illumination conditions. Existing methods, such as IFBlend, leverage frequency-domain priors to model illumination variations, but still suffer from limited global context modeling and insufficient spatial adaptivity, leading to suboptimal restoration in challenging regions. In this paper, we propose UniBlendNet, a unified framework for ambient lighting normalization that jointly models global illumination, multi-scale structures, and region-adaptive refinement. Specifically, we enhance global illumination understanding by integrating a UniConvNet-based module to capture long-range dependencies. To better handle complex lighting variations, we introduce a Scale-Aware Aggregation Module (SAAM) that performs pyramid-based multi-scale feature aggregation with dynamic reweighting. Furthermore, we design a mask-guided residual refinement mechanism to enable region-adaptive correction, allowing the model to selectively enhance degraded regions while preserving well-exposed areas. This design effectively improves illumination consistency and structural fidelity under complex lighting conditions. Extensive experiments on the NTIRE Ambient Lighting Normalization benchmark demonstrate that UniBlendNet consistently outperforms the baseline IFBlend and achieves improved restoration quality, while producing visually more natural and stable restoration results.

Keywords: Ambient Lighting Normalization, UniBlendNet, global illumination modeling, multi-scale feature aggregation, region-adaptive refinement, image restoration, NTIRE benchmark, IFBlend

219. ❌ Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

Authors: Vladimir Kalušev, Branko Brkljač, Milan Brkljač | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13345v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

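Scoring rationale: The paper's core is an edge-based multi-agent object detection system on a Raspberry Pi that integrates an Ollama LLM as the natural language interface and reporting agent and uses a Slack chatbot agent for system control. It is therefore highly relevant (10 points) to 'Large Language Models', 'Small Language Models/On-device AI', 'LLM Agents', and 'Multi-agent Systems', which are core components of the system and its architecture. 'Tool Use' is partially relevant (5 points): the system integrates an LLM with a computer vision agent but does not explicitly mention API tool calls. Other keywords, such as MoE, scaling laws, training methods, inference optimization, and AI for science, are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes and prototypes an edge-based multi-agent object detection framework on a Raspberry Pi. By integrating an Ollama LLM and a Slack chatbot as the natural language interface, it achieves real-time object detection and tracking on resource-constrained hardware, and it discusses how the approach compares with solutions that rely on additional cloud-based resources.

Abstract (Chinese translation)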

本文提出了一种基于边缘计算的目标检测系统的设计与原型实现，该系统采用新型人工智能智能体编排范式。研究突破了传统设计方法，通过基于大语言模型的自然语言接口实现系统控制与通信，并实际演示了将所有系统组件集成到单一资源受限硬件平台的过程。该方法基于提出的多智能体目标检测框架，该框架将不同AI智能体紧密整合于提供目标检测与跟踪能力的统一任务中。所提出的设计原则凸显了生成式AI系统转型潜力所特有的快速原型开发方法，该方法在系统开发与实施阶段均得到应用。系统摒弃了专用通信控制接口，转而采用Slack频道聊天机器人智能体与配套的Ollama大语言模型报告智能体，两者均与执行实时目标检测跟踪的专用YOLO计算机视觉智能体共同运行于同一树莓派平台。智能体编排通过专门设计的事件驱动消息交换子系统实现，这为当前基于大语言模型的框架（如近期提出的OpenClaw）中完全自主的智能体编排控制模式提供了替代方案。实验研究为低成本测试平台在设计完全集中式多智能体AI系统时的局限性提供了重要见解。本文还讨论了所提方案与需要额外云端外部资源的解决方案之间的对比差异。

Abstract

The paper presents design and prototype implementation of an edge based object detection system within the new paradigm of AI agents orchestration. It goes beyond traditional design approaches by leveraging on LLM based natural language interface for system control and communication and practically demonstrates integration of all system components into a single resource constrained hardware platform. The method is based on the proposed multi-agent object detection framework which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic for transformational potential of generative AI systems, which are applied during both development and implementation stages. Instead of specialized communication and control interface, the system is made by using Slack channel chatbot agent and accompanying Ollama LLM reporting agent, which are both run locally on the same Raspberry Pi platform, alongside the dedicated YOLO based computer vision agent performing real time object detection and tracking. Agent orchestration is implemented through a specially designed event based message exchange subsystem, which represents an alternative to completely autonomous agent orchestration and control characteristic for contemporary LLM based frameworks like the recently proposed OpenClaw. Conducted experimental investigation provides valuable insights into limitations of the low cost testbed platforms in the design of completely centralized multi-agent AI systems. The paper also discusses comparative differences between presented approach and the solution that would require additional cloud based external resources.

Keywords: multi-agent system, object detection, edge computing, Raspberry Pi, LLM interface, Ollama, Slack chatbot, agent orchestration
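
The event-based message exchange described in the abstract above can be pictured with a small publish/subscribe sketch. This is not the authors' implementation: the `EventBus` class, the topic names `detection` and `report`, and the handler functions are hypothetical stand-ins for the YOLO, Ollama, and Slack agents, and no real Slack or Ollama API is called.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, DefaultDict, Deque, Dict, List


@dataclass
class Event:
    topic: str
    payload: Dict[str, object]


Handler = Callable[[Event], None]


class EventBus:
    """Minimal in-process event queue; agents only react to published events."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self._queue: Deque[Event] = deque()

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, event: Event) -> None:
        self._queue.append(event)

    def run(self) -> int:
        """Deliver queued events until the queue is empty; return delivery count."""
        delivered = 0
        while self._queue:
            event = self._queue.popleft()
            for handler in list(self._handlers.get(event.topic, [])):
                handler(event)
                delivered += 1
        return delivered


def make_reporting_agent(bus: EventBus) -> Handler:
    # Hypothetical stand-in for the LLM reporting agent: turns a detection
    # event into a short text report and publishes it.
    def on_detection(event: Event) -> None:
        text = f"Detected {event.payload['count']} x {event.payload['label']}"
        bus.publish(Event("report", {"text": text}))
    return on_detection


def make_chat_agent(outbox: List[str]) -> Handler:
    # Hypothetical stand-in for the chatbot agent: collects messages that a
    # real system would post to a chat channel.
    def on_report(event: Event) -> None:
        outbox.append(str(event.payload["text"]))
    return on_report


def demo() -> List[str]:
    bus = EventBus()
    outbox: List[str] = []
    bus.subscribe("detection", make_reporting_agent(bus))
    bus.subscribe("report", make_chat_agent(outbox))
    # Hypothetical vision agent output.
    bus.publish(Event("detection", {"label": "person", "count": 2}))
    bus.run()
    return outbox
```

In this arrangement the control flow is fixed by the subscriptions rather than decided by an LLM at run time, which is one way to read the contrast the abstract draws with fully autonomous agent orchestration.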

220. ❌ MSGS: Multispectral 3D Gaussian Splatting

Authors: Iris Zheng, Guojun Tang, Alexander Doronin, Paul Teal, Fang-Lue Zhang | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13340v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

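Scoring rationale: The paper focuses on 3D reconstruction and rendering in computer vision and graphics (a multispectral extension of 3D Gaussian Splatting), whereas all scoring keywords concern large language models (LLMs) and related techniques such as training, alignment, inference, and applications. The paper does not involve language models, deep learning principles, or specific AI for Science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a multispectral 3D Gaussian Splatting method that adds spectral radiance to each Gaussian and optimizes it under a dual-loss supervision scheme. While retaining real-time efficiency, it improves rendering quality and spectral consistency, notably in challenging scenes with translucent materials and anisotropic reflections.

Abstract (Chinese translation)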

我们提出了一种面向波长感知视图合成的三维高斯泼溅（3DGS）多光谱扩展方法。每个高斯单元均通过各波段球谐函数表征的光谱辐射度进行增强，并在结合RGB与多光谱信号的双重损失监督方案下进行优化。为提升渲染保真度，我们在像素级执行光谱到RGB的转换，使得优化过程中能保留更丰富的光谱信息。该方法在公开数据集与自主采集的真实场景数据集上均进行了评估，结果表明其在图像质量与光谱一致性方面较仅使用RGB的3DGS基线模型均有持续改进。值得注意的是，该方法在处理涉及半透明材料与各向异性反射的复杂场景时表现优异。所提出的方法在保持3DGS紧凑性与实时效率的同时，为未来与基于物理的着色模型集成奠定了基础。

Abstract

We present a multispectral extension to 3D Gaussian Splatting (3DGS) for wavelength-aware view synthesis. Each Gaussian is augmented with spectral radiance, represented via per-band spherical harmonics, and optimized under a dual-loss supervision scheme combining RGB and multispectral signals. To improve rendering fidelity, we perform spectral-to-RGB conversion at the pixel level, allowing richer spectral cues to be retained during optimization. Our method is evaluated on both public and self-captured real-world datasets, demonstrating consistent improvements over the RGB-only 3DGS baseline in terms of image quality and spectral consistency. Notably, it excels in challenging scenes involving translucent materials and anisotropic reflections. The proposed approach maintains the compactness and real-time efficiency of 3DGS while laying the foundation for future integration with physically based shading models.

Keywords: 3D Gaussian Splatting, multispectral, view synthesis, spectral radiance, spherical harmonics, dual-loss supervision, real-time rendering, translucent materials

221. ❌ SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

Authors: Farzaneh Jafari, Stefano Berretti, Anup Basu | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13335v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

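Scoring rationale: SEDTalker focuses on speech-driven 3D facial animation and uses frame-level speech emotion diarization for fine-grained expression control. The work covers speech emotion recognition, 3D animation generation, and a hybrid Transformer-Mamba architecture, and is applied research in computer vision, speech processing, and multimedia. All scoring keywords relate directly to large models, deep learning principles, or AI applications in science, whereas this paper does not involve any large-model techniques, deep learning innovations, or specific AI for Science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SEDTalker, a 3D facial animation framework based on frame-level speech emotion diarization. Using a hybrid Transformer-Mamba architecture, it generates speech-driven facial expressions with controllable emotion, and it is evaluated on a multi-corpus speech emotion dataset and on EmoVOCA.

Abstract (Chinese translation)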

本文提出SEDTalker，一种面向语音驱动三维面部动画的情感感知框架，其利用帧级语音情感分割技术实现细粒度的表情控制。与以往依赖语句级或手动指定情感标签的方法不同，本方法直接从语音中预测时序密集的情感类别与强度，从而实现对面部表情的连续时序调制。经分割的情感信号被编码为可学习的嵌入向量，并用于条件化一个基于混合Transformer-Mamba架构的语音驱动三维动画模型。该设计在保持身份特征与时序连贯性的同时，有效解耦了语言内容与情感风格。我们在大规模多语料库数据集上评估了语音情感分割性能，并在EmoVOCA数据集上评估了情感三维面部动画效果。定量结果表明，本方法在帧级情感识别任务上表现优异，且几何重建误差与时序重建误差较低；定性分析则显示出平滑的情感过渡与一致的表情控制能力。这些发现凸显了帧级情感分割技术在生成富有表现力且可控的三维说话人头像方面的有效性。

Abstract

We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diarization for expressive and controllable 3D talking head generation.

Keywords: 3D facial animation, speech emotion diarization, frame-level emotion, Transformer-Mamba architecture, speech-driven animation, emotional expression control, temporal coherence, hybrid architecture

222. ❌ SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting

Authors: Iris Zheng, Guojun Tang, Alexander Doronin, Paul Teal, Fang-Lue Zhang | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13333v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

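Scoring rationale: SSD-GS is a computer vision and graphics paper on a physically based relighting framework built on 3D Gaussian Splatting, covering light-material interaction, scattering, and shadow decomposition. All scoring keywords concern topics such as large models, deep learning principles, and AI for Science, none of which the paper addresses, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes SSD-GS, a framework that decomposes reflectance into four components (diffuse, specular, shadow, and subsurface scattering) to address the limited physical fidelity and interpretability of existing 3D Gaussian Splatting relighting methods, achieving high-quality reconstruction and photorealistic relighting.

Abstract (Chinese translation)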

本文提出SSD-GS，一个基于物理的重新光照框架，该框架建立在3D高斯泼溅（3D Gaussian Splatting, 3DGS）技术之上，能够在新颖光照条件下实现高质量重建与照片级真实的重新光照效果。在基于物理的重新光照中，精确建模光与材质的相互作用对于忠实再现外观至关重要。然而，现有基于3DGS的重新光照方法采用粗略的着色分解，要么仅建模漫反射和镜面反射，要么依赖神经网络近似阴影与散射效应。这导致保真度有限且物理可解释性较差，尤其对于各向异性金属和半透明材质。为克服这些局限，SSD-GS将反射分解为四个分量：漫反射、镜面反射、阴影及次表面散射。我们引入了基于可学习偶极子的散射模块用于次表面传输，一种结合可见性估计与优化网络的遮挡感知阴影公式，以及采用基于各向异性菲涅尔模型的增强镜面反射分量。通过在训练过程中逐步整合所有分量，SSD-GS有效解耦了光照与材质属性，即使对于未见过的光照条件亦能如此，这在具有挑战性的OLAT数据集上得到了验证。实验表明，相较于现有方法，SSD-GS在定量指标与视觉感知上均展现出更优的重新光照质量，并为下游任务（包括可控光源编辑与交互式场景重新光照）奠定了基础。源代码发布于：https://github.com/irisfreesiri/SSD-GS。

Abstract

We present SSD-GS, a physically-based relighting framework built upon 3D Gaussian Splatting (3DGS) that achieves high-quality reconstruction and photorealistic relighting under novel lighting conditions. In physically-based relighting, accurately modeling light-material interactions is essential for faithful appearance reproduction. However, existing 3DGS-based relighting methods adopt coarse shading decompositions, either modeling only diffuse and specular reflections or relying on neural networks to approximate shadows and scattering. This leads to limited fidelity and poor physical interpretability, particularly for anisotropic metals and translucent materials. To address these limitations, SSD-GS decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. We introduce a learnable dipole-based scattering module for subsurface transport, an occlusion-aware shadow formulation that integrates visibility estimates with a refinement network, and an enhanced specular component with an anisotropic Fresnel-based model. Through progressive integration of all components during training, SSD-GS effectively disentangles lighting and material properties, even for unseen illumination conditions, as demonstrated on the challenging OLAT dataset. Experiments demonstrate superior quantitative and perceptual relighting quality compared to prior methods and pave the way for downstream tasks, including controllable light source editing and interactive scene relighting. The source code is available at: https://github.com/irisfreesiri/SSD-GS.

Keywords: 3D Gaussian Splatting, physically-based relighting, shadow decomposition, subsurface scattering, anisotropic materials, light-material interaction, OLAT dataset, photorealistic rendering

223. ❌ Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

Authors: Akshit Achara, Yovin Yathathugoda, Nick Byrne, Michela Antonelli, Esther Puyol Anton, Alexander Hammers, Andrew P. King | Source: arXiv | Published: 2026-04-14 | arXiv link: http://arxiv.org/abs/2604.13326v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

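Scoring rationale: The paper studies the robustness of semantic segmentation models under correlation shift, specifically the semantic label-flip phenomenon. It focuses on semantic segmentation in computer vision and on robustness, distribution shift, and error analysis. Every scoring keyword targets large models, their underlying techniques, or AI-for-science applications, and the paper touches none of these: it does not discuss large language models, model architectures, training techniques, inference methods, alignment, parameter-efficient fine-tuning, retrieval augmentation, context extension, attention optimization, chain-of-thought reasoning, System 2 thinking, Monte Carlo tree search, self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, explainable AI, world models, model merging, in-context learning, or AI for science. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies semantic label flips that arise in segmentation models when the training data contains spurious correlations, proposes a diagnostic that quantifies these errors, and develops an entropy-based, label-free flip-risk score for identifying flip-prone cases.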

Abstract (Chinese translation)

机器学习模型的鲁棒性可能因输入数据中的非因果特征与目标标签之间的伪相关性而受损。测试此类相关性的常用方法是在标签与某些非因果线索强关联的数据上进行训练，然后在关联失效的样本上进行评估。这一思路在分类任务中已得到广泛验证，但对于语义分割任务，其具体的失效模式尚未被充分理解。我们证明，即使物体边界基本正确，模型也可能在分配错误语义标签的情况下实现合理的重叠度，即将一种合理的前景类别误判为另一种。我们聚焦于这种语义标签翻转行为，并通过一种简单的诊断指标（Flip）对其进行量化：该指标统计真实前景像素被错误分配为其他前景类别但仍被预测为前景的频率。在训练阶段类别与场景存在相关性的设定下，增强相关性会持续扩大常见测试条件与罕见测试条件之间的性能差距，并增加反事实组中物体内部的标签翻转现象。总体而言，我们的研究结果表明，在分布偏移下评估分割模型的鲁棒性时，需超越重叠度指标，将前景误差分解为正确像素、翻转标签像素和漏检至背景的像素。我们还提出一种基于信息熵、无需真实标签的“翻转风险”评分，该评分通过前景类别不确定性计算，并证明其能在推理阶段识别易发生翻转的案例。代码发布于 https://github.com/acharaakshit/label-flips。

Abstract

The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free 'flip-risk' score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.
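
The foreground error decomposition described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of class `0` as background, and in particular the exact form of the flip-risk score (normalized entropy over renormalized foreground class probabilities, averaged over predicted-foreground pixels) are assumptions made here for illustration. The abstract only states that the score is entropy-based and computed from foreground identity uncertainty.

```python
import numpy as np

def foreground_error_decomposition(gt, pred, background=0):
    """Split ground-truth foreground pixels into correct / flip / missed fractions.

    gt, pred: integer label maps of the same shape (hypothetical inputs).
    """
    gt = np.asarray(gt)
    pred = np.asarray(pred)
    fg = gt != background                      # ground-truth foreground pixels
    n_fg = int(fg.sum())
    if n_fg == 0:
        return {"correct": 0.0, "flip": 0.0, "missed": 0.0}
    correct = fg & (pred == gt)                # right region, right label
    missed = fg & (pred == background)         # foreground lost to background
    flipped = fg & (pred != background) & (pred != gt)  # right region, wrong label
    return {
        "correct": correct.sum() / n_fg,
        "flip": flipped.sum() / n_fg,
        "missed": missed.sum() / n_fg,
    }

def flip_risk(probs, background=0, eps=1e-12):
    """Assumed label-free score: mean normalized entropy of foreground identity.

    probs: array of shape (C, H, W) with per-pixel class probabilities and at
    least two foreground classes.
    """
    probs = np.asarray(probs, dtype=float)
    fg_classes = [c for c in range(probs.shape[0]) if c != background]
    fg_probs = probs[fg_classes]
    pred_fg = probs.argmax(axis=0) != background
    if not pred_fg.any():
        return 0.0
    # Renormalize over foreground classes so only identity uncertainty remains.
    q = fg_probs / np.maximum(fg_probs.sum(axis=0), eps)
    entropy = -(q * np.log(q + eps)).sum(axis=0) / np.log(len(fg_classes))
    return float(entropy[pred_fg].mean())
```

The three fractions sum to one over ground-truth foreground pixels, so a model with high overlap can still show a large `flip` share when it segments the right region under the wrong class.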

Keywords: semantic segmentation, robustness, distribution shift, spurious correlations, label flips, foreground errors, entropy-based score, flip-risk

224. ❌ Towards Successful Implementation of Automated Raveling Detection: Effects of Training Data Size, Illumination Difference, and Spatial Shift

Authors: Xinan Zhang, Haolin Wang, Zhongyu Yang, Yi-Chang Tsai Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13322v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

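Scoring rationale: The paper focuses on asphalt pavement raveling detection with machine learning and deep learning, studies how training data size, illumination difference, and spatial shift affect model robustness, and proposes the RavelingArena benchmark. It is unrelated to all scoring keywords, which target specific techniques such as large language models, MoE, quantization, alignment, inference acceleration, and AI for Science (bioinformatics and cheminformatics). This work applies conventional computer vision and deep learning to civil engineering and involves no large models or any technique named in the keywords.

!!! tip deepseek-chat TL;DR

By analyzing how training data size, illumination difference, and spatial shift affect model robustness, the study proposes the RavelingArena benchmark for evaluating asphalt pavement raveling detection models. Experiments show that increasing the quantity and diversity of training data yields an accuracy gain of at least 9.2%, and a real-world case study shows improved year-to-year consistency.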

Abstract (Chinese translation)

集料剥落是沥青路面表层病害的主要形式，尤其在高速公路上更为常见。尽管研究表明，基于机器学习和深度学习的方法通过对距离图像进行分类，在剥落检测方面取得了良好效果，但在大规模实际部署中，当推理数据来源更多样化（可能来自不同作业过程、传感器或环境条件）时，其性能往往会下降。这种性能退化凸显了在实际应用中需要更具泛化性和鲁棒性的解决方案。因此，本研究的目标是：1）识别并评估影响模型鲁棒性的潜在变异因素，如训练数据量、光照差异和空间偏移；2）利用研究结果提升模型在真实环境条件下的鲁棒性。为此，我们提出了RavelingArena基准测试平台，旨在评估剥落检测模型对各类变异的鲁棒性。该平台并非通过收集大量新数据构建，而是通过对现有数据集进行多样化、受控的增强来建立，从而支持通过变异控制实验量化每种变异的影响。实验结果表明，训练数据的数量与多样性对模型准确性至关重要，在最多样化的实验条件下，模型准确性至少提升了9.2%。此外，通过在美国佐治亚州一处多年试验路段应用这些发现的案例研究表明，模型在跨年度检测中的一致性得到显著改善，为未来开展时变劣化建模研究奠定了基础。这些见解为在剥落检测及其他需要适应多样化条件的实际任务中实现更可靠的模型部署提供了指导。

Abstract

Raveling, the loss of aggregates, is a major form of asphalt pavement surface distress, especially on highways. While research has shown that machine learning and deep learning-based methods yield promising results for raveling detection by classification on range images, their performance often degrades in large-scale deployments where more diverse inference data may originate from different runs, sensors, and environmental conditions. This degradation highlights the need for a more generalizable and robust solution for real-world implementation. Thus, the objectives of this study are to 1) identify and assess potential variations that impact model robustness, such as the quantity of training data, illumination difference, and spatial shift; and 2) leverage findings to enhance model robustness under real-world conditions. To this end, we propose RavelingArena, a benchmark designed to evaluate model robustness to variations in raveling detection. Instead of collecting extensive new data, it is built by augmenting an existing dataset with diverse, controlled variations, thereby enabling variation-controlled experiments to quantify the impact of each variation. Results demonstrate that both the quantity and diversity of training data are critical to the accuracy of models, achieving at least a 9.2% gain in accuracy under the most diverse conditions in experiments. Additionally, a case study applying these findings to a multi-year test section in Georgia, U.S., shows significant improvements in year-to-year consistency, laying foundations for future studies on temporal deterioration modeling. These insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks that require adaptability to diverse conditions.

Keywords: raveling detection, asphalt pavement, model robustness, training data diversity, RavelingArena benchmark, deep learning, computer vision, real-world deployment

225. ❌ The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform

Authors: Akshit Gupta, Joris Timmermans, Filip Biljecki, Remko Uijlenhoet Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13315v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

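Scoring rationale: The paper introduces a multi-spectral street-view imagery dataset (Spectrascapes) aimed at urban environmental monitoring, climate-resilient cities, and remote sensing. It covers data collection, calibration, quality control, and downstream use cases, but does not involve large language models, deep learning fundamentals, AI for Science, or any other technique named in the scoring keywords. No keyword is relevant to the paper's topic, so all scores are 0.

!!! tip deepseek-chat TL;DR

The paper presents Spectrascapes, a new multi-spectral street-view imagery dataset designed to overcome the limitations of existing urban monitoring methods, and demonstrates its potential for machine learning and urban planning.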

Abstract (Chinese translation)

获取高时空分辨率数据对于建设气候适应性城市至关重要。当前用于监测城市参数的数据集主要通过人工巡检、嵌入式传感、遥感或标准街景影像（RGB）构建。这些方法与数据集分别受限于可扩展性差、时空分辨率不一致、俯视视角或光谱信息不足等问题。本文提出一种创新方法及其开源实现：一套规避上述局限的多光谱地面视角数据集。该数据集包含17,718张街道层级多光谱图像，通过搭载于自行车上的RGB、近红外与热成像传感器采集，覆盖荷兰多样化的城市形态（乡村、城镇、小城市及大型都市区）。研究严格注重数据校准与质量管控，同时详细公开数据采集方法（包括硬件与软件细节）。据我们所知，“光谱景观”（Spectrascapes）是首个此类开源数据集。最后，我们展示了基于该数据集的两个下游应用案例，并提出了在机器学习、城市规划与遥感领域的潜在研究方向。

Abstract

High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street level multi-spectral images captured with RGB, Near-infrared, and Thermal imaging sensors on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality while also providing the details of our data collection methodology (including the hardware and software details). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled using this dataset and provide potential research directions in the machine learning, urban planning and remote sensing domains.

Keywords: multi-spectral imagery, street-view dataset, urban monitoring, climate resilient cities, remote sensing, data calibration, machine learning applications, urban planning

226. ❌ Why MLLMs Struggle to Determine Object Orientations

Authors: Anju Gopinath, Nikhil Krishnaswamy, Bruce Draper Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13321v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

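Scoring rationale: The paper investigates why multimodal large language models (MLLMs) fail at 2D object orientation reasoning, focusing on whether visual encoders (such as CLIP, SigLIP, and ViT) preserve orientation information. It deals directly with LLMs and MLLMs, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10). Its analysis of reasoning failures is somewhat related to 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' and 'System 2 Thinking OR Slow Thinking OR In-depth Reasoning' (5 each), on the view that orientation reasoning is a form of multi-step reasoning. Its experiments probe what can be read out of encoder representations, which relates to 'Mechanistic Interpretability OR Explainable AI' (5). Other keywords, such as MoE, SLMs, training methods, alignment, RAG, compression, and agents, are neither mentioned nor relevant and receive 0.

!!! tip deepseek-chat TL;DR

The paper finds that MLLM failures on 2D object orientation reasoning are not due to visual encoders (such as CLIP and SigLIP) lacking orientation information: linear regressors can accurately predict orientation from encoder embeddings. The orientation information is, however, spread across a large number of features, which may affect how well the model can use it.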

Abstract (Chinese translation)

先前研究指出，多模态大语言模型（MLLMs）在处理需要推理图像中二维物体方向的任务时存在困难。Tong等人与Nichols等人推测，这些失败源于视觉编码器，因为常用的编码器（如CLIP和SigLIP）是为图像-文本语义对齐而非几何推理任务而训练的。我们设计了一套受控实验方案来检验这一观点，通过测量能否从编码器表征中恢复旋转信息。具体而言，我们分别使用完整图像检测了LLaVA OneVision模型中的SigLIP特征与Qwen2.5-VL-7B-Instruct模型中的ViT特征，并在LLaVA 1.5和1.6模型中，将旋转后的前景图像块置于自然背景图像上，检验了CLIP表征。我们的原假设是方向信息未保留在编码器嵌入中，并通过训练线性回归器从编码特征预测物体方向来检验该假设。与假设相反，我们发现方向信息可从编码器表征中恢复：简单的线性模型能准确从嵌入中预测物体方向。这一结果反驳了“MLLMs的方向识别失败源于视觉编码器”的普遍假设。
在否定了“MLLMs因视觉编码器限制而难以处理二维方向任务”这一公认假设后，我们仍不清楚其失败原因。尽管完整解释超出本文范围，但我们证明：虽然方向信息确实存在，但它分散在数万个特征中。这可能是MLLMs未能有效利用可用方向信息的原因，但具体机制仍有待探究。

Abstract

Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings, and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don't know why they fail. Although a full explanation is beyond the scope of this paper, we show that although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be why MLLMs fail to exploit the available orientation information.
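
The probing idea in the abstract, training a linear regressor to recover orientation from encoder features, can be sketched on synthetic data. Everything below is an assumption for illustration rather than the paper's protocol: the embeddings are simulated (the angle is mixed diffusely into 256 noisy features instead of being read from CLIP or SigLIP), the probe predicts `(cos θ, sin θ)` so the target is continuous across the ±180° wrap, and all function names are hypothetical.

```python
import numpy as np

def fit_linear_probe(embeddings, angles_rad):
    """Least-squares linear map from embeddings (plus bias) to (cos, sin) of the angle."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    Y = np.column_stack([np.cos(angles_rad), np.sin(angles_rad)])
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict_angles(W, embeddings):
    """Apply the probe and convert predicted (cos, sin) back to an angle in radians."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    Y = X @ W
    return np.arctan2(Y[:, 1], Y[:, 0])

def mean_angular_error_deg(pred, true):
    """Mean absolute angular difference in degrees, with wrap-around handled."""
    diff = (pred - true + np.pi) % (2 * np.pi) - np.pi
    return float(np.degrees(np.abs(diff)).mean())

def synthetic_embeddings(angles_rad, mixing, noise_std, rng):
    """Stand-in for encoder features: orientation spread thinly over many noisy dimensions."""
    Z = np.column_stack([np.cos(angles_rad), np.sin(angles_rad)])
    noise = noise_std * rng.standard_normal((len(angles_rad), mixing.shape[1]))
    return Z @ mixing + noise

rng = np.random.default_rng(0)
dim = 256
mixing = rng.standard_normal((2, dim))          # hypothetical diffuse encoding
train_angles = rng.uniform(-np.pi, np.pi, 600)
test_angles = rng.uniform(-np.pi, np.pi, 200)

W = fit_linear_probe(synthetic_embeddings(train_angles, mixing, 0.1, rng), train_angles)
pred = predict_angles(W, synthetic_embeddings(test_angles, mixing, 0.1, rng))
error_deg = mean_angular_error_deg(pred, test_angles)
print(f"held-out mean angular error: {error_deg:.2f} degrees")
```

A low held-out error from such a probe shows only that the information is linearly recoverable from the features. It says nothing about whether the language model downstream makes use of it, which is the gap the abstract highlights.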

Keywords: Multimodal Large Language Models, MLLMs, visual encoder, object orientation, 2D reasoning, CLIP, SigLIP, linear regression

227. ❌ Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering

Authors: Vutichart Buranasiri, James M. Murphy Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13307v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

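Scoring rationale: The paper focuses on unsupervised clustering of hyperspectral images (HSI) and proposes an algorithm (DS²DL) that combines a masked autoencoder (UMAE) with diffusion learning. Its core contribution is methodological, within computer vision and image processing, involving unsupervised learning, representation learning, superpixel segmentation, and diffusion graph construction. All 27 keywords target either large language models (LLMs) and related techniques (training, alignment, inference optimization, agents, and so on) or specific scientific AI applications (such as bioinformatics). The paper involves no LLM, NLP, or dialogue-system content, and does not clearly belong to bioinformatics or cheminformatics (subfields of AI for Science). Accordingly, only 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (some relevance) as a scientific AI application in the broad sense; the other 26 keywords are entirely unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Deep Spatially-Regularized Superpixel-based Diffusion Learning (DS²DL), a framework for unsupervised hyperspectral image clustering that learns a denoised latent representation with a masked autoencoder and combines it with diffusion graph clustering, improving clustering quality and labeling accuracy on the Botswana and KSC datasets.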

Abstract (Chinese translation)

本文提出了一种融合掩码深度表征学习与扩散聚类的高光谱图像无监督聚类框架，该框架扩展了基于空间正则化超像素的扩散学习算法（Spatially-Regularized Superpixel-based Diffusion Learning, $S^2DL$）。首先，通过以视觉变换器（Vision Transformer）为骨干的无监督掩码自编码器模型（unsupervised masked autoencoder, UMAE）学习原始高光谱图像的降噪潜在表征。UMAE 综合考虑了空间上下文与长程光谱相关性，并采用掩码机制进行高效预训练，该过程仅需使用一小部分训练像素。随后，利用熵率超像素分割算法（entropy rate superpixel, ERS）将图像分割为超像素，并在压缩后的潜在空间（而非原始高光谱图像空间）中，结合欧氏距离与扩散距离构建空间正则化扩散图。所提出的算法——深度空间正则化超像素扩散学习（Deep Spatially-Regularized Superpixel-based Diffusion Learning, $DS^2DL$）利用了更可靠的扩散距离及后续扩散图构建方式，能更好地反映底层数据流形的内在几何结构，从而提升了标注精度与聚类质量。在博茨瓦纳（Botswana）与肯尼迪航天中心（KSC）数据集上的实验验证了 $DS^2DL$ 的有效性。

Abstract

An unsupervised framework for hyperspectral image (HSI) clustering is proposed that incorporates masked deep representation learning with diffusion-based clustering, extending the Spatially-Regularized Superpixel-based Diffusion Learning ($S^2DL$) algorithm. Initially, a denoised latent representation of the original HSI is learned via an unsupervised masked autoencoder (UMAE) model with a Vision Transformer backbone. The UMAE takes spatial context and long-range spectral correlations into account and incorporates an efficient pretraining process via masking that utilizes only a small subset of training pixels. In the next stage, the entropy rate superpixel (ERS) algorithm is used to segment the image into superpixels, and a spatially regularized diffusion graph is constructed using Euclidean and diffusion distances within the compressed latent space instead of the HSI space. The proposed algorithm, Deep Spatially-Regularized Superpixel-based Diffusion Learning ($DS^2DL$), leverages more faithful diffusion distances and subsequent diffusion graph construction that better reflect the intrinsic geometry of the underlying data manifold, improving labeling accuracy and clustering quality. Experiments on Botswana and KSC datasets demonstrate the efficacy of $DS^2DL$.
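
Diffusion distances, which the abstract uses to build the clustering graph, can be illustrated with a generic sketch. This is not the $DS^2DL$ pipeline: it omits the UMAE encoder, the ERS superpixels, and the spatial regularization, and the Gaussian kernel, the bandwidth `sigma`, and the diffusion time `t` are assumptions chosen for illustration. The function computes $D_t(i,j)^2 = \sum_k \lambda_k^{2t} (\psi_k(i) - \psi_k(j))^2$ from the spectrum of the random-walk matrix on the kernel graph.

```python
import numpy as np

def diffusion_distances(X, sigma=1.0, t=3):
    """Pairwise diffusion distances at integer time t on a Gaussian-kernel graph.

    X: (n, d) array of feature vectors (for example, latent codes).
    """
    X = np.asarray(X, dtype=float)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))   # affinity matrix
    d = K.sum(axis=1)                            # node degrees
    pi = d / d.sum()                             # stationary distribution of P = D^-1 K
    S = K / np.sqrt(np.outer(d, d))              # symmetric matrix with the same spectrum as P
    lam, U = np.linalg.eigh(S)
    psi = U / np.sqrt(pi)[:, None]               # right eigenvectors of P, normalized w.r.t. pi
    coords = psi * (lam ** t)[None, :]           # diffusion-map coordinates at time t
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Points joined by many short paths through the graph end up close in diffusion distance even when their Euclidean distance is moderate, which is why the distance tends to follow the data manifold rather than the ambient space.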

Keywords: unsupervised hyperspectral image clustering, masked autoencoder, vision transformer, superpixel segmentation, diffusion graph, latent representation learning, spatial regularization, entropy rate superpixel

228. ❌ Bias at the End of the Score

Authors: Salma Abdel Magid, Grace Guo, Esin Tureci, Amaya Dharmasiri, Vikram V. Ramaswamy, Hanspeter Pfister, Olga Russakovsky Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13305v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	5.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

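Scoring rationale: The paper studies bias in reward models (RMs) used in text-to-image generation systems, mainly covering model evaluation, fairness, and bias mitigation. Relevance to the keywords is as follows: 1) 'Instruction Tuning OR Alignment OR Value Alignment' (5): the paper discusses how reward models encode human preferences (a form of alignment) and the resulting bias; this is related but not central. 2) 'RLHF OR RLAIF OR Direct Preference Optimization OR DPO' (5): reward models are commonly used in alignment methods such as RLHF, and the paper studies the effect of their biases; this is related, but the paper does not go into the technical details. 3) 'Hallucination Mitigation OR Factuality OR Truthfulness' (5): the paper examines the reliability of reward models as quality metrics, which touches on factuality and truthfulness evaluation. 4) 'Mechanistic Interpretability OR Explainable AI' (5): the paper's audit reveals the bias mechanisms of reward models, which involves interpretability analysis. The remaining keywords (such as LLMs, MoE, and Scaling Laws) are unrelated to the paper's focus on text-to-image generation and reward models and receive 0.

!!! tip deepseek-chat TL;DR

Through a large-scale audit, the paper finds that reward models in text-to-image generation systems encode demographic biases, which cause optimization to amplify gender/racial stereotypes and reduce diversity, challenging their reliability as quality metrics.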

Abstract (Chinese translation)

奖励模型（Reward Models, RMs）本质上是非中立的价值函数，其设计与训练旨在编码特定目标，例如人类偏好或图文对齐。在文生图（Text-to-Image, T2I）生成系统中，奖励模型已成为关键组成部分，被应用于多个阶段：包括数据集过滤、作为评估指标、在参数优化过程中提供监督信号，以及对T2I输出进行生成后的安全与质量过滤。尽管将奖励模型整合进T2I流程中的具体问题（如奖励黑客攻击或模式崩溃）已得到研究，但其作为评分函数的鲁棒性与公平性在很大程度上仍属未知。我们针对T2I模型训练与生成过程中的人口统计学偏见，对奖励模型的鲁棒性进行了大规模审计。我们提供了定量与定性证据，表明尽管奖励模型最初是作为质量度量工具开发的，但其编码了人口统计学偏见，这导致奖励引导的优化过程会不成比例地将女性图像主体性化、强化性别/种族刻板印象，并导致人口多样性坍缩。这些发现凸显了当前奖励模型的缺陷，对其作为质量度量指标的可靠性提出了质疑，并强调需要改进数据收集与训练流程，以实现更稳健的评分。

Abstract

Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used at various stages for dataset filtering, as evaluation metrics, as a supervisory signal during optimization of parameters, and for post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse), their robustness and fairness as scoring functions remain largely unknown. We conduct a large-scale audit of RM robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to disproportionately sexualize female image subjects, reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures to enable more robust scoring.

Keywords: Reward Models, Text-to-Image Generation, Demographic Bias, Fairness, Model Robustness, Human Preferences, Evaluation Metrics, Stereotypes

229. ❌ Complex Interpolation of Matrices with an application to Multi-Manifold Learning

Authors: Adi Arbel, Stefan Steinerberger, Ronen Talmon Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14118v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

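Scoring rationale: The paper studies the spectral properties of matrix interpolation and their application to multi-manifold learning. It belongs to pure mathematics and machine learning theory and has no direct connection to any scoring keyword (all of which concern large models, deep learning techniques, or AI-for-science applications), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the spectral properties of the interpolation $A^{1-x} B^x$ of symmetric positive-definite matrices, establishes an equivalence between log-linearity of the operator norm and the existence of a shared eigenvector, and builds on this a multi-manifold learning framework that identifies common and distinct latent structures in multiview data.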

Abstract (Chinese translation)

给定两个对称正定矩阵 $A, B \in \mathbb{R}^{n \times n}$，我们研究插值 $A^{1-x} B^x$（其中 $0 \leq x \leq 1$）的谱性质。通过这一插值视角，可以探究 $A$ 与 $B$ 中存在的“共同结构”，即指向相似方向的特征向量。一般而言，算子范数 $\|A^{1-x} B^x\|$ 的精确对数线性等价于原始矩阵存在共享特征向量；稳定性分析表明，近似的对数线性会迫使主奇异向量与两个矩阵的主导特征向量对齐。这些结果催生了一个多流形学习框架，并为该框架提供了理论依据，该框架旨在识别多视图数据中共有的和独特的潜在结构。

Abstract

Given two symmetric positive-definite matrices $A, B \in \mathbb{R}^{n \times n}$, we study the spectral properties of the interpolation $A^{1-x} B^x$ for $0 \leq x \leq 1$. The presence of 'common structures' in $A$ and $B$, eigenvectors pointing in a similar direction, can be investigated using this interpolation perspective. Generically, exact log-linearity of the operator norm $\|A^{1-x} B^x\|$ is equivalent to the existence of a shared eigenvector in the original matrices; stability bounds show that approximate log-linearity forces principal singular vectors to align with leading eigenvectors of both matrices. These results give rise to and provide theoretical justification for a multi-manifold learning framework that identifies common and distinct latent structures in multiview data.
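
A small numerical check can make the log-linearity statement concrete. The sketch below is an illustration under assumptions, not the paper's code: the helper names and the two example matrix pairs are invented here. It measures how far $x \mapsto \log \|A^{1-x} B^x\|$ deviates from the straight line joining its endpoints. When $A$ and $B$ share their leading eigenvector $v$, the norm equals $\lambda_1(A)^{1-x} \lambda_1(B)^x$ exactly (it is bounded above by $\|A^{1-x}\| \|B^x\|$ and attained on $v$), so the gap vanishes; for a pair with no shared eigenvector the curve falls below the chord.

```python
import numpy as np

def spd_power(M, p):
    """Real power of a symmetric positive-definite matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w ** p) @ V.T

def log_norm_curve(A, B, xs):
    """log of the spectral norm of A^(1-x) B^x for each x in xs."""
    return np.array([
        np.log(np.linalg.norm(spd_power(A, 1.0 - x) @ spd_power(B, x), 2))
        for x in xs
    ])

def log_linearity_gap(A, B, num=21):
    """Largest deviation of the log-norm curve from the chord between x=0 and x=1."""
    xs = np.linspace(0.0, 1.0, num)
    curve = log_norm_curve(A, B, xs)
    chord = (1.0 - xs) * curve[0] + xs * curve[-1]
    return float(np.max(np.abs(chord - curve)))

# Pair 1: e1 is the leading eigenvector of both matrices; the others differ.
c, s = np.cos(0.7), np.sin(0.7)
R3 = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
A_shared = np.diag([5.0, 2.0, 1.0])
B_shared = R3 @ np.diag([3.0, 2.0, 0.5]) @ R3.T

# Pair 2: B is A rotated by 45 degrees, so no eigenvector is shared.
c45 = s45 = np.sqrt(0.5)
R2 = np.array([[c45, -s45], [s45, c45]])
A_generic = np.diag([4.0, 1.0])
B_generic = R2 @ np.diag([4.0, 1.0]) @ R2.T

gap_shared = log_linearity_gap(A_shared, B_shared)
gap_generic = log_linearity_gap(A_generic, B_generic)
print(gap_shared, gap_generic)
```

The first gap is zero up to floating-point error and the second is clearly positive, matching the direction of the equivalence that can be checked numerically; the converse and the stability bounds are theoretical results of the paper and are not verified here.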

Keywords: matrix interpolation, spectral properties, symmetric positive-definite matrices, operator norm, shared eigenvectors, multi-manifold learning, multiview data, latent structures

230. ❌ ID and Graph View Contrastive Learning with Multi-View Attention Fusion for Sequential Recommendation

Authors: Xiaofan Zhou, Kyumin Lee Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14114v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

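Scoring rationale: This paper focuses on sequential recommendation and proposes a multi-view contrastive learning framework (MVCrec) that combines an ID view with a graph view. Although it uses deep learning techniques (contrastive learning, graph neural networks, attention mechanisms), its content has no direct connection to any of the scoring keywords. The keywords all target large language models (LLMs) and related techniques (MoE, RLHF, RAG, quantization, and so on) or the AI for Science domain, whereas this paper addresses a conventional recommender-system problem and involves neither LLMs, scientific AI applications, nor any specific technique named in the keywords.

!!! tip deepseek-chat TL;DR

The paper proposes MVCrec, a multi-view contrastive learning framework for sequential recommendation that integrates complementary signals from the ID view and the graph view; on five real-world datasets it clearly outperforms existing baselines.

Abstract translation (Chinese)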

序列推荐在学术界与工业界（尤其是电子商务领域）日益受到重视，其核心目标是从用户历史交互序列中提取偏好，并预测用户可能感兴趣的下一个项目。近期研究通过对比学习与图神经网络从交互历史中学习更具表现力的表征——图结构捕捉节点间的关系结构，而基于ID的表征则编码项目特定信息。然而，现有研究很少探索基于ID的视角与图视角之间的多视图对比学习，以共同优化用户和项目表征，特别是在仅有交互数据而缺乏辅助信息的场景中。
为填补这一空白，我们提出用于序列推荐的多视图对比学习框架（MVCrec），该框架整合了来自序列（基于ID）视图与基于图视图的互补信号。MVCrec包含三个对比学习目标：序列视图内部、图视图内部以及跨视图对比。为有效融合学习到的表征，我们引入一个多视图注意力融合模块，该模块结合全局与局部注意力机制来估计目标用户购买目标项目的可能性。在五个真实世界基准数据集上的综合实验表明，MVCrec始终优于11个前沿基线模型，在NDCG@10和HitRatio@10指标上分别较最强基线提升最高达14.44%和9.22%。我们的代码与数据集已公开于https://github.com/sword-Lz/MMCrec。

摘要 (Abstract)

Sequential recommendation has become increasingly prominent in both academia and industry, particularly in e-commerce. The primary goal is to extract user preferences from historical interaction sequences and predict items a user is likely to engage with next. Recent advances have leveraged contrastive learning and graph neural networks to learn more expressive representations from interaction histories – graphs capture relational structure between nodes, while ID-based representations encode item-specific information. However, few studies have explored multi-view contrastive learning between ID and graph perspectives to jointly improve user and item representations, especially in settings where only interaction data is available without auxiliary information. To address this gap, we propose Multi-View Contrastive learning for sequential recommendation (MVCrec), a framework that integrates complementary signals from both sequential (ID-based) and graph-based views. MVCrec incorporates three contrastive objectives: within the sequential view, within the graph view, and across views. To effectively fuse the learned representations, we introduce a multi-view attention fusion module that combines global and local attention mechanisms to estimate the likelihood of a target user purchasing a target item. Comprehensive experiments on five real-world benchmark datasets demonstrate that MVCrec consistently outperforms 11 state-of-the-art baselines, achieving improvements of up to 14.44% in NDCG@10 and 9.22% in HitRatio@10 over the strongest baseline. Our code and datasets are available at https://github.com/sword-Lz/MMCrec.

Keywords: sequential recommendation, contrastive learning, graph neural networks, multi-view learning, attention fusion, user-item interaction, representation learning, recommendation systems
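
The two metrics quoted in the abstract, NDCG@10 and HitRatio@10, can be sketched as follows. This is a minimal illustration assuming a single held-out target item per user, which is a common protocol for next-item prediction but is not confirmed by the abstract; the function names and the example ranking are hypothetical.

```python
import math
from typing import Sequence


def hit_ratio_at_k(ranked_items: Sequence[int], target: int, k: int = 10) -> float:
    """Return 1.0 if the target item appears in the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items: Sequence[int], target: int, k: int = 10) -> float:
    """NDCG@k with a single relevant item: 1 / log2(rank + 1), rank starting at 1."""
    top_k = list(ranked_items[:k])
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1
    return 1.0 / math.log2(rank + 1)


# Hypothetical ranking of item IDs for one user; the true next item is 42.
ranking = [7, 42, 3, 19, 88, 5, 61, 2, 90, 14, 33]
hr = hit_ratio_at_k(ranking, target=42)
ndcg = ndcg_at_k(ranking, target=42)
```

With one relevant item the ideal DCG is 1, so NDCG reduces to the discounted gain at the target's rank. Averaging both quantities over users gives the dataset-level scores.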

231. ❌ Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Authors: Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, Pierfrancesco Beneventano Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

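Scoring rationale: The paper is a theoretical analysis of a deep learning optimization algorithm (SGD with momentum), focusing on how momentum and batch size relate to the stability boundary (Edge of Stochastic Stability); it is foundational theory on how deep learning works. None of the keywords, which cover specific areas such as large-model applications, training methods, inference optimization, alignment, agent systems, and scientific AI applications, is relevant, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies how SGD with momentum self-organizes near the edge of stochastic stability during optimization, and shows that momentum affects Batch Sharpness differently depending on batch size: at small batch sizes momentum amplifies stochastic fluctuations and favors flatter regions, while at large batch sizes it recovers its classical stabilizing role and favors sharper regions.

Abstract translation (Chinese)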

近期研究表明，（随机）梯度下降法在接近不稳定性边界时会自发组织，从而影响优化过程及最终所得解。动量法与迷你批次梯度下降在实际深度学习优化中被广泛采用，但其是否在类似的不稳定性机制中运作尚不明确。本文证明，带动量的随机梯度下降法表现出一种类似随机稳定性边缘（Edge of Stochastic Stability, EoSS）的机制，其行为依赖于批次大小，无法通过单一动量调整的稳定性阈值来解释。批次锐度（即期望方向性迷你批次曲率）在两个不同机制中趋于稳定：在小批次规模下，它收敛至较低平台值 $2(1-β)/η$，这反映了动量对随机波动的放大作用，并倾向于选择比普通随机梯度下降更平坦的区域；在大批次规模下，它收敛至较高平台值 $2(1+β)/η$，此时动量恢复其经典稳定效应，倾向于选择与全批次动态一致的更尖锐区域。我们进一步证明该现象与线性稳定性阈值相符，并讨论了其对超参数调优及耦合机制的启示。

摘要 (Abstract)

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.

Keywords: stochastic gradient descent, momentum, Edge of Stochastic Stability, batch size, batch sharpness, optimization, stability, deep learning
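
The two plateaus are easy to evaluate numerically, and the higher one matches the classical full-batch stability limit of heavy-ball momentum on a quadratic. The sketch below assumes the heavy-ball update $x_{t+1} = x_t - \eta \nabla f(x_t) + \beta (x_t - x_{t-1})$ on a one-dimensional quadratic; the hyperparameter values and function names are illustrative and are not taken from the paper, and the small-batch regime is not simulated.

```python
from typing import Tuple


def plateaus(eta: float, beta: float) -> Tuple[float, float]:
    """Small-batch and large-batch Batch Sharpness plateaus quoted in the abstract."""
    return 2.0 * (1.0 - beta) / eta, 2.0 * (1.0 + beta) / eta


def heavy_ball_final_iterate(curvature: float, eta: float, beta: float, steps: int = 200) -> float:
    """Run full-batch heavy-ball momentum on f(x) = curvature * x**2 / 2 and return |x_T|."""
    x_prev, x = 1.0, 1.0
    for _ in range(steps):
        grad = curvature * x
        # Right-hand side uses the old (x, x_prev) before both are reassigned.
        x, x_prev = x - eta * grad + beta * (x - x_prev), x
    return abs(x)


eta, beta = 0.1, 0.5                 # illustrative hyperparameters, not from the paper
low, high = plateaus(eta, beta)      # roughly 10 and 30
below = heavy_ball_final_iterate(high - 1.0, eta, beta)  # curvature just under 2(1+beta)/eta
above = heavy_ball_final_iterate(high + 1.0, eta, beta)  # curvature just over it
```

Here `below` shrinks toward zero while `above` blows up, which is the linear-stability picture behind the large-batch plateau.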

232. ❌ Multistage Conditional Compositional Optimization

Authors: Buse Şen, Yifan Hu, Daniel Kuhn Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14075v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

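Scoring rationale: The paper studies Multistage Conditional Compositional Optimization (MCCO), a mathematical optimization paradigm belonging to operations research, stochastic optimization, and decision theory. It does not touch on large models, deep learning, the technical principles of AI, or applications of AI in science. The keywords all center on large-model techniques and their applications and are unrelated to the paper’s mathematical optimization topic, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper introduces MCCO, a new paradigm for decision-making under uncertainty, and develops multilevel Monte Carlo techniques whose scenario complexity grows only polynomially with the desired accuracy, addressing the curse of dimensionality that affects naive nested sampling.

Abstract translation (Chinese)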

我们提出多阶段条件组合优化（Multistage Conditional Compositional Optimization，简称MCCO）作为不确定性下决策制定的新范式，它融合了多阶段随机规划与条件随机优化的特点。MCCO旨在最小化一系列嵌套的条件期望与非线性成本函数。该方法具有广泛的应用场景，例如在最优停止、线性二次调节器问题、分布鲁棒上下文赌博机以及涉及动态风险度量的各类问题中均有体现。针对MCCO的传统嵌套采样方法会遭遇基于情景树的多阶段随机规划中常见的维度灾难问题，即其情景复杂度会随着嵌套层数呈指数级增长。我们为MCCO开发了新的多级蒙特卡洛技术，其情景复杂度仅随目标精度呈多项式增长。

摘要 (Abstract)

We introduce Multistage Conditional Compositional Optimization (MCCO) as a new paradigm for decision-making under uncertainty that combines aspects of multistage stochastic programming and conditional stochastic optimization. MCCO minimizes a nest of conditional expectations and nonlinear cost functions. It has numerous applications and arises, for example, in optimal stopping, linear-quadratic regulator problems, distributionally robust contextual bandits, as well as in problems involving dynamic risk measures. The naïve nested sampling approach for MCCO suffers from the curse of dimensionality familiar from scenario tree-based multistage stochastic programming, that is, its scenario complexity grows exponentially with the number of nests. We develop new multilevel Monte Carlo techniques for MCCO whose scenario complexity grows only polynomially with the desired accuracy.

Keywords: Multistage Conditional Compositional Optimization, decision-making under uncertainty, multistage stochastic programming, conditional stochastic optimization, multilevel Monte Carlo, curse of dimensionality, scenario complexity, optimal stopping
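
The exponential growth attributed to naive nested sampling can be made concrete by counting scenarios. The sketch below assumes the standard scenario-tree picture, in which every sample at one nest spawns a fresh batch of conditional samples at the next; the abstract does not state the estimator, so the function and the budget of 10 samples per nest are hypothetical. The paper's multilevel Monte Carlo construction is not reproduced.

```python
def nested_scenarios(samples_per_nest: int, nests: int) -> int:
    """Count leaf scenarios when every sample at one nest spawns fresh samples at the next."""
    if nests == 0:
        return 1
    return sum(nested_scenarios(samples_per_nest, nests - 1) for _ in range(samples_per_nest))


# Hypothetical budget: 10 conditional samples per nest.
counts = {t: nested_scenarios(10, t) for t in range(1, 5)}
```

Each extra nest multiplies the scenario count by the per-nest sample budget, so the total is `samples_per_nest ** nests`.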

233. ❌ Neural architectures for resolving references in program code

Authors: Gergő Szalay, Gergely Zsolt Kovács, Sándor Teleki, Balázs Pintér, Tibor Gregorics Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14073v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

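Scoring rationale: The paper focuses on the specific task of resolving references in program code; it proposes new sequence-to-sequence architectures for direct and indirect indexing and applies them to decompiling switch statements. Its content revolves entirely around programming-language processing, sequence-model architectures, and a specific application task, and it does not involve large models, innovations in deep learning principles, AI for Science, or any other large-model technique in the keyword list. None of the keywords is relevant to the paper’s topic.

!!! tip deepseek-chat TL;DR

Addressing the indexing problems that arise when resolving references in program code, the paper proposes new sequence-to-sequence architectures that markedly improve robustness, scalability, and accuracy on synthetic benchmarks and on a real-world decompilation task.

Abstract translation (Chinese)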

解析与重写引用是编程语言中的基础问题。受实际反编译任务的启发，我们将引用重写抽象为通过排列进行直接索引与间接索引的问题。针对这些任务，我们创建了合成基准测试，并证明经典的序列到序列机器学习架构在这些基准上表现不佳。针对这两个问题，我们提出了新的序列到序列架构。实验测量表明，我们的架构在鲁棒性和可扩展性上均优于基线模型：我们的模型能够处理比最佳基线长十倍的示例。我们在反编译switch语句（其包含索引子任务）的实际任务中评估了我们架构的影响。根据测量结果，扩展模型将错误率降低了42%。多项消融研究表明，我们架构中的所有组件都是必不可少的。

摘要 (Abstract)

Resolving and rewriting references is fundamental in programming languages. Motivated by a real-world decompilation task, we abstract reference rewriting into the problems of direct and indirect indexing by permutation. We create synthetic benchmarks for these tasks and show that well-known sequence-to-sequence machine learning architectures are struggling on these benchmarks. We introduce new sequence-to-sequence architectures for both problems. Our measurements show that our architectures outperform the baselines in both robustness and scalability: our models can handle examples that are ten times longer compared to the best baseline. We measure the impact of our architecture in the real-world task of decompiling switch statements, which has an indexing subtask. According to our measurements, the extended model decreases the error rate by 42%. Multiple ablation studies show that all components of our architectures are essential.

Keywords: reference resolution, program code, sequence-to-sequence architectures, indexing, decompilation, switch statements, robustness, scalability

234. ❌ A Complete Symmetry Classification of Shallow ReLU Networks

Authors: Pranavkrishnan Ramakrishnan Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14037v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

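Scoring rationale: The paper studies the classification of parameter symmetries of shallow ReLU networks, which belongs to the theoretical analysis of neural networks and has no direct connection to any of the scoring keywords (which mainly concern large-model techniques, training methods, inference optimization, and applications). It does not involve large models, deep learning applications, or any specific technique named in the keywords.

!!! tip deepseek-chat TL;DR

The paper gives a complete classification of the parameter symmetries of shallow ReLU networks, exploiting the non-differentiability of ReLU to obtain the first complete symmetry classification for this activation function.

Abstract translation (Chinese)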

参数空间并非神经网络架构的函数空间。这一事实早在20世纪90年代就以“逆向工程”或“参数可辨识性”等术语被研究，并自然引出了对参数空间对称性的探讨——即研究实现相同函数的不同神经网络参数。事实上，通过识别产生相同函数的参数所得到的商空间（称为神经流形）在某些情况下已被证明具有丰富的几何性质，并影响优化动态。迄今为止，实现完全分类的技术要求激活函数的解析性，这尤其排除了重要的ReLU情形。与此相反，本文利用ReLU激活函数的不可微性，为浅层网络情形下的对称性提供了完整分类。

摘要 (Abstract)

Parameter space is not function space for neural network architectures. This fact, investigated as early as the 1990s under terms such as “reverse engineering” or “parameter identifiability”, has led to the natural question of parameter space symmetries: the study of distinct parameters in neural architectures which realize the same function. Indeed, the quotient space obtained by identifying parameters giving rise to the same function, called the neuromanifold, has been shown in some cases to have rich geometric properties, impacting optimization dynamics. Thus far, techniques towards complete classifications have required the analyticity of the activation function, notably excising the important case of ReLU. Here, in contrast, we exploit the non-differentiability of the ReLU activation to provide a complete classification of the symmetries in the shallow case.

Keywords: neural networks, ReLU activation, parameter symmetries, neuromanifold, shallow networks, reverse engineering, parameter identifiability, non-differentiability
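
Two well-known examples of such symmetries are permuting hidden neurons and rescaling a neuron by a positive constant. The sketch below checks both numerically for a shallow ReLU network with scalar input; the parameter values and helper names are hypothetical, and whether the paper's classification contains further symmetries is not stated in the abstract.

```python
from typing import List, Tuple

Neuron = Tuple[float, float, float]  # (input weight w, bias b, output weight a)


def relu(z: float) -> float:
    return max(0.0, z)


def shallow_net(neurons: List[Neuron], x: float) -> float:
    """Evaluate f(x) = sum_i a_i * relu(w_i * x + b_i) for a scalar input."""
    return sum(a * relu(w * x + b) for (w, b, a) in neurons)


# Hypothetical parameters chosen only for illustration.
theta = [(1.0, -0.5, 2.0), (-2.0, 1.0, 0.5)]

# Symmetry 1: permuting hidden neurons leaves the function unchanged.
theta_permuted = [theta[1], theta[0]]

# Symmetry 2: for c > 0, relu(c * z) = c * relu(z), so scaling (w, b) by c
# and the output weight a by 1/c leaves the function unchanged.
c = 4.0
theta_rescaled = [(c * w, c * b, a / c) for (w, b, a) in theta]

# Largest deviation between the original and the transformed networks on a grid.
grid = [i / 10.0 for i in range(-30, 31)]
max_gap = max(
    max(abs(shallow_net(theta, x) - shallow_net(theta_permuted, x)),
        abs(shallow_net(theta, x) - shallow_net(theta_rescaled, x)))
    for x in grid
)
```

All three parameter vectors are distinct points in parameter space yet define the same function, so `max_gap` stays at floating-point noise level.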

235. ❌ A Comparative Study of Dynamic Programming and Reinforcement Learning in Finite Horizon Dynamic Pricing

Authors: Lev Razumovskiy, Nikolay Karenin Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

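Scoring rationale: The paper studies finite-horizon dynamic pricing and compares dynamic programming (DP) with reinforcement learning (RL). It belongs to classical operations research, optimization, and machine learning, and does not involve large language models (LLMs), deep learning, the technical principles of large models, or their applications in science. The keywords all relate to large models, deep learning, and AI for Science, whereas this paper concentrates on comparing classical dynamic pricing algorithms, so there is no connection.

!!! tip deepseek-chat TL;DR

The paper systematically compares Fitted Dynamic Programming (DP), where demand is estimated from data, with Reinforcement Learning (RL) methods for finite-horizon dynamic pricing across environments of varying structural complexity, covering revenue performance, stability, constraint satisfaction, and computational scaling, and highlights the trade-off between explicit expectation-based optimization and trajectory-based learning.

Abstract translation (Chinese)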

本文对有限时域动态定价问题中基于数据估计需求的拟合动态规划与强化学习方法进行了系统性比较。我们分析了它们在结构复杂度递增环境中的性能表现，涵盖从单一类型基准到具有异质性需求及跨期收益约束的多类型场景。与将动态规划局限于低维环境的简化比较不同，本研究将动态规划应用于具有多产品类型和约束的更丰富多维环境中。我们评估了收益表现、稳定性、约束满足行为以及计算复杂度，重点揭示了基于显式期望的优化方法与基于轨迹的学习方法之间的权衡关系。

摘要 (Abstract)

This paper provides a systematic comparison between Fitted Dynamic Programming (DP), where demand is estimated from data, and Reinforcement Learning (RL) methods in finite-horizon dynamic pricing problems. We analyze their performance across environments of increasing structural complexity, ranging from a single typology benchmark to multi-typology settings with heterogeneous demand and inter-temporal revenue constraints. Unlike simplified comparisons that restrict DP to low-dimensional settings, we apply dynamic programming in richer, multi-dimensional environments with multiple product types and constraints. We evaluate revenue performance, stability, constraint satisfaction behavior, and computational scaling, highlighting the trade-offs between explicit expectation-based optimization and trajectory-based learning.

Keywords: Dynamic Programming, Reinforcement Learning, Dynamic Pricing, Finite Horizon, Multi-typology, Constraint Satisfaction, Computational Scaling, Revenue Performance
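
The explicit expectation-based side of the comparison can be illustrated with backward induction on a toy pricing problem. This is a minimal sketch, not the paper's model: it assumes a single product type, at most one customer per period, and a hypothetical purchase-probability table standing in for the demand model that the paper estimates from data.

```python
from typing import Dict, List


def dp_pricing(horizon: int, inventory: int, buy_prob: Dict[float, float]) -> List[List[float]]:
    """Backward induction for V[t][s]: best expected revenue from period t with s units left.

    Each period at most one customer arrives and buys with probability buy_prob[price].
    """
    value = [[0.0] * (inventory + 1) for _ in range(horizon + 1)]
    for t in range(horizon - 1, -1, -1):
        for s in range(1, inventory + 1):
            # Bellman step: expectation over sale / no sale, maximized over prices.
            value[t][s] = max(
                q * (price + value[t + 1][s - 1]) + (1.0 - q) * value[t + 1][s]
                for price, q in buy_prob.items()
            )
    return value


# Hypothetical demand model: purchase probability falls as price rises.
demand = {1.0: 0.9, 2.0: 0.5, 3.0: 0.2}
V = dp_pricing(horizon=3, inventory=2, buy_prob=demand)
```

A trajectory-based RL method would instead estimate these values from sampled episodes without evaluating the expectation in closed form, which is the trade-off the abstract highlights.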

236. ❌ Stochastic Trust-Region Methods for Over-parameterized Models

Authors: Aike Yang, Hao Wang Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

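Scoring rationale: The paper focuses on the theoretical analysis and application of stochastic optimization algorithms (in particular stochastic trust-region methods) for training over-parameterized models. Although it uses deep neural network training for experimental validation, its core content concerns the optimization algorithm itself (convergence rates, complexity analysis, constraint handling) rather than large-model techniques, innovations in deep learning principles, or domain-specific applications. The keywords all concern large-model techniques (architecture, training, inference, applications, and so on) and are unrelated to the paper’s optimization-algorithm topic.

!!! tip deepseek-chat TL;DR

The paper proposes a stochastic trust-region optimization framework for training over-parameterized models, provides convergence-rate analysis under the strong growth condition, and validates the method with experiments on deep neural network training and orthogonally constrained subspace fitting.

Abstract translation (Chinese)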

在插值型假设（如强增长条件）下，随机优化方法虽能达到与全批量方法相当的收敛速率，但其性能（尤其是随机梯度下降法）仍对步长选择高度敏感。为解决此问题，我们提出了一种统一的随机信赖域框架，该框架无需手动调整步长，并能自然扩展至等式约束问题。针对无约束优化，我们提出了一阶随机信赖域算法，并证明在强增长条件下，该算法为找到ε-平稳点所需的迭代次数和随机一阶Oracle复杂度均为$O(\varepsilon^{-2} \log(1/\varepsilon))$。对于等式约束问题，我们引入了一种基于二次惩罚的随机信赖域方法，其惩罚参数为μ，并证明该算法达到惩罚问题ε-平稳点所需的迭代次数和Oracle复杂度为$O(\varepsilon^{-4} \log(1/\varepsilon))$，这对应于原约束问题的$O(\varepsilon)$-近似KKT点。在深度神经网络训练和正交约束子空间拟合上的数值实验表明，所提方法在达到与精心调参的随机基线方法相当性能的同时，展现出稳定的优化行为，且无需手动调整学习率即可有效处理硬约束。

摘要 (Abstract)

Under interpolation-type assumptions such as the strong growth condition, stochastic optimization methods can attain convergence rates comparable to full-batch methods, but their performance, particularly for SGD, remains highly sensitive to step-size selection. To address this issue, we propose a unified stochastic trust-region framework that eliminates manual step-size tuning and extends naturally to equality-constrained problems. For unconstrained optimization, we develop a first-order stochastic trust-region algorithm and show that, under the strong growth condition, it achieves an iteration and stochastic first-order oracle complexity of $O(\varepsilon^{-2} \log(1/\varepsilon))$ for finding an $\varepsilon$-stationary point. For equality-constrained problems, we introduce a quadratic-penalty-based stochastic trust-region method with penalty parameter $\mu$, and establish an iteration and oracle complexity of $O(\varepsilon^{-4} \log(1/\varepsilon))$ to reach an $\varepsilon$-stationary point of the penalized problem, corresponding to an $O(\varepsilon)$-approximate KKT point of the original constrained problem. Numerical experiments on deep neural network training and orthogonally constrained subspace fitting demonstrate that the proposed methods achieve performance comparable to well-tuned stochastic baselines, while exhibiting stable optimization behavior and effectively handling hard constraints without manual learning-rate scheduling.

Keywords: stochastic trust-region methods, over-parameterized models, optimization algorithms, convergence analysis, deep neural network training, equality-constrained problems, KKT point, oracle complexity
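
For orientation, the two standard building blocks named in the abstract can be written out. The formulas below are textbook forms, not the paper's exact algorithm: the symbols $g_k$, $\Delta_k$, $s_k$, $c(x)$, and $\phi_\mu$ are introduced here for illustration, and the $\mu/2$ scaling of the penalty is one common convention that the abstract does not confirm.

```latex
% First-order trust-region subproblem with stochastic gradient estimate g_k
% and radius \Delta_k (symbols are illustrative):
\[
  s_k = \arg\min_{\|s\| \le \Delta_k} \; g_k^{\top} s
      = -\,\Delta_k \, \frac{g_k}{\|g_k\|}, \qquad g_k \neq 0 .
\]
% Quadratic penalty for minimizing f(x) subject to c(x) = 0
% (the factor \mu/2 is one common convention, assumed here):
\[
  \phi_{\mu}(x) = f(x) + \frac{\mu}{2}\,\|c(x)\|^{2} .
\]
```

The step length is set by the radius $\Delta_k$ rather than by a learning rate, which is how a trust-region scheme can avoid manual step-size tuning.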

237. ❌ Unsupervised domain transfer: Overcoming signal degradation in sleep monitoring by increasing scoring realism

Authors: Mohammad Ahangarkiasari, Andreas Tind Damgaard, Casper Haurum, Kaare B. Mikkelsen Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13988v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

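Scoring rationale: The paper studies unsupervised domain adaptation for sleep monitoring, combining the pretrained u-sleep model with a discriminator network to handle signal degradation. Most keywords are unrelated because the paper targets a specific biomedical application rather than general large-model techniques. The only highly relevant keyword is ‘Pre-training OR Continual Pre-training OR Domain Adaptation’ (10), because the core of the paper is domain adaptation with a pretrained model. ‘AI for Science OR Bioinformatics OR Cheminformatics’ (8) is also relevant because the paper is a bioinformatics / medical AI application. No other keyword is covered, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an unsupervised domain adaptation method based on discriminator-guided fine-tuning to handle signal degradation in mobile sleep monitoring; experiments show that the method improves performance but does not reach the theoretical optimum.

Abstract translation (Chinese)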

目的：探究是否可利用睡眠分期图“真实性”指导无监督方法，以处理移动睡眠监测中任意类型的信号退化问题。
方法：将预训练的最先进“u-sleep”模型与“判别器”网络相结合，使目标域特征与预训练期间学习的特征空间对齐。为验证该方法，我们通过模拟真实信号退化对源域数据进行扭曲，以测试模型对不同类型退化的适应能力。最终将所得模型的性能与针对每类迁移任务以监督方式设计的最佳模型进行比较。
主要结果：根据失真类型的不同，无监督方法可将Cohen’s kappa系数提升0.03至0.29；在所有迁移任务中，该方法均未降低模型性能。然而，该方法始终未能达到预估的理论最优性能，且在针对两项真实睡眠研究间的域失配测试中，其改善效果不显著。
意义：“判别器引导的微调”为处理“野外”睡眠监测中的信号退化问题提供了具有潜力的新思路，其揭示的睡眠数据普遍特性尤为值得关注。但该方法在投入实际应用前仍需进一步改进。

摘要 (Abstract)

Objective: Investigate whether hypnogram ‘realism’ can be used to guide an unsupervised method for handling arbitrary types of signal degradation in mobile sleep monitoring.
Approach: Combining a pretrained, state-of-the-art ‘u-sleep’ model with a ‘discriminator’ network, we align features from a target domain with a feature space learned during pretraining. To test the approach, we distort the source domain with realistic signal degradations, to see how well the method can adapt to different types of degradation. We compare the performance of the resulting model with best-case models designed in a supervised manner for each type of transfer.
Main Results: Depending on the type of distortion, we find that the unsupervised approach can increase Cohen’s kappa by as little as 0.03 and by up to 0.29, and that for all transfers, the method does not decrease performance. However, the approach never quite reaches the estimated theoretical optimal performance, and when tested on a real-life domain mismatch between two sleep studies, the benefit was insignificant.
Significance: ‘Discriminator-guided fine tuning’ is an interesting approach to handling signal degradation for ‘in the wild’ sleep monitoring, with some promise. In particular, what it says about sleep data in general is interesting. However, more development will be necessary before using it ‘in production’.

Keywords: unsupervised domain transfer, sleep monitoring, signal degradation, hypnogram realism, discriminator-guided fine tuning, u-sleep model, feature alignment, Cohen’s kappa
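
The reported gains are stated in Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. The sketch below implements the standard definition $\kappa = (p_o - p_e)/(1 - p_e)$; the function name and the four-epoch label sequences are hypothetical and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of epochs where both label sequences match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal class frequencies, summed over classes.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both sequences use a single identical class
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical sleep-stage labels for four epochs (manual scoring vs. model output).
manual = ["W", "W", "N2", "N2"]
model = ["W", "N2", "N2", "N2"]
kappa = cohens_kappa(manual, model)
```

In this toy case three of four epochs agree, but half of that agreement is expected by chance, so kappa is well below the raw accuracy of 0.75.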

238. ❌ Physics-Informed Neural Networks for Methane Sorption: Cross-Gas Transfer Learning, Ensemble Collapse Under Physics Constraints, and Monte Carlo Dropout Uncertainty Quantification

Authors: Mohammad Nooraiepour, Zezhang Song, Wei Li, Sarah Perez Venue/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13992v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0
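
The pass mark of 26.6 follows from the report configuration (5.0 + 0.8 × total weight, with a total weight of 27.0). The sketch below reproduces that arithmetic. The per-keyword rule it uses (weight × relevance/10 × the 15-point cap) is an assumption made for illustration; the report does not state it, although it is consistent with the rows above, where a weight of 0.0 yields a score of 0.0 regardless of relevance.

```python
import math

MAX_PER_KEYWORD = 15.0  # "max score per keyword" from the report configuration


def pass_mark(total_weight, base=5.0, slope=0.8):
    """Pass-mark formula stated in the report: base + slope * total weight."""
    return base + slope * total_weight


def keyword_score(weight, relevance):
    """ASSUMED rule (not stated in the report): weight * relevance/10 * cap."""
    return weight * (relevance / 10.0) * MAX_PER_KEYWORD


# (weight, relevance) pairs copied from the table above; the 25 rows with
# relevance 0.0 are collapsed into one repeated entry.
rows = [(0.0, 5.0), (0.0, 10.0)] + [(0.0, 0.0)] * 25

threshold = pass_mark(total_weight=27.0)
total = sum(keyword_score(w, r) for w, r in rows)
passed = total >= threshold
print(round(threshold, 1), total, passed)
```

Under this assumed rule, any row whose weight is 0.0 contributes nothing to the total, however high its relevance.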

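Scoring rationale: This paper studies physics-informed neural networks (PINNs) for methane sorption prediction, which falls under AI for Science, so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). It uses SHAP and ALE for interpretability analysis, which gives it some relevance to 'Mechanistic Interpretability OR Explainable AI' (5). It does not address large language models, MoE, small models, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agents, quantization, inference acceleration, hallucination mitigation, world models, model merging, or in-context learning, so those keywords all score 0.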

!!! tip deepseek-chat TL;DR

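The paper proposes a physics-informed transfer learning framework that adapts a hydrogen sorption PINN to methane sorption prediction via Elastic Weight Consolidation and curriculum learning. Trained on 993 equilibrium measurements, it reaches R² = 0.932 on held-out coal samples and finds Monte Carlo Dropout to be the best-performing uncertainty quantification method in this physics-constrained framework.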

Abstract (Chinese translation)

准确预测跨不同煤阶的非均质煤体甲烷吸附行为，需要模型兼具热力学一致性、在数据稀缺地质系统间高效进行知识迁移的能力，以及经过校准的不确定性估计功能，这些能力在现有框架中鲜少被同时考量。本文提出一种物理信息驱动的迁移学习框架，该框架通过弹性权重巩固、煤特异性特征工程以及一个逐步平衡迁移保持与热力学微调的三阶段课程学习策略，将氢气吸附的物理信息神经网络（PINN）适配于甲烷吸附预测。该框架在涵盖褐煤至无烟煤的114个独立煤样实验、总计993个平衡测量数据上训练，在留出煤样测试集上取得了R² = 0.932的预测精度，相较于仅基于压力的经典等温线模型性能提升了227%。同时，与随机初始化相比，氢气预训练使模型均方根误差降低了18.9%，收敛速度加快了19.4%。五种贝叶斯不确定性量化（UQ）方法的对比揭示了不同物理约束架构在性能上的系统性差异：蒙特卡洛丢弃法以最小计算开销实现了良好校准的不确定性估计，而深度集成方法——无论其架构多样性或初始化策略如何——均表现出性能下降，这是因为共享的物理约束缩小了可接受的解流形。SHAP和ALE分析证实，学习到的表征保持物理可解释性，并与既定的煤吸附机制一致：水分-挥发分相互作用最具影响力，压力-温度耦合捕捉了热力学协同依赖性，且特征呈现非单调效应。这些结果确立了蒙特卡洛丢弃法为此物理约束迁移学习框架中性能最佳的不确定性量化方法，并证明了跨气体迁移学习是地质材料建模中一种数据高效策略。

Abstract

Accurate methane sorption prediction across heterogeneous coal ranks requires models that combine thermodynamic consistency, efficient knowledge transfer across data-scarce geological systems, and calibrated uncertainty estimates, capabilities that are rarely addressed together in existing frameworks. We present a physics-informed transfer learning framework that adapts a hydrogen sorption PINN to methane sorption prediction via Elastic Weight Consolidation, coal-specific feature engineering, and a three-phase curriculum that progressively balances transfer preservation with thermodynamic fine-tuning. Trained on 993 equilibrium measurements from 114 independent coal experiments spanning lignite to anthracite, the framework achieves R² = 0.932 on held-out coal samples, a 227% improvement over pressure-only classical isotherms, while hydrogen pre-training delivers 18.9% lower RMSE and 19.4% faster convergence than random initialization. Five Bayesian uncertainty quantification approaches reveal a systematic divergence in performance across physics-constrained architectures. Monte Carlo Dropout achieves well-calibrated uncertainty at minimal overhead, while deep ensembles, regardless of architectural diversity or initialization strategy, exhibit performance degradation because shared physics constraints narrow the admissible solution manifold. SHAP and ALE analyses confirm that learned representations remain physically interpretable and aligned with established coal sorption mechanisms: moisture-volatile interactions are most influential, pressure-temperature coupling captures thermodynamic co-dependence, and features exhibit non-monotonic effects. These results identify Monte Carlo Dropout as the best-performing UQ method in this physics-constrained transfer learning framework, and demonstrate cross-gas transfer learning as a data-efficient strategy for geological material modeling.

Keywords: Physics-Informed Neural Networks, Methane Sorption, Transfer Learning, Monte Carlo Dropout, Uncertainty Quantification, Coal Sorption, Cross-Gas Transfer, Physics Constraints
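
The abstract singles out Monte Carlo Dropout as giving well-calibrated uncertainty at minimal overhead. A minimal sketch of the general idea follows: dropout stays active at prediction time, the same input is passed through the network many times, and the spread of the outputs serves as the uncertainty estimate. The toy network, its weights, and the function names are hypothetical and stand in for the trained PINN, whose architecture the abstract does not describe.

```python
import random
import statistics


def forward(x, weights, drop_p, rng):
    """One stochastic pass through a toy one-hidden-layer ReLU network.

    Dropout stays active at prediction time (the core idea of MC Dropout);
    kept units are rescaled by 1 / (1 - drop_p), i.e. inverted dropout.
    """
    w_hidden, w_out = weights
    out = 0.0
    for w_h, w_o in zip(w_hidden, w_out):
        h = max(0.0, w_h * x)          # ReLU hidden unit
        if rng.random() < drop_p:      # drop this unit for this pass
            continue
        out += w_o * h / (1.0 - drop_p)
    return out


def mc_dropout_predict(x, weights, drop_p=0.2, n_samples=200, seed=0):
    """Return (predictive mean, predictive std) over n_samples passes."""
    rng = random.Random(seed)
    samples = [forward(x, weights, drop_p, rng) for _ in range(n_samples)]
    return statistics.fmean(samples), statistics.stdev(samples)


# Hypothetical toy weights; a real model would be the trained network.
toy_weights = ([0.5, -0.3, 0.8, 0.1], [1.0, 0.7, -0.2, 0.4])

mean, std = mc_dropout_predict(1.5, toy_weights)
det_mean, det_std = mc_dropout_predict(1.5, toy_weights, drop_p=0.0)
```

With `drop_p=0.0` every pass is identical, so the spread collapses to zero; the only extra cost of the method is the repeated forward passes.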

239. ❌ PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

Authors: Zichao Yan, Yan Wu, Mica Xu Ji, Chaitra Agrahar, Esther Wershof, Marcel Nassar, Mehrshad Sadria, Ridvan Eksi, Vladimir Trifonov, Ignacio Ibarra, Telmo Felgueira, Błażej Osiński, Rory Stark Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13986v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

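Scoring rationale: PRiMeFlow uses a flow-matching-based deep learning method to model single-cell gene expression responses to perturbations, which places it in bioinformatics and computational biology. It does not involve large language model (LLM) techniques, architectures, training methods, inference optimization, alignment, agent systems, or general-purpose AI techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies deep learning to bioinformatics (specifically single-cell omics), which falls under AI for Science, so it receives 10. All other keywords are unrelated to the paper and score 0.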

!!! tip deepseek-chat TL;DR

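The paper proposes PRiMeFlow, an end-to-end flow-matching deep learning model that directly models the effects of genetic and small-molecule perturbations in single-cell gene expression space. It targets the heterogeneity of single-cell expression and complex gene dependencies, and benchmarks show that it accurately approximates the empirical distribution.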

Abstract (Chinese translation)

在计算机中预测扰动对细胞状态的影响，能够大规模识别细胞行为的驱动因子并加速药物发现。然而，由于单细胞基因表达固有的异质性以及复杂、潜在的基因依赖性，建模仍面临挑战。本文提出PRiMeFlow，一种基于端到端流匹配（flow matching）的方法，可直接在基因表达空间中模拟遗传扰动与小分子扰动的影响。PRiMeFlow采用的分布拟合方法使其能够精确逼近单细胞基因表达的经验分布，这一点我们通过在PerturBench内部进行的广泛基准测试予以验证。通过消融研究，我们也验证了关键的模型设计选择，例如在基因表达空间中进行操作，以及使用U-Net架构对速度场进行参数化。PRiMeFlow架构被用作赢得首届ARC虚拟细胞挑战赛（ARC Virtual Cell Challenge）通用模型奖（Generalist Prize）的基础模型。

Abstract

Predicting the effects of perturbations in-silico on cell state can identify drivers of cell behavior at scale and accelerate drug discovery. However, modeling challenges remain due to the inherent heterogeneity of single cell gene expression and the complex, latent gene dependencies. Here, we present PRiMeFlow, an end-to-end flow matching based approach to directly model the effects of genetic and small molecule perturbations in the gene expression space. The distribution-fitting approach taken by PRiMeFlow enables it to accurately approximate the empirical distribution of single-cell gene expression, which we demonstrate through extensive benchmarking inside PerturBench. Through ablation studies, we also validate important model design choices such as operating in gene expression space and parameterizing the velocity field with a U-Net architecture. The PRiMeFlow architecture was used as the basis for the model that won the Generalist Prize in the first ARC Virtual Cell Challenge.

Keywords: perturbation response modeling, single-cell gene expression, flow matching, deep learning, bioinformatics, drug discovery, U-Net architecture, gene expression space
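
For readers unfamiliar with flow matching, the sketch below shows the generic straight-path training objective: sample a time t, interpolate between a source sample and a target sample, and regress a velocity model onto the path's velocity. It is not PRiMeFlow's implementation; the abstract does not specify the path, the conditioning, or the loss details, and `velocity_model` is a hypothetical stand-in for the U-Net the paper mentions.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (source) to x1 (target) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_model, pairs, rng):
    """Monte Carlo estimate of E ||v(x_t, t) - (x1 - x0)||^2 over the pairs."""
    total = 0.0
    for x0, x1 in pairs:
        t = rng.random()                      # t ~ Uniform[0, 1)
        x_t = interpolate(x0, x1, t)
        pred = velocity_model(x_t, t)
        target = target_velocity(x0, x1)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / len(pairs)


# Hypothetical toy data: "control" vectors x0 and "perturbed" vectors x1 that
# differ by a constant shift, so the ideal velocity field is that shift.
shift = [1.0, -2.0, 0.5]
rng = random.Random(0)
pairs = []
for _ in range(32):
    x0 = [rng.gauss(0.0, 1.0) for _ in shift]
    pairs.append((x0, [a + s for a, s in zip(x0, shift)]))


def perfect_model(x_t, t):
    return shift


def zero_model(x_t, t):
    return [0.0, 0.0, 0.0]


loss_perfect = flow_matching_loss(perfect_model, pairs, rng)
loss_zero = flow_matching_loss(zero_model, pairs, rng)
```

On this toy data a model that outputs the shift drives the loss to (numerically) zero, while a model that outputs zero pays the squared norm of the shift.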

240. ❌ BOAT: Navigating the Sea of In Silico Predictors for Antibody Design via Multi-Objective Bayesian Optimization

Authors: Jackie Rao, Ferran Gonzalez Hernandez, Leon Gerard, Alexandra Gessner Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13980v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

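Scoring rationale: The paper presents a Bayesian optimization framework for antibody design, an application of AI to bioinformatics and science. It is highly relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (10); no other keyword is addressed (0).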

!!! tip deepseek-chat TL;DR

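The study proposes BOAT, a Bayesian optimization framework for multi-objective antibody engineering that couples uncertainty-aware surrogate models with a genetic algorithm to optimize antibody properties, and shows performance competitive with state-of-the-art methods in systematic benchmarks.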

Abstract (Chinese translation)

抗体先导化合物优化本质上是药物发现中的一个多目标挑战。在不同类药特性之间取得平衡对于开发可行的候选药物至关重要，而随着所需特性的增加，这种搜索的难度呈指数级增长。日益增多的复杂计算机模拟抗体特性预测工具，亟需一种高效的联合优化流程来克服资源密集型的顺序筛选流程。我们提出BOAT，一个用于多特性抗体工程的通用贝叶斯优化框架。我们的“即插即用”框架将不确定性感知的代理模型与遗传算法相结合，以联合优化多种预测的抗体特性，同时实现对序列空间的高效探索。通过对遗传算法和新型生成式学习方法的系统基准测试，我们证明了该框架在多目标蛋白质优化任务中与最先进方法相比具有竞争力。我们明确了代理驱动优化优于昂贵生成式方法的适用场景，并确定了由序列维度和评估成本所施加的实际限制。

Abstract

Antibody lead optimization is inherently a multi-objective challenge in drug discovery. Achieving a balance between different drug-like properties is crucial for the development of viable candidates, and this search becomes exponentially challenging as desired properties grow. The ever-growing zoo of sophisticated in silico tools for predicting antibody properties calls for an efficient joint optimization procedure to overcome resource-intensive sequential filtering pipelines. We present BOAT, a versatile Bayesian optimization framework for multi-property antibody engineering. Our ‘plug-and-play’ framework couples uncertainty-aware surrogate modeling with a genetic algorithm to jointly optimize various predicted antibody traits while enabling efficient exploration of sequence space. Through systematic benchmarking against genetic algorithms and newer generative learning approaches, we demonstrate competitive performance with state-of-the-art methods for multi-objective protein optimization. We identify clear regimes where surrogate-driven optimization outperforms expensive generative approaches and establish practical limits imposed by sequence dimensionality and oracle costs.

Keywords: antibody design, multi-objective optimization, Bayesian optimization, in silico predictors, drug discovery, genetic algorithm, sequence space exploration, protein optimization
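
Multi-objective optimization rarely has a single best candidate; a common way to compare candidates is Pareto dominance. The sketch below filters a set of candidates down to the non-dominated ones. It is a generic illustration only: the abstract does not say whether BOAT ranks by Pareto dominance or uses a scalarized acquisition function, and the sequence names and property values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the sorted names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical predicted properties per sequence: (affinity, stability).
candidates = {
    "seq_a": (0.9, 0.2),
    "seq_b": (0.6, 0.6),
    "seq_c": (0.5, 0.5),   # dominated by seq_b on both objectives
    "seq_d": (0.1, 0.95),
}

front = pareto_front(candidates)
```

Sequential filtering would discard `seq_a` or `seq_d` at the first threshold that they miss; a joint view keeps them as valid trade-offs.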

241. ❌ Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation

Authors: Shangzhe Li, Weitong Zhang Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13966v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

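Scoring rationale: The paper studies value-function adaptation in offline-to-online reinforcement learning under general function approximation. Although it involves pre-training and adaptation, every keyword targets large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper focuses on adapting Q-functions in reinforcement learning and does not address language models, deep learning architectures, or large-model techniques. All keywords are therefore unrelated and score 0.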

!!! tip deepseek-chat TL;DR

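The paper studies reinforcement learning under general function approximation in which an offline-pretrained Q-function is adapted to the target environment through limited online interaction. It proposes the O2O-LSVI algorithm and proves that its sample complexity improves over pure online RL.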

Abstract (Chinese translation)

本研究探讨了在一般函数逼近框架下离线至在线强化学习中的价值函数适应问题。学习者从一个不完美的离线预训练$Q$函数出发，仅通过有限量的在线交互使其适应目标环境。我们首先通过建立极小极大下界来刻画该场景的难度，证明即使预训练的$Q$函数接近最优$Q^\star$，在某些困难实例上在线适应的效率可能无法超越纯在线强化学习。在积极的一面，基于离线预训练价值函数的一种新颖结构条件，我们提出了O2O-LSVI适应算法，该算法具有问题依赖的样本复杂度，理论上可证明其性能优于纯在线强化学习。最后，我们通过神经网络实验验证了所提方法的实际有效性，从而对理论分析进行了补充。

Abstract

We study value adaptation in offline-to-online reinforcement learning under general function approximation. Starting from an imperfect offline pretrained $Q$-function, the learner aims to adapt it to the target environment using only a limited amount of online interaction. We first characterize the difficulty of this setting by establishing a minimax lower bound, showing that even when the pretrained $Q$-function is close to optimal $Q^\star$, online adaptation can be no more efficient than pure online RL on certain hard instances. On the positive side, under a novel structural condition on the offline-pretrained value functions, we propose O2O-LSVI, an adaptation algorithm with problem-dependent sample complexity that provably improves over pure online RL. Finally, we complement our theory with neural-network experiments that demonstrate the practical effectiveness of the proposed method.

Keywords: offline-to-online reinforcement learning, value adaptation, general function approximation, Q-function, sample complexity, O2O-LSVI algorithm, minimax lower bound

242. ❌ Quantum Machine Learning for Colorectal Cancer Data: Anastomotic Leak Classification and Risk Factors

Authors: Vojtěch Novák, Ivan Zelinka, Lenka Přibylová, Lubomír Martínek, Vladimír Benčurík, Martin Beseda Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13951v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

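Scoring rationale: The paper applies quantum machine learning to anastomotic leak prediction in colorectal cancer, an application of AI in biomedicine. It does not touch on large language models, deep learning principles, or the related keywords (MoE, SFT, RLHF, RAG, and so on), so every keyword scores 0 except 'AI for Science OR Bioinformatics OR Cheminformatics', which is rated 5 because the work is a bioinformatics application. Its core is a comparison of quantum neural networks with classical models, not large-model technology.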

!!! tip deepseek-chat TL;DR

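The study compares quantum neural networks with classical models for anastomotic leak prediction in colorectal cancer and finds that the quantum approach achieves higher sensitivity on low-prevalence data (83.3% vs. 66.7%).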

Abstract (Chinese translation)

本研究评估结直肠手术风险因素，并比较经典模型与量子神经网络在吻合口瘘预测中的性能。通过分析吻合口瘘发生率为14%的临床数据，我们在模拟噪声环境下测试了采用ZZFeatureMap编码结合RealAmplitudes和EfficientSU2拟设的量子模型。经$F_β$分数优化的量子模型配置显示出显著高于经典基线模型（66.7%）的敏感性（83.3%）。这表明量子特征空间能更有效地识别少数类别，这对于低发生率临床风险预测至关重要。本研究探索了噪声环境下多种优化器的表现，揭示了关键的性能权衡关系，并为未来硬件部署指明了方向。

Abstract

This study evaluates colorectal risk factors and compares classical models against Quantum Neural Networks (QNNs) for anastomotic leak prediction. Analyzing clinical data with 14% leak prevalence, we tested ZZFeatureMap encodings with RealAmplitudes and EfficientSU2 ansatze under simulated noise. $F_\beta$-optimized quantum configurations yielded significantly higher sensitivity (83.3%) than classical baselines (66.7%). This demonstrates that quantum feature spaces better prioritize minority class identification, which is critical for low-prevalence clinical risk prediction. Our work explores various optimizers under noisy conditions, highlighting key trade-offs and future directions for hardware deployment.

Keywords: Quantum Machine Learning, Colorectal Cancer, Anastomotic Leak, Quantum Neural Networks, Clinical Risk Prediction, ZZFeatureMap, RealAmplitudes, EfficientSU2
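
The abstract reports models selected by an $F_\beta$ score and compared on sensitivity. The sketch below spells out both standard definitions and shows why a β above 1 favors the classifier that misses fewer positives. The confusion counts are hypothetical and are not taken from the paper, which does not give its confusion matrices or its value of β in the abstract.

```python
def sensitivity(tp, fn):
    """Recall (true-positive rate): share of actual positives that are caught."""
    return tp / (tp + fn)


def f_beta(tp, fp, fn, beta):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 favors recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for two classifiers on the same test set (10 positives).
high_recall = dict(tp=9, fp=20, fn=1)    # misses few leaks, many false alarms
balanced = dict(tp=6, fp=4, fn=4)        # fewer false alarms, more missed leaks

f1_high_recall = f_beta(beta=1.0, **high_recall)
f1_balanced = f_beta(beta=1.0, **balanced)
f2_high_recall = f_beta(beta=2.0, **high_recall)
f2_balanced = f_beta(beta=2.0, **balanced)
```

With these counts F1 prefers the balanced classifier, while F2 prefers the high-recall one, which is the behavior one wants when a missed positive is costlier than a false alarm.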

243. ❌ Unsupervised Anomaly Detection in Process-Complex Industrial Time Series: A Real-World Case Study

Authors: Sergej Krasnikov, Lukas Meitz, Samineh Bagheri, Michael Heider, Thorsten Schöler, Jörg Hähner Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13928v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

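Scoring rationale: The paper is an empirical study of anomaly detection in industrial time series using a classical machine learning method (Isolation Forest) and autoencoder architectures. It does not involve large language models, novel deep learning principles, or applications of AI to science. All keywords relate to large models, deep learning techniques, or AI for science, whereas this paper concerns conventional time-series analysis, so every keyword has relevance 0.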

!!! tip deepseek-chat TL;DR

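The paper studies anomaly detection in industrial time series and shows empirically that autoencoders, in particular temporal convolutional autoencoders, handle the complex, non-periodic, multi-scale dynamics of real industrial data better than the classical Isolation Forest baseline.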

Abstract (Chinese translation)

真实生产环境中的工业时序数据展现出远高于常用基准数据集的复杂性，这主要源于异构、多阶段的生产流程。因此，在简化条件下验证有效的异常检测方法往往难以推广到工业场景中。本研究基于一套从全工况工业设备采集的独特数据集展开实证分析，该数据集明确捕捉了显著的流程诱发变异。我们评估了哪些模型类别能够有效刻画这种复杂性，从经典的孤立森林基线方法出发，延伸至多种自编码器架构。实验结果表明，孤立森林难以充分建模数据中存在的非周期性、多尺度动态特征，而自编码器则始终表现更优。其中，时序卷积自编码器实现了最稳健的性能，而循环与变分自编码器变体则需要更精细的调参。

Abstract

Industrial time-series data from real production environments exhibits substantially higher complexity than commonly used benchmark datasets, primarily due to heterogeneous, multi-stage operational processes. As a result, anomaly detection methods validated under simplified conditions often fail to generalize to industrial settings. This work presents an empirical study on a unique dataset collected from fully operational industrial machinery, explicitly capturing pronounced process-induced variability. We evaluate which model classes are capable of capturing this complexity, starting with a classical Isolation Forest baseline and extending to multiple autoencoder architectures. Experimental results show that Isolation Forest is insufficient for modeling the non-periodic, multi-scale dynamics present in the data, whereas autoencoders consistently perform better. Among them, temporal convolutional autoencoders achieve the most robust performance, while recurrent and variational variants require more careful tuning.

Keywords: anomaly detection, industrial time series, autoencoders, temporal convolutional autoencoders, Isolation Forest, process complexity, real-world case study, multi-stage operational processes

244. ❌ Nested Fourier-enhanced neural operator for efficient modeling of radiation transfer in fires

Authors: Anran Jiao, Wengyao Jiang, Xiaoyi Lu, Yi Wang, Lu Lu Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13919v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

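Scoring rationale: The paper proposes a machine learning framework based on Fourier-enhanced multiple-input neural operators (Fourier-MIONet) for efficient simulation of radiation transfer in fires, an application of AI to scientific computing (specifically computational fluid dynamics). Its core is a novel use of a neural operator architecture for physical simulation, which is unrelated to the vast majority of keywords (large language models, training techniques, inference optimization, agents, and so on). It relates only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to science (CFD and fire simulation) but does not involve bioinformatics or cheminformatics, and its contribution lies in a specific neural operator architecture rather than in general AI for Science methodology, so it receives 5 (some relevance).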

!!! tip deepseek-chat TL;DR

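To address the high computational cost of solving the radiative transfer equation in fire simulations, the study proposes a machine learning framework based on a nested Fourier-enhanced neural operator. In 3D fire simulations it reaches global relative errors of 2-4% while running inference faster than the estimated cost of one finite-volume radiation solve in FireFOAM.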

Abstract (Chinese translation)

计算流体动力学（CFD）已成为预测火灾行为的重要工具，但兼顾效率与精度仍具挑战性。火灾模拟中计算成本的主要来源是辐射传递的建模，这通常是火灾中主导的传热机制。使用传统数值方法求解高维辐射传递方程（RTE）可能成为性能瓶颈。本文提出一种基于傅里叶增强多输入神经算子（Fourier-MIONet）的机器学习框架，作为RTE直接数值积分的高效替代方案。我们首先在小尺度二维池火中评估神经算子架构的性能，发现Fourier-MIONet能提供最精确的辐射解预测。随后将该方法扩展至三维CFD火灾模拟，其中计算网格在多个层级上进行局部细化。在此类高分辨率设置下，用于直接场对场映射的整体代理模型难以训练且计算效率低下。为解决此问题，我们提出一种嵌套式Fourier-MIONet来预测跨多级网格细化的辐射解。我们在使用FireFOAM模拟的三维McCaffrey池火上验证了该方法，包括固定火源尺寸及在连续热释放速率（HRR）范围内训练的通用模型。所提方法在三维变HRR场景中实现了2-4%的全局相对误差，同时其推理速度比FireFOAM中采用16立体角离散的有限体积辐射求解的预估计算成本更快。凭借快速而精确的推理能力，该代理模型使更高精度的辐射处理成为可能，并有助于将更精细的光谱分辨辐射模型纳入工程应用的CFD火灾模拟中。

Abstract

Computational fluid dynamics (CFD) has become an essential tool for predicting fire behavior, yet maintaining both efficiency and accuracy remains challenging. A major source of computational cost in fire simulations is the modeling of radiation transfer, which is usually the dominant heat transfer mechanism in fires. Solving the high-dimensional radiative transfer equation (RTE) with traditional numerical methods can be a performance bottleneck. Here, we present a machine learning framework based on Fourier-enhanced multiple-input neural operators (Fourier-MIONet) as an efficient alternative to direct numerical integration of the RTE. We first investigate the performance of neural operator architectures for a small-scale 2D pool fire and find that Fourier-MIONet provides the most accurate radiative solution predictions. The approach is then extended to 3D CFD fire simulations, where the computational mesh is locally refined across multiple levels. In these high-resolution settings, monolithic surrogate models for direct field-to-field mapping become difficult to train and computationally inefficient. To address this issue, a nested Fourier-MIONet is proposed to predict radiation solutions across multiple mesh-refinement levels. We validate the approach on 3D McCaffrey pool fires simulated with FireFOAM, including fixed fire sizes and a unified model trained over a continuous range of heat release rates (HRRs). The proposed method achieves global relative errors of 2-4% for 3D varying-HRR scenarios while providing faster inference than the estimated cost of one finite-volume radiation solve in FireFOAM for the 16-solid-angle case. With fast and accurate inference, the surrogate makes higher-fidelity radiation treatments practical and enables the incorporation of more spectrally resolved radiation models into CFD fire simulations for engineering applications.

Keywords: neural operator, radiation transfer, computational fluid dynamics, fire simulation, Fourier-MIONet, surrogate model, radiative transfer equation, machine learning
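
The abstract quotes "global relative errors of 2-4%" without defining the metric. One common choice for surrogate fields is the relative L2 error over all mesh cells, sketched below. Treating the paper's metric as relative L2 is an assumption, and the field values are hypothetical.

```python
import math


def relative_l2_error(pred, true):
    """||pred - true||_2 / ||true||_2, computed over all cells of a field."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den


# Hypothetical reference field and a surrogate prediction that is 3% too high
# in every cell.
reference = [1.0, 2.0, 3.0, 4.0]
surrogate = [1.03 * value for value in reference]

error = relative_l2_error(surrogate, reference)
```

Because the norm is taken over the whole field, large-magnitude cells dominate this metric; a small global error does not rule out larger local errors in low-intensity regions.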

245. ❌ MolCryst-MLIPs: A Machine-Learned Interatomic Potentials Database for Molecular Crystals

Authors: Adam Lahouari, Shen Ai, Jihye Han, Jillian Hoffstadt, Philipp Hoellmer, Charlotte Infante, Pulkita Jain, Sangram Kadam, Maya M. Martirossyan, Amara McCune, Hypatia Newton, Shlok J. Paul, Willmor Pena, Jonathan Raghoonanan, Sumon Sahu, Oliver Tan, Andrea Vergara, Jutta Rogal, Mark E. Tuckerman Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13897v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

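Scoring rationale: The paper develops a database of machine-learned interatomic potentials (MLIPs) for molecular crystals, which falls under AI for Science, specifically applications of cheminformatics and bioinformatics. It is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10), because its core is the use of machine learning (MACE models and an automated machine learning pipeline) to solve the scientific problem of simulating molecular crystals. However, it does not involve large language models (LLMs), novel deep learning principles, or the specific techniques named in the other keywords (MoE, SFT, RAG, inference acceleration, and so on), so all other keywords score 0.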

!!! tip deepseek-chat TL;DR

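The paper develops MolCryst-MLIPs, an open database of machine-learned interatomic potentials fine-tuned for nine molecular crystal systems, intended for molecular dynamics simulations of molecular crystal polymorphism under different thermodynamic conditions.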

Abstract (Chinese translation)

我们推出了一个名为MolCryst-MLIPs的开放式机器学习原子间势能（Machine-Learned Interatomic Potentials, MLIP）分子晶体数据库。其首版发布了针对九种分子晶体体系——苯甲酰胺、苯甲酸、香豆素、均四甲苯、异烟酰胺、烟酰胺（Niacinamide）、烟酰胺（Nicotinamide）、吡嗪酰胺和间苯二酚——的精细调优MACE模型。这些模型通过自动化机器学习流程（Automated Machine Learning Pipeline, AMLP）开发，该流程将MLIP开发的全工作链（从参考数据生成到模型训练与验证）整合为一条可复现且用户友好的流水线。所有模型均基于MACE-MH-1基础模型（omol头）进行微调，在所有体系中实现了平均能量平均绝对误差（MAE）0.141千焦/摩尔/原子和平均力平均绝对误差0.648千焦/摩尔/埃。通过分子动力学模拟，我们利用能量守恒、P2取向序参数和径向分布函数评估了模型的动力学稳定性与结构完整性。所发布的模型与数据集构成了一个持续扩展的已验证MLIP开放数据库，可直接用于不同热力学条件下分子晶体多晶型现象的生产级分子动力学模拟。

Abstract

We present an open Molecular Crystal (MC) database of Machine-Learned Interatomic Potentials (MLIP) called MolCryst-MLIPs. The first release comprises fine-tuned MACE models for nine molecular crystal systems – Benzamide, Benzoic acid, Coumarin, Durene, Isonicotinamide, Niacinamide, Nicotinamide, Pyrazinamide, and Resorcinol – developed using the Automated Machine Learning Pipeline (AMLP), which streamlines the entire MLIP development workflow, from reference data generation to model training and validation, into a reproducible and user-friendly pipeline. Models are fine-tuned from the MACE-MH-1 foundation model (omol head), yielding a mean energy MAE of 0.141 kJ/mol/atom and a mean force MAE of 0.648 kJ/mol/Angstrom across all systems. Dynamical stability and structural integrity, as assessed through energy conservation, P2 orientational order parameters, and radial distribution functions, are evaluated using molecular dynamics simulations. The released models and datasets constitute a growing open database of validated MLIPs, ready for production MD simulations of molecular crystal polymorphism under different thermodynamic conditions.

Keywords: Molecular Crystals, Machine-Learned Interatomic Potentials, MACE models, Automated Machine Learning Pipeline, Molecular Dynamics Simulations, Polymorphism, Database, Fine-tuning
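
The abstract reports errors in kJ/mol, whereas much of the interatomic-potential literature quotes meV/atom and meV/Å. The sketch below converts the two reported MAEs using 1 eV per particle ≈ 96.485 kJ/mol (elementary charge times the Avogadro constant). Reading "kJ/mol/atom" as kJ/mol divided by the number of atoms is an assumption about the paper's convention.

```python
# 1 eV per particle expressed in kJ/mol: e * N_A = 96485.33212 J/mol.
KJ_PER_MOL_PER_EV = 96.48533212


def kj_mol_to_mev(value_kj_mol):
    """Convert an energy in kJ/mol to milli-electronvolts per particle."""
    return value_kj_mol / KJ_PER_MOL_PER_EV * 1000.0


# MAE values as reported in the abstract.
energy_mae_mev_per_atom = kj_mol_to_mev(0.141)      # from kJ/mol/atom
force_mae_mev_per_angstrom = kj_mol_to_mev(0.648)   # from kJ/mol/Angstrom

print(f"{energy_mae_mev_per_atom:.2f} meV/atom, "
      f"{force_mae_mev_per_angstrom:.2f} meV/Angstrom")
```

The converted figures come to roughly 1.5 meV/atom for energies and 6.7 meV/Å for forces.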

246. ❌ Sandpile Economics: Theory, Identification, and Evidence

Authors: Diego Vallarino Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13890v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

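Scoring rationale: "Sandpile Economics: Theory, Identification, and Evidence" studies macroeconomic instability and the evolutionary geometry of production networks. It uses the Forman-Ricci curvature of the input-output graph to capture local substitution possibilities and tests empirically how curvature relates to output growth. Every scoring keyword concerns large models, deep learning techniques, or AI applied to science, whereas this paper is about economic theory, network analysis, and empirical macroeconomics and contains no artificial intelligence, machine learning, or large-model content, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper asks why capitalist economies recurrently generate crises whose severity is disproportionate to the size of the triggering shock. It proposes the Sandpile Economics framework, which uses the Forman-Ricci curvature of production networks to capture local substitution possibilities, and finds empirically that curvature robustly predicts medium-run output dynamics and helps explain cross-country differences in resilience.

Abstract translation (Chinese)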

为何资本主义经济会反复爆发严重程度远超触发冲击规模的危机？本文基于生产网络的演化几何结构提出了一种结构性解释。随着经济通过专业化、一体化与竞争选择不断演化，其部门间关联逐渐趋向几何脆弱性递增的配置状态，最终跨越临界阈值，使得微小扰动引发不成比例的大规模级联效应。
我们引入“沙堆经济学”这一形式化框架，将宏观经济不稳定性解读为非均衡生产网络的涌现属性。其关键状态变量是投入产出图的福尔曼-里奇曲率，该指标刻画了供应链中断时局部的替代可能性。我们证明当曲率低于内生临界阈值时，级联效应规模的分布服从尾指数$α\in (1,2)$的幂律分布，这意味着经济进入了无界放大的状态区间。
其深层机制具有演化特征：专业化降低了投入替代弹性，推动经济趋向临界状态；而危机事件会诱发内生性网络重构与路径依赖。这些动态本质上是非遍历的，无法被代表性主体框架所捕捉。
基于全球投入产出数据的实证研究表明：生产网络持续运行于负曲率状态，且曲率指标能稳健预测中期产出动态。曲率每增加一个标准差，对应着三年期累计增长率的显著提升；在解释各国经济韧性差异时，曲率指标系统性地优于传统网络度量指标。

Abstract

Why do capitalist economies recurrently generate crises whose severity is disproportionate to the size of the triggering shock? This paper proposes a structural answer grounded in the evolutionary geometry of production networks. As economies evolve through specialization, integration, and competitive selection, their inter-sectoral linkages drift toward configurations of increasing geometric fragility, eventually crossing a threshold beyond which small disturbances generate disproportionately large cascades. We introduce Sandpile Economics, a formal framework that interprets macroeconomic instability as an emergent property of disequilibrium production networks. The key state variable is the Forman–Ricci curvature of the input–output graph, capturing local substitution possibilities when supply chains are disrupted. We show that when curvature falls below an endogenous threshold, the distribution of cascade sizes follows a power law with tail index $α\in (1,2)$, implying a regime of unbounded amplification. The underlying mechanism is evolutionary: specialization reduces input substitutability, pushing the economy toward criticality, while crisis episodes induce endogenous network reconfiguration and path dependence. These dynamics are inherently non-ergodic and cannot be captured by representative-agent frameworks. Empirically, using global input–output data, we document that production networks operate in persistently negative curvature regimes and that curvature robustly predicts medium-run output dynamics. A one-standard-deviation increase in curvature is associated with higher cumulative growth over three-year horizons, and curvature systematically outperforms standard network metrics in explaining cross-country differences in resilience.

Keywords: Sandpile Economics, production networks, Forman-Ricci curvature, macroeconomic instability, cascade amplification, input-output graph, evolutionary geometry, economic resilience
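
As a rough illustration of the curvature measure named in the abstract, the sketch below computes the simplest combinatorial Forman-Ricci curvature of each edge in an unweighted, undirected graph, F(u, v) = 4 - deg(u) - deg(v). This form is an assumption chosen for illustration: input-output graphs are weighted and directed, and the paper's exact definition may differ. The toy graphs and the function name are hypothetical.

```python
from collections import defaultdict

def forman_ricci(edges):
    """Combinatorial Forman-Ricci curvature for each edge of an
    unweighted, undirected simple graph: F(u, v) = 4 - deg(u) - deg(v)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {(u, v): 4 - degree[u] - degree[v] for u, v in edges}

# Hypothetical toy network: one hub sector linked to four others.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
# Hypothetical ring of four sectors, each with exactly two partners.
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

star_curvature = forman_ricci(star)
ring_curvature = forman_ricci(ring)
```

In this simplified form, edges attached to high-degree nodes come out more negative, so the hub edges score lower than the ring edges.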

247. ❌ Drowsiness-Aware Adaptive Autonomous Braking System based on Deep Reinforcement Learning for Enhanced Road Safety

Authors: Hossem Eddine Hafidi, Elisabetta De Giovanni, Teodoro Montanaro, Ilaria Sergi, Massimo De Vittorio, Luigi Patrono Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13878v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

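Scoring rationale: The paper uses deep reinforcement learning (DRL) and a recurrent neural network (RNN) to build an autonomous braking system that adapts to the driver's physiological state, and evaluates it in the CARLA simulator. Its core techniques are a Double-Dueling DQN and an RNN for ECG signal processing. All of the given keywords concern large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, CoT, and agents) or specific AI-for-science applications (such as bioinformatics). The paper does not involve LLMs, their training methods, reasoning techniques, alignment, agents, or related applications, nor bioinformatics or cheminformatics, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes an adaptive autonomous braking system based on deep reinforcement learning that detects driver drowsiness from ECG signals with an RNN and, in CARLA simulation, reports a 99.99% success rate in avoiding collisions.

Abstract translation (Chinese)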

驾驶员疲劳显著削弱了对安全刹车距离的准确判断能力，据估计在欧洲导致了10%-20%的道路事故。传统的驾驶辅助系统缺乏对疲劳等实时生理状态的适应性。本文提出了一种基于深度强化学习的自主刹车系统，该系统将车辆动力学与驾驶员的生理数据相结合。疲劳状态通过循环神经网络（RNN, Recurrent Neural Network）从心电信号中检测得出，该网络是经过对不同分段和重叠配置的2分钟时间窗口进行广泛基准分析后选定的。推断出的疲劳状态被纳入双竞争深度Q网络（DQN, Double-Dueling Deep Q-Network）智能体的可观测状态空间中，其中驾驶员能力下降被建模为动作延迟。该系统在高保真CARLA仿真环境中实现并评估。实验结果表明，所提出的智能体在疲劳和非疲劳条件下均实现了99.99%的碰撞避免成功率。这些发现证明了基于生理感知的控制策略对于增强自适应和智能驾驶安全系统的有效性。

Abstract

Driver drowsiness significantly impairs the ability to accurately judge safe braking distances and is estimated to contribute to 10%-20% of road accidents in Europe. Traditional driver-assistance systems lack adaptability to real-time physiological states such as drowsiness. This paper proposes a deep reinforcement learning-based autonomous braking system that integrates vehicle dynamics with driver physiological data. Drowsiness is detected from ECG signals using a Recurrent Neural Network (RNN), selected through an extensive benchmark analysis of 2-minute windows with varying segmentation and overlap configurations. The inferred drowsiness state is incorporated into the observable state space of a Double-Dueling Deep Q-Network (DQN) agent, where driver impairment is modeled as an action delay. The system is implemented and evaluated in a high-fidelity CARLA simulation environment. Experimental results show that the proposed agent achieves a 99.99% success rate in avoiding collisions under both drowsy and non-drowsy conditions. These findings demonstrate the effectiveness of physiology-aware control strategies for enhancing adaptive and intelligent driving safety systems.

Keywords: Deep Reinforcement Learning, Autonomous Braking System, Driver Drowsiness Detection, ECG Signal Analysis, Recurrent Neural Network, CARLA Simulation, Road Safety, Adaptive Control
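
For readers unfamiliar with the "Double-Dueling DQN" named in the abstract, the sketch below shows the two standard ingredients in isolation: the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A), and the Double DQN target, where the online network picks the next action and the target network evaluates it. All numbers are hypothetical, and the paper's network, reward design, and action-delay model are not reproduced here.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values,
    subtracting the mean advantage for identifiability."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: select the action with the online network,
    evaluate it with the target network."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]

# Hypothetical values for a three-action braking agent.
q_values = dueling_q(value=1.0, advantages=[0.5, -0.5, 0.0])
target = double_dqn_target(
    reward=1.0,
    gamma=0.9,
    q_online_next=[0.2, 0.8, 0.1],
    q_target_next=[0.3, 0.5, 0.9],
    done=False,
)
```

Decoupling action selection from evaluation is the usual motivation for Double DQN: here the target uses 0.5 (the target network's value for the online network's choice) rather than the target network's own maximum of 0.9.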

248. ❌ Hardware-Efficient Neuro-Symbolic Networks with the Exp-Minus-Log Operator

Authors: Eymen Ipek Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13871v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

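Scoring rationale: The paper studies neuro-symbolic networks and hardware-efficient deep learning architectures, using the Exp-Minus-Log operator to build an interpretable hybrid model. It is unrelated to the vast majority of keywords (large models, training methods, reasoning techniques, alignment, agents, and so on). It has only some connection to "Mechanistic Interpretability OR Explainable AI" (5 points): the paper discusses interpretability and formal verification, but these are not its core contribution, which is hardware efficiency and the neuro-symbolic architecture.

!!! tip deepseek-chat TL;DR

The paper proposes a hardware-efficient neuro-symbolic architecture built on the Exp-Minus-Log (EML) operator, pairing a DNN trunk with an interpretable EML-tree head. Its own assessment is that EML is unlikely to speed up training, or inference on commodity CPU/GPU, but that a custom EML cell (FPGA logic block or analog circuit) could give up to an order-of-magnitude asymptotic latency advantage together with gains in interpretability and formal-verification tractability.

Abstract translation (Chinese)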

深度神经网络（DNNs）在回归和分类任务上实现了最先进的精度，然而其两个结构性缺陷持续阻碍了其在安全关键、资源受限环境中的部署：（i）所学函数的不透明性，这阻碍了形式化验证；（ii）对异构的、受库函数限制的激活函数的依赖，这增加了边缘硬件上的延迟和硅面积。最近提出的指数减对数（EML）谢弗算子，eml(x, y) = exp(x) - ln(y)，被Odrzywolek（2026年）证明足以——连同常数1一起——将每一个标准初等函数表达为相同节点的二叉树。我们提出将EML基元嵌入到传统的DNN架构中，从而产生一种混合的DNN-EML模型，其中主干部分学习分布式表示，而头部是一个深度有界、权重稀疏的EML树，其截断后的权重可坍缩为闭式符号子表达式。我们推导了前向方程，证明了计算成本边界，分析了相对于多层感知机（MLPs）和物理信息神经网络（PINNs）的推理与训练加速，并量化了其在FPGA/模拟部署中的权衡。我们认为，DNN-EML组合填补了文献中的一个空白：先前的神经符号和方程学习器方法（EQL、KAN、AI-Feynman）使用异构的基元集合，且未利用单一的可硬件实现的谢弗元件。一项平衡的评估表明，EML不太可能加速训练，且在商用CPU/GPU上也不太可能加速推理；然而，在定制的EML单元（FPGA逻辑块或模拟电路）上，其渐近延迟优势可达一个数量级，同时还能获得可解释性和形式化验证可处理性的提升。

Abstract

Deep neural networks (DNNs) deliver state-of-the-art accuracy on regression and classification tasks, yet two structural deficits persistently obstruct their deployment in safety-critical, resource-constrained settings: (i) opacity of the learned function, which precludes formal verification, and (ii) reliance on heterogeneous, library-bound activation functions that inflate latency and silicon area on edge hardware. The recently introduced Exp-Minus-Log (EML) Sheffer operator, eml(x, y) = exp(x) - ln(y), was shown by Odrzywolek (2026) to be sufficient - together with the constant 1 - to express every standard elementary function as a binary tree of identical nodes. We propose to embed EML primitives inside conventional DNN architectures, yielding a hybrid DNN-EML model in which the trunk learns distributed representations and the head is a depth-bounded, weight-sparse EML tree whose snapped weights collapse to closed-form symbolic sub-expressions. We derive the forward equations, prove computational-cost bounds, analyse inference and training acceleration relative to multilayer perceptrons (MLPs) and physics-informed neural networks (PINNs), and quantify the trade-offs for FPGA/analog deployment. We argue that the DNN-EML pairing closes a literature gap: prior neuro-symbolic and equation-learner approaches (EQL, KAN, AI-Feynman) work with heterogeneous primitive sets and do not exploit a single hardware-realisable Sheffer element. A balanced assessment shows that EML is unlikely to accelerate training, and on commodity CPU/GPU it is also unlikely to accelerate inference; however, on a custom EML cell (FPGA logic block or analog circuit) the asymptotic latency advantage can reach an order of magnitude with simultaneous gain in interpretability and formal-verification tractability.

Keywords: Neuro-Symbolic Networks, Exp-Minus-Log Operator, Hardware-Efficient, Interpretability, Formal Verification, Edge Hardware, FPGA Deployment, DNN-EML Hybrid Model
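
To make the operator eml(x, y) = exp(x) - ln(y) from the abstract concrete, the sketch below shows how exp and ln can each be written as a small tree of eml nodes and the constant 1. The two identities are elementary consequences of the definition, worked out here for illustration; they are not taken from the paper, and the helper names are hypothetical.

```python
import math

def eml(x, y):
    """Exp-Minus-Log operator as defined in the abstract."""
    return math.exp(x) - math.log(y)

def exp_via_eml(x):
    # ln(1) = 0, so eml(x, 1) = exp(x).
    return eml(x, 1.0)

def ln_via_eml(x):
    # eml(1, x)          = e - ln(x)
    # eml(e - ln(x), 1)  = exp(e - ln(x)) = e**e / x
    # eml(1, e**e / x)   = e - (e - ln(x)) = ln(x)
    return eml(1.0, eml(eml(1.0, x), 1.0))

# The constant e itself is a single node: eml(1, 1) = e - 0.
E = eml(1.0, 1.0)
```

The ln tree is valid for x > 0 and uses three identical nodes, which is the kind of homogeneous structure the abstract refers to.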

249. ❌ Gradient Descent’s Last Iterate is Often (slightly) Suboptimal

Authors: Guy Kornowski, Ohad Shamir Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13870v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

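Scoring rationale: The paper studies last-iterate convergence of gradient descent (GD) and stochastic gradient descent (SGD) in convex optimization, which is classical optimization theory. It is unrelated to all scoring keywords, which center on large models, deep learning techniques, and their applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proves that, for GD and SGD, when the total number of iterations T is not known in advance, the last iterate's error cannot avoid an excess poly-logarithmic factor, confirming a conjecture of Jain et al.

Abstract translation (Chinese)

我们考虑在最小化凸利普希茨函数这一经典设定下，使用梯度下降法（GD）或其随机变体（SGD），并考察其最后迭代点的收敛性。目前已知，采用标准步长选择时，经过 $T$ 步迭代后，最后迭代点的收敛速度为 $\log T/\sqrt{T}$。Jain 等人 [2019] 的一项突破性成果通过构建一种非标准步长序列，恢复了最优的 $1/\sqrt{T}$ 收敛速率。然而，该序列需要预先确定 $T$，这与适用于任意时间范围的常见步长调度方式不同。此外，Jain 等人猜想，若没有关于 $T$ 的先验知识，任何步长序列都无法保证 SGD 最后迭代点达到最优误差，这一论断此前一直未被证明。我们证明了这一猜想，并进一步表明，即使在 GD 的无噪声情形下，若要求对任意时刻的最后迭代点提供保证，也无法避免关于 $T$ 的额外多对数因子。我们的证明还表明，这类（略微）次优的停止时刻不可避免地普遍存在。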

Abstract

We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD’s last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

Keywords: gradient descent, stochastic gradient descent, last iterate convergence, convex optimization, stepsize sequence, convergence rate, suboptimality
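
The setting in the abstract can be tried out on a toy problem. The sketch below runs subgradient descent on f(x) = |x| (convex and 1-Lipschitz) with the anytime stepsize 1/sqrt(t), which does not depend on the horizon, and returns both the last iterate and the running average. The test function, starting point, and stepsize constant are assumptions chosen for illustration; the paper's lower-bound construction is not reproduced.

```python
import math

def subgradient_descent(x0, steps):
    """Minimise f(x) = |x| with the horizon-free stepsize 1 / sqrt(t).

    Returns the iterate after `steps` updates and the average of the
    iterates visited before each update."""
    x = x0
    total = 0.0
    for t in range(1, steps + 1):
        total += x
        grad = (x > 0) - (x < 0)  # a subgradient of |x| (0 at x = 0)
        x -= grad / math.sqrt(t)
    return x, total / steps

last, avg = subgradient_descent(x0=0.7, steps=1000)
last_error = abs(last)  # f(last) - f(x*) with x* = 0
avg_error = abs(avg)    # f(avg)  - f(x*)
```

Because the schedule never uses the total step count, the run can be stopped at any time, which is the "anytime" guarantee the abstract contrasts with horizon-dependent stepsizes.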

250. ❌ Simulation-Based Optimisation of Batting Order and Bowling Plans in T20 Cricket

Authors: Tinniam V Ganesh Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13861v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

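Scoring rationale: The paper applies a Markov Decision Process (MDP), Monte Carlo simulation, and simulated annealing, all classical optimization and simulation methods, to tactical decisions in T20 cricket (batting order and bowling plan). It does not involve large models, deep learning, AI technique fundamentals, or AI-for-science applications, and is unrelated to all scoring keywords.

!!! tip deepseek-chat TL;DR

The study develops a framework based on a Markov Decision Process and Monte Carlo simulation for optimizing batting order and bowling plans in T20 cricket. Applied to two 2026 IPL matches, it raises Mumbai Indians' win probability by 4.1 percentage points and Gujarat Titans' defend probability by 5.2 percentage points.

Abstract translation (Chinese)

本文构建了一个统一的马尔可夫决策过程（MDP）框架，用于优化T20板球比赛中两项反复出现的场内决策，即击球顺序选择与投球计划分配，其优化目标直接基于获胜概率与防守概率，而非预期得分。研究利用1,161场印度板球超级联赛（IPL）逐球记录（2008-2025年），通过詹姆斯-斯坦收缩法估计了一个包含三个阶段（强力击球、中场、终局）的球员能力剖面引擎。获胜/防守概率通过向量化蒙特卡洛模拟对N = 50,000条单局（innings）轨迹进行评估。击球顺序通过穷举枚举进行搜索。投球计划则通过模拟退火算法在剩余配额约束下计算，并满足同一投球手不得连续投两轮（over）的约束。应用于两场2026年IPL比赛的案例显示，最优击球顺序使孟买印第安人队的获胜概率提升了4.1个百分点（从52.4%增至56.5%），而古吉拉特泰坦队的最优投球计划则使防守概率提升了5.2个百分点（从39.1%增至44.3%）。两种情况下观察到的次优决策均与无视比赛阶段的部署方式相一致：这些决策在综合指标下看似合理，但在应用分阶段球员剖面时则暴露出明显的代价。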

Abstract

This paper develops a unified Markov Decision Process (MDP) framework for optimising two recurring in-match decisions in T20 cricket namely batting order selection and bowling plan assignment, directly in terms of win and defend probability rather than expected runs. A three-phase player profile engine (Powerplay, Middle, Death) with James-Stein shrinkage is estimated from 1,161 IPL ball-by-ball records (2008-2025). Win/defend probabilities are evaluated by vectorised Monte Carlo simulation over N = 50,000 innings trajectories. Batting orders are searched by exhaustive enumeration. Bowling plans are computed by simulated annealing over the remaining quota with the constraint that the same bowler cannot bowl consecutive overs. Applied to two 2026 IPL matches, the optimal batting order improves Mumbai Indians’ win probability by 4.1 percentage points (52.4% to 56.5%), and the optimal Gujarat Titans bowling plan improves defend probability by 5.2 percentage points (39.1% to 44.3%). In both cases the observed sub-optimality is consistent with phase-agnostic deployment in decisions that appear reasonable by aggregate metrics but are exposed as costly when phase-specific profiles are applied.

Keywords: Markov Decision Process, Monte Carlo simulation, T20 cricket, batting order optimization, bowling plan assignment, win probability, simulated annealing, player profiling
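
The Monte Carlo evaluation step described in the abstract can be pictured with a heavily simplified sketch: simulate many innings ball by ball and count how often the chasing side reaches its target. The per-ball outcome probabilities below are hypothetical and not estimated from IPL data, and the sketch ignores player profiles, match phases, and James-Stein shrinkage, all of which the paper's framework includes.

```python
import random

# Hypothetical per-ball outcomes: runs scored, or "W" for a wicket.
OUTCOMES = [0, 1, 2, 4, 6, "W"]
WEIGHTS = [0.35, 0.35, 0.07, 0.12, 0.06, 0.05]

def simulate_chase(target, balls, wickets, rng):
    """Simulate one innings; return True if the target is reached."""
    runs = 0
    for _ in range(balls):
        if runs >= target:
            return True
        outcome = rng.choices(OUTCOMES, weights=WEIGHTS)[0]
        if outcome == "W":
            wickets -= 1
            if wickets == 0:
                return False  # all out before reaching the target
        else:
            runs += outcome
    return runs >= target

def win_probability(target, balls=120, wickets=10, n_sims=2000, seed=0):
    """Estimate the chase win probability as the fraction of simulated wins."""
    rng = random.Random(seed)
    wins = sum(simulate_chase(target, balls, wickets, rng) for _ in range(n_sims))
    return wins / n_sims

p_easy = win_probability(target=1)
p_mid = win_probability(target=160)
p_hard = win_probability(target=400)
```

A decision such as a batting order would then be compared by re-running the estimate under each candidate and keeping the one with the higher probability; the fixed seed only makes the sketch reproducible.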

251. ❌ Randomized Neural Networks for Integro-Differential Equations with Application to Neutron Transport

Authors: Haoning Dang, Fei Wang, Yifan Chen, Zhouyu Liu, Dong Liu, Hongchun Wu Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13830v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

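Scoring rationale: The paper studies randomized neural networks (RaNNs) for solving integro-differential equations, in particular the neutron transport equation, and belongs to scientific computing and numerical methods. It is unrelated to the vast majority of keywords (large models, deep learning principles, training methods, inference optimization, alignment, agents, and so on). The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper applies an AI method (randomized neural networks) to a scientific computing problem (neutron transport). Its core contribution is a numerical method rather than typical AI for Science work such as bioinformatics or cheminformatics, so it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a mesh-free collocation framework based on randomized neural networks (RaNNs) for efficiently solving linear integro-differential equations and, using the neutron transport equation as an example, reports competitive accuracy at substantially lower training cost than the selected baselines.

Abstract translation (Chinese)

积分-微分方程广泛出现于输运、动理学理论、辐射传输及多物理场建模等诸多应用领域，其中非局域积分算子将相空间各处的解耦合起来。这种非局域性往往在确定性离散化中引入稠密的耦合块，导致计算成本与内存使用量增加；而基于物理信息的神经网络则可能面临昂贵的非凸训练过程以及对超参数选择的敏感性。本文提出将随机化神经网络（RaNNs）作为一种无网格配点框架，用于求解线性积分-微分方程。由于随机化神经网络近似通过全局支持的随机特征本质上是稠密的，非局域积分算子不会引入额外的稀疏性损失，同时近似解仍能以相对较少的可训练自由度表示。通过随机固定隐藏层参数并仅求解线性输出权重，训练过程可简化为关于输出系数的凸最小二乘问题，从而实现稳定高效的优化。作为代表性应用，我们将所提框架应用于稳态中子输运方程，这是一个具有散射积分与多样边界条件的高维线性积分-微分模型。大量数值实验表明，在报告的测试设置中，随机化神经网络方法在达到有竞争力的精度的同时，其训练成本显著低于所选用的神经网络与确定性基线方法，这凸显了随机化神经网络可作为非局域线性算子数值模拟的一种稳健高效的替代方案。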

Abstract

Integro-differential equations arise in a wide range of applications, including transport, kinetic theory, radiative transfer, and multiphysics modeling, where nonlocal integral operators couple the solution across phase space. Such nonlocality often introduces dense coupling blocks in deterministic discretizations, leading to increased computational cost and memory usage, while physics-informed neural networks may suffer from expensive nonconvex training and sensitivity to hyperparameter choices. In this work, we present randomized neural networks (RaNNs) as a mesh-free collocation framework for linear integro-differential equations. Because the RaNN approximation is intrinsically dense through globally supported random features, the nonlocal integral operator does not introduce an additional loss of sparsity, while the approximate solution can still be represented with relatively few trainable degrees of freedom. By randomly fixing the hidden-layer parameters and solving only for the linear output weights, the training procedure reduces to a convex least-squares problem in the output coefficients, enabling stable and efficient optimization. As a representative application, we apply the proposed framework to the steady neutron transport equation, a high-dimensional linear integro-differential model featuring scattering integrals and diverse boundary conditions. Extensive numerical experiments demonstrate that, in the reported test settings, the RaNN approach achieves competitive accuracy while incurring substantially lower training cost than the selected neural and deterministic baselines, highlighting RaNNs as a robust and efficient alternative for the numerical simulation of nonlocal linear operators.

Keywords: Randomized Neural Networks, Integro-differential Equations, Neutron Transport, Mesh-free Collocation, Nonlocal Operators, Convex Optimization, Numerical Simulation

252. ❌ Beyond State Consistency: Behavior Consistency in Text-Based World Models

Authors: Youling Huang, Guanqiao Chen, Junchi Yao, Lu Wang, Fangkai Yang, Chao Du, ChenZhuo Zhao, Pu Zhao, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13824v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

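Scoring rationale: The paper focuses on training world models in text-based environments and proposes a new behavior-consistency training paradigm built on the Behavior Consistency Reward (BehR), aimed at improving functional consistency between the world model and the real environment. It is highly relevant to "World Models AND General World Models" (10 points), which is the paper's core topic. The other keywords mostly concern specific LLM techniques (such as MoE, SFT, RLHF, RAG, and quantization), reasoning methods (such as CoT and MCTS), agent systems, or particular application areas (such as AI for Science). This paper studies a general training method for text world models and does not specifically address those techniques, application areas, or agent architectures, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

For world-model training in text-based environments, the paper proposes a training paradigm based on the Behavior Consistency Reward (BehR) to improve long-term functional alignment between the world model and the real environment. Experiments on WebShop and TextWorld show improved long-term alignment in several settings and fewer false positives in offline surrogate evaluation.

Abstract translation (Chinese)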

世界模型正日益成为在线规划和离线评估中，用于衡量交互智能体所生成动作后果的关键组件。在基于文本的环境中，世界模型通常使用单步指标（如精确匹配）进行评估和训练，旨在提升预测状态与现实状态之间的相似度，但此类指标已被证明不足以捕捉智能体的实际行为。为解决这一问题，我们提出了一种新的行为对齐训练范式，旨在提升世界模型与真实环境之间的功能一致性。该范式聚焦于优化一个可处理的步级指标——行为一致性奖励（Behavior Consistency Reward, BehR），该指标用于衡量在冻结参考智能体（Reference Agent）下，已记录的下一个动作在真实状态与世界模型预测状态之间的似然变化程度。在WebShop和TextWorld上的实验表明，基于BehR的训练在多种设定下提升了长期对齐效果，其中在WebShop中收益最为明显，在接近性能上限的场景中变化较小，同时在四种设定中的三种里保持或提升了单步预测质量。通过BehR训练的世界模型在离线代理评估中也实现了更低的误报率，并在推理时前瞻规划中展现出虽有限但令人鼓舞的改进。

Abstract

World models have been emerging as critical components for assessing the consequences of actions generated by interactive agents in online planning and offline evaluation. In text-based environments, world models are typically evaluated and trained with single-step metrics such as Exact Match, aiming to improve the similarity between predicted and real-world states, but such metrics have been shown to be insufficient for capturing actual agent behavior. To address this issue, we introduce a new behavior-aligned training paradigm aimed at improving the functional consistency between the world model and the real environment. This paradigm focuses on optimizing a tractable step-level metric named Behavior Consistency Reward (BehR), which measures how much the likelihood of a logged next action changes between the real state and the world-model-predicted state under a frozen Reference Agent. Experiments on WebShop and TextWorld show that BehR-based training improves long-term alignment in several settings, with the clearest gains in WebShop and less movement in near-ceiling regimes, while preserving or improving single-step prediction quality in three of four settings. World models trained with BehR also achieve lower false positives in offline surrogate evaluation and show modest but encouraging gains in inference-time lookahead planning.

Keywords: World Models, Behavior Consistency, Text-Based Environments, Training Paradigm, Behavior Consistency Reward (BehR), Long-term Alignment, Offline Evaluation, Lookahead Planning

253. ❌ UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

Authors: Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13822v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	8.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

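Scoring rationale: UI-Copilot proposes a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. The work is highly relevant to LLM agents, tool use, and multi-agent systems, since it covers collaboration between the GUI agent and the copilot as well as learning to invoke tools. It is partly related to retrieval-augmented generation, because the copilot acts as a retriever over memory, and partly related to hallucination mitigation, because the paper addresses math hallucination. Other keywords such as MoE, quantization, and inference acceleration are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes the UI-Copilot framework, which uses Tool-Integrated Policy Optimization to address memory degradation, progress confusion, and math hallucination in long-horizon GUI automation. It achieves state-of-the-art performance on MemGUI-Bench and a 17.1% absolute improvement over the base Qwen model on AndroidWorld.

Abstract (Chinese translation)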

基于多模态大语言模型的图形用户界面智能体已在复杂界面交互任务中展现出强大能力。然而，在长周期任务场景中，这些智能体仍面临挑战，因其常需处理超出自身固有能力的任务，并受困于记忆衰退、进度混淆与数学幻觉等问题。为解决这些挑战，我们提出了UI-Copilot——一个协同合作框架：其中GUI智能体专注于任务执行，而轻量级协处理器则按需提供记忆检索与数值计算支持。我们引入记忆解耦机制，将持久性观察与瞬时执行上下文分离，并通过训练策略智能体使其能依据任务需求选择性调用协处理器作为检索器或计算器。为有效学习工具调用能力，我们提出了工具集成策略优化方法，该方法通过单轮预测单独优化工具选择，同时基于策略的多轮推演优化任务执行。实验结果表明，UI-Copilot-7B在具有挑战性的MemGUI-Bench上取得了最先进的性能表现，超越了GUI-Owl-7B、UI-TARS-1.5-7B等强大的7B规模GUI智能体。此外，UI-Copilot-7B在AndroidWorld数据集上相比基础Qwen模型实现了17.1%的绝对性能提升，这凸显了UI-Copilot对真实世界GUI任务具有强大的泛化能力。

Abstract

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot’s strong generalization to real-world GUI tasks.

Keywords: GUI automation, LLM agents, tool use, policy optimization, memory decoupling, multi-agent collaboration, long-horizon tasks, UI-Copilot

254. ❌ RPS: Information Elicitation with Reinforcement Prompt Selection

Authors: Tao Wang, Jingyao Lu, Xibo Wang, Haonan Huang, Su Yao, Zhiqiang Hu, Xingyan Chen, Enmao Diao Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

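Scoring rationale: The paper studies the limitations of LLMs in eliciting information in open-ended dialogue and proposes RPS, a reinforcement-learning-based prompt selection framework. It is highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points); other keywords such as MoE, quantization, inference acceleration, and AI for Science are not covered and therefore score 0.

!!! tip deepseek-chat TL;DR

To address the difficulty LLMs have in eliciting information that users conceal in open-ended dialogue, the paper proposes RPS, a reinforcement-learning-based prompt selection framework, and validates it on a dataset built from legal cases.

Abstract (Chinese translation)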

大型语言模型（LLM）在对话生成与推理方面展现出卓越能力，但在开放式对话中有效引导用户已知却隐藏信息方面仍存在局限。在个人助理、教学系统、法律或临床支持等诸多交互式人工智能应用中，用户常因隐私顾虑、表述模糊或社交犹豫而隐瞒敏感或不确定信息，这使得LLM难以收集完整且符合情境的输入。本研究界定了开放式对话场景中的信息引导问题，并提出强化提示选择框架——一种将提示选择建模为序列决策问题的轻量级强化学习框架。为在受控环境中分析该问题，我们设计了合成实验：其中强化学习智能体表现优于随机查询基线，验证了基于策略的自适应信息引导方法的潜力。基于此发现，RPS通过从提示池中学习策略，在对话中自适应地引导用户透露隐藏或未完整表达的信息。我们还构建了IELegal——一个基于真实法律案例文件构建的新型基准数据集，用于模拟以揭示案件相关事实为目标的对话式信息引导任务。在该场景中，RPS的表现优于静态提示基线，证明了自适应提示选择在LLM驱动的对话系统中引导关键信息的有效性。

Abstract

Large language models (LLMs) have shown remarkable capabilities in dialogue generation and reasoning, yet their effectiveness in eliciting user-known but concealed information in open-ended conversations remains limited. In many interactive AI applications, such as personal assistants, tutoring systems, and legal or clinical support, users often withhold sensitive or uncertain information due to privacy concerns, ambiguity, or social hesitation. This makes it challenging for LLMs to gather complete and contextually relevant inputs. In this work, we define the problem of information elicitation in open-ended dialogue settings and propose Reinforcement Prompt Selection (RPS), a lightweight reinforcement learning framework that formulates prompt selection as a sequential decision-making problem. To analyze this problem in a controlled setting, we design a synthetic experiment, where a reinforcement learning agent outperforms a random query baseline, illustrating the potential of policy-based approaches for adaptive information elicitation. Building on this insight, RPS learns a policy over a pool of prompts to adaptively elicit concealed or incompletely expressed information from users through dialogue. We also introduce IELegal, a new benchmark dataset constructed from real legal case documents, which simulates dialogue-based information elicitation tasks aimed at uncovering case-relevant facts. In this setting, RPS outperforms static prompt baselines, demonstrating the effectiveness of adaptive prompt selection for eliciting critical information in LLM-driven dialogue systems.

Keywords: Large Language Models, Information Elicitation, Reinforcement Learning, Prompt Selection, Dialogue Systems, Legal Case Analysis, Adaptive Querying, Benchmark Dataset

255. ❌ Composite Silhouette: A Subsampling-based Aggregation Strategy

Authors: Aggelos Semoglou, Aristidis Likas, John Pavlopoulos Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13816v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

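Scoring rationale: The paper focuses on determining the number of clusters in unsupervised learning and proposes Composite Silhouette, a new internal validation criterion that improves on the standard Silhouette coefficient through a subsampling-based aggregation strategy. Its content is entirely about cluster analysis, internal validation metrics, and statistical methods in classical machine learning, and it does not involve large models, deep learning, AI for Science, or related techniques. All scoring keywords concern large models, deep learning techniques, and their applications, none of which relate to the paper's topic, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

To address the challenge of determining the number of clusters in unsupervised learning, the paper proposes Composite Silhouette, a subsampling-based aggregation strategy that adaptively combines micro- and macro-averaged Silhouette scores and recovers the ground-truth number of clusters more accurately on synthetic and real-world datasets.

Abstract (Chinese translation)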

确定聚类数量是无监督学习中的核心挑战，因为真实标签通常不可得。轮廓系数是解决该任务广泛使用的内部验证指标，但其标准微观平均形式在规模不平衡时倾向于偏好更大的聚类。宏观平均通过等权重加权各聚类以缓解这一偏差，但可能过度强调少数群体的噪声影响。我们提出复合轮廓系数，这是一种用于选择聚类数量的内部准则，它通过聚合多次子采样聚类的结果而非依赖单一划分来整合证据。对于每个子样本，微观与宏观平均轮廓分数通过自适应凸权重进行组合，该权重由其归一化差异决定并经有界非线性函数平滑处理；最终得分通过平均这些子样本层级的复合分数获得。我们建立了该准则的关键性质，并为其子采样估计推导了有限样本集中性保证。在合成与真实数据集上的实验表明，复合轮廓系数有效调和了微观与宏观平均的优势，能更准确地恢复真实聚类数量。

Abstract

Determining the number of clusters is a central challenge in unsupervised learning, where ground-truth labels are unavailable. The Silhouette coefficient is a widely used internal validation metric for this task, yet its standard micro-averaged form tends to favor larger clusters under size imbalance. Macro-averaging mitigates this bias by weighting clusters equally, but may overemphasize noise from under-represented groups. We introduce Composite Silhouette, an internal criterion for cluster-count selection that aggregates evidence across repeated subsampled clusterings rather than relying on a single partition. For each subsample, micro- and macro-averaged Silhouette scores are combined through an adaptive convex weight determined by their normalized discrepancy and smoothed by a bounded nonlinearity; the final score is then obtained by averaging these subsample-level composites. We establish key properties of the criterion and derive finite-sample concentration guarantees for its subsampling estimate. Experiments on synthetic and real-world datasets show that Composite Silhouette effectively reconciles the strengths of micro- and macro-averaging, yielding more accurate recovery of the ground-truth number of clusters.

Keywords: clustering, unsupervised learning, Silhouette coefficient, internal validation, subsampling, cluster-count selection, micro-averaging, macro-averaging
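
The aggregation described in the abstract above (micro- and macro-averaged Silhouette scores, combined per subsample with an adaptive convex weight and then averaged) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the `tanh`-based weight, the `gamma` sharpness parameter, the 80% subsample fraction, and the tiny `kmeans` helper are all hypothetical choices, because the abstract does not give the exact formulas.

```python
import math
import random
from statistics import mean


def kmeans(points, k, rng, iters=20):
    """Tiny Lloyd's k-means (illustrative helper); returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(mean(col) for col in zip(*members))
    return labels


def silhouette_values(points, labels):
    """Per-point Silhouette s = (b - a) / max(a, b); singletons get 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            values.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(mean(math.dist(p, q) for q in other)
                for key, other in clusters.items() if key != lab)
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


def micro_macro(values, labels):
    """Micro: mean over points. Macro: mean of per-cluster means."""
    per_cluster = {}
    for v, lab in zip(values, labels):
        per_cluster.setdefault(lab, []).append(v)
    return mean(values), mean(mean(vs) for vs in per_cluster.values())


def composite_silhouette(points, k, n_subsamples=20, frac=0.8, gamma=5.0, seed=0):
    rng = random.Random(seed)
    size = max(k + 1, int(frac * len(points)))
    scores = []
    for _ in range(n_subsamples):
        sample = rng.sample(points, size)
        labels = kmeans(sample, k, rng)
        micro, macro = micro_macro(silhouette_values(sample, labels), labels)
        # Assumed weight: normalized discrepancy squashed by tanh into [0, 1].
        disc = (macro - micro) / (abs(macro) + abs(micro) + 1e-12)
        w = 0.5 * (1.0 + math.tanh(gamma * disc))
        scores.append(w * macro + (1.0 - w) * micro)
    return mean(scores)


if __name__ == "__main__":
    gen = random.Random(1)
    blobs = [((0, 0), 30), ((10, 0), 10), ((0, 10), 5)]  # imbalanced cluster sizes
    data = [(gen.gauss(cx, 0.5), gen.gauss(cy, 0.5)) for (cx, cy), n in blobs for _ in range(n)]
    for k in range(2, 6):
        print(k, round(composite_silhouette(data, k), 3))
```

A cluster count would then be chosen as the `k` with the highest composite score. Because each subsample score is a convex combination of two values in [-1, 1], the composite stays in that range as well.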

256. ❌ Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

Authors: Jaemin Kim, Sungkyun Kim, Junyeol Lee, Jiwon Seo Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13806v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

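Scoring rationale: The paper's core topic is Post-Training Quantization (PTQ), which maps directly to the two keywords 'Post-training OR Supervised Fine-tuning OR SFT' and 'Quantization OR Model Compression OR Low-bit Weights', and it explicitly targets Large Language Models (LLMs), so these three keywords are rated highly relevant (10 points). Other keywords such as MoE, SLMs, Scaling Laws, and Instruction Tuning are not covered in the paper and therefore score 0.

!!! tip deepseek-chat TL;DR

The paper proposes DASH-Q, a robust post-training quantization framework that uses a diagonal Hessian approximation and iterative weighted least squares. Under ultra-low-bit quantization it improves the zero-shot accuracy of large language models by 7.01% on average and by up to 14.01% over the strongest baselines.

Abstract (Chinese translation)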

大语言模型（LLMs）已在众多领域得到广泛应用，但其庞大的规模使得部署面临挑战。训练后量化（Post-Training Quantization, PTQ）通过利用少量校准数据集来减少内存占用，且无需重新训练。近期基于海森矩阵的PTQ方法通过跨通道依赖性来补偿量化误差，但由于有限校准数据产生的噪声化曲率估计，此类方法在低比特位宽下性能会下降。我们提出了DASH-Q，一个采用对角海森矩阵近似与迭代加权最小二乘法的鲁棒PTQ框架。通过舍弃易受噪声影响的依赖性，DASH-Q在过滤采样噪声的同时，优先保留了显著特征的能量。在超低比特位宽场景下，我们的方法超越了其他PTQ基线模型，在五种基线LLM模型上，零样本准确率平均提升7.01%，最高较最强基线提升14.01%，并且在使用极少量校准数据时仍表现出鲁棒且稳定的性能。

Abstract

Large Language Models (LLMs) are widely used across many domains, but their scale makes deployment challenging. Post-Training Quantization (PTQ) reduces memory footprint without retraining by leveraging a small calibration set. Recent Hessian-based PTQ methods compensate quantization error via cross-channel dependencies, but such approaches degrade at low bit-widths due to noisy curvature estimates from limited calibration data. We propose DASH-Q, a robust PTQ framework using diagonal Hessian approximation and iterative weighted least squares. By discarding noise-prone dependencies, DASH-Q filters sampling noise while prioritizing the preservation of salient feature power. We outperform other PTQ baselines in ultra low-bit regime, improving zero-shot accuracy by 7.01% on average and up to 14.01% over the strongest baselines across five baseline LLM models, while showing robust and stable performance with very small calibration data.

Keywords: Large Language Models, Post-Training Quantization, Low-bit Quantization, Hessian-based Methods, Diagonal Hessian Approximation, Quantization Error, Calibration Data, Zero-shot Accuracy
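
To make the phrase "diagonal Hessian approximation" from the abstract above concrete, the sketch below quantizes one weight row by minimizing a curvature-weighted error, `sum_i h_i * (w_i - s * q_i)^2`, where `h_i` is the diagonal of `X^T X` over a small calibration set. It is a generic illustration and not the DASH-Q algorithm: the alternating scheme (round to the grid, then refit the scale by weighted least squares), the symmetric integer grid, and all function names are assumptions.

```python
def diag_hessian(calib_inputs):
    """Diagonal of X^T X over calibration rows: one curvature value per input channel."""
    n_channels = len(calib_inputs[0])
    return [sum(row[i] ** 2 for row in calib_inputs) for i in range(n_channels)]


def weighted_error(weights, h_diag, codes, scale):
    """Curvature-weighted squared reconstruction error of the quantized row."""
    return sum(h * (w - scale * q) ** 2 for h, w, q in zip(h_diag, weights, codes))


def quantize_row(weights, h_diag, bits=2, iters=10):
    """Alternate between rounding to the integer grid and refitting the scale.

    Each step cannot increase the weighted error: rounding is optimal for a
    fixed scale, and the closed-form scale is optimal for fixed codes.
    """
    qmax = 2 ** (bits - 1) - 1
    qmin = -(2 ** (bits - 1))
    # Start from the usual max-abs scale; fall back to 1.0 for an all-zero row.
    scale = max(abs(w) for w in weights) / max(qmax, 1) or 1.0
    codes = [0] * len(weights)
    for _ in range(iters):
        codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
        denom = sum(h * q * q for h, q in zip(h_diag, codes))
        if denom == 0:  # nothing to fit: every weighted code is zero
            break
        scale = sum(h * w * q for h, w, q in zip(h_diag, weights, codes)) / denom
    return codes, scale


if __name__ == "__main__":
    calib = [[1.0, 0.1, 0.1, 2.0], [2.0, 0.2, 0.1, 1.0]]  # toy calibration activations
    row = [0.9, -0.2, 0.05, -1.1]
    h = diag_hessian(calib)
    codes, scale = quantize_row(row, h, bits=2)
    print(codes, scale, weighted_error(row, h, codes, scale))
```

Dropping the off-diagonal Hessian terms means each weight's error is penalized independently, which is the kind of simplification the abstract motivates when cross-channel curvature estimates from limited calibration data are noisy.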

257. ❌ Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

Authors: Dongjie Fu, Fangming Feng, Xize Cheng, Linjun Li, Zhou Zhao, Tao Jin Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13804v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

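Scoring rationale: The paper's core topic is the use of audio large language models for role-playing evaluation. It is rated highly relevant (10 points) to LLMs, Alignment, RLHF, Chain of Thought, and LLM Agents, because these are the paper's core techniques and methods. Other keywords such as MoE, SLMs, and Scaling Laws are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes RoleJudge, an evaluation framework that uses audio large language models and Standard Alignment in reinforcement learning, trained with a multi-stage paradigm, to systematically assess the character alignment of voice role-playing agents. Experiments show that it outperforms baseline models.

Abstract (Chinese translation)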

多模态大模型的快速发展革新了语音对话系统中多样化角色的模拟能力，催生出全新的交互范式。角色属性不仅体现在文本回应中，更通过语音特征得以呈现——语音承载着丰富的副语言信息，这些信息往往难以量化。这为评估角色扮演智能体的角色一致性带来了显著挑战。为应对这些难题，我们提出RoleJudge评估框架，该框架利用音频大语言模型（Audio Large Language Models）系统性地从多模态、多维度评估语音与角色的契合度。此外，我们构建了首个包含思维链推理标注的语音角色扮演评估数据集RoleChat，该数据集涵盖多样化真实语音与大语言模型生成语音样本。基于此数据集，我们采用多阶段训练范式，并在强化学习中引入标准对齐机制以优化过程中的奖励失准问题。准确率与主观评估的实验结果表明，RoleJudge在多项指标上优于各类基线模型，验证了我们多维评估框架的有效性。

Abstract

The rapid evolution of multimodal large models has revolutionized the simulation of diverse characters in speech dialogue systems, enabling a novel interactive paradigm. Character attributes are manifested not only in textual responses but also through vocal features, as speech conveys rich paralinguistic information that is challenging to quantify. This poses significant difficulties in evaluating the character alignment of role-playing agents. To address these challenges, we present RoleJudge, an evaluation framework that leverages audio large language models to systematically assess the alignment between speech and character across multiple modalities and dimensions. Furthermore, we introduce RoleChat, the first voice role-playing evaluation dataset enriched with chain-of-thought reasoning annotations, comprising a diverse set of authentic and LLM-generated speech samples. Utilizing this dataset, we implement a multi-stage training paradigm and incorporate Standard Alignment in reinforcement learning to mitigate reward misalignment during optimization. Experimental results in terms of accuracy and subjective assessment demonstrate that RoleJudge outperforms various baseline models, validating the effectiveness of our multidimensional evaluation framework.

Keywords: Audio Large Language Models, Role-Playing Evaluation, Reinforcement Learning, Chain-of-Thought Reasoning, Multimodal Evaluation, Character Alignment, Standard Alignment, Speech Dialogue Systems

258. ❌ Online learning with noisy side observations

Authors: Tomáš Kocák, Gergely Neu, Michal Valko Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13740v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

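Scoring rationale: The paper studies a partial-observability model for online learning, focusing on noisy side observations and graph-structured feedback sharing, and belongs to classical online learning and multi-armed bandit theory. It does not involve large models, deep learning, AI for Science, or any of the techniques named in the scoring keywords (such as MoE, RLHF, RAG, or quantization), nor does it mention applications of large models in any domain. All keywords are unrelated to the paper's topic, so every score is 0.

!!! tip deepseek-chat TL;DR

The paper studies partial observability in online learning, proposes a new model that represents the feedback structure as a weighted directed graph, and designs a parameter-free algorithm whose regret after $T$ rounds is bounded in terms of the graph's effective independence number.

Abstract (Chinese translation)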

我们针对在线学习问题提出了一种新的部分可观测性模型，在该模型中，学习者除了自身的损失外，还能根据问题底层结构观测到关于其他动作的噪声反馈。我们通过加权有向图来表示这一结构，其中边的权重与相连节点间共享反馈的质量相关。我们的主要贡献是一种高效算法，该算法保证在$T$轮后实现$\widetilde{O}(\sqrt{\alpha^{\ast} T})$的遗憾界，其中$\alpha^{\ast}$是我们提出的新图属性，称为有效独立数。我们的算法完全无需参数，且不需要了解（甚至估计）$\alpha^{\ast}$。对于边权重为二元的特殊情况，我们的设定可简化为Mannor与Shamir（2011）以及Alon等人（2013）的部分可观测性模型，且我们的算法恢复了近乎最优的遗憾界。

Abstract

We propose a new partial-observability model for online learning problems where the learner, besides its own loss, also observes some noisy feedback about the other actions, depending on the underlying structure of the problem. We represent this structure by a weighted directed graph, where the edge weights are related to the quality of the feedback shared by the connected nodes. Our main contribution is an efficient algorithm that guarantees a regret of $\widetilde{O}(\sqrt{\alpha^{\ast} T})$ after $T$ rounds, where $\alpha^{\ast}$ is a novel graph property that we call the effective independence number. Our algorithm is completely parameter-free and does not require knowledge (or even estimation) of $\alpha^{\ast}$. For the special case of binary edge weights, our setting reduces to the partial-observability models of Mannor and Shamir (2011) and Alon et al. (2013) and our algorithm recovers the near-optimal regret bounds.

Keywords: online learning, partial observability, noisy feedback, graph structure, regret bound, effective independence number, parameter-free algorithm

259. ❌ Spectral Thompson sampling

Authors: Tomas Kocak, Michal Valko, Remi Munos, Shipra Agrawal Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

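Scoring rationale: The paper studies the Spectral Thompson Sampling algorithm for multi-armed bandit problems, which belongs to classical reinforcement learning and online learning. It is entirely unrelated to the scoring keywords, all of which center on large models, deep learning techniques, and their applications.

!!! tip deepseek-chat TL;DR

The paper presents and analyzes the SpectralTS algorithm for bandit problems whose payoffs are smooth over an underlying graph, proves that its regret scales as $d\sqrt{T \ln N}$, and validates its computational efficiency and competitiveness on synthetic and real-world data.

Abstract (Chinese translation)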

汤普森采样（Thompson Sampling，TS）因其良好的实证性能（尤其在计算广告领域）而备受关注。尽管取得了成功，但其性能分析工具直到近期才出现。本文针对一种赌博机问题描述并分析了SpectralTS算法，其中各选项的收益在给定底层图结构下具有平滑性。在此设定中，每个选项对应图中的一个节点，且相邻节点的期望收益被假定为相似。虽然该设定在推荐系统和广告领域均有应用，但传统算法会随选项数量增加而显著降低效率。为此，我们引入一个有效维度$d$，该维度在现实世界的图中通常较小。我们通过分析证明，SpectralTS的遗憾值在高概率下以$d\sqrt{T \ln N}$的尺度增长，其中$T$为时间范围，$N$为选项数量。由于$d\sqrt{T \ln N}$的遗憾度与已知研究结果相当，SpectralTS提供了一种计算效率更高的替代方案。我们还通过合成数据与真实数据验证了该算法的竞争力。

Abstract

Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has applications in both recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension $d$, which is small in real-world graphs. We deliver the analysis showing that the regret of SpectralTS scales as $d\sqrt{T \ln N}$ with high probability, where $T$ is the time horizon and $N$ is the number of choices. Since a $d\sqrt{T \ln N}$ regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.

Keywords: Thompson Sampling, bandit problem, graph structure, regret analysis, computational efficiency, recommender systems, advertising, spectral methods

260. ❌ Covariance-adapting algorithm for semi-bandits with application to sparse rewards

Authors: Pierre Perrault, Vianney Perchet, Michal Valko Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13738v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

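Scoring rationale: The paper studies stochastic combinatorial semi-bandits and sits within algorithmic theory, statistical learning, and optimization, covering a covariance-adapting algorithm, sparse rewards, and recommender-system applications. It does not involve large models, deep learning, the underlying principles of AI techniques, or AI for Science applications, and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper studies stochastic combinatorial semi-bandits, proposes a covariance-adapting algorithm for a family of sub-exponential distributions, proves a regret lower bound parameterized by the unknown covariance matrix, and extends the results to sparse rewards, with applications in recommender systems.

Abstract (Chinese translation)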

我们研究随机组合半赌博机问题，其中结果的全部联合分布会影响问题实例的复杂度（这与标准赌博机不同）。通常考虑的概率分布依赖于特定参数值，理论上需要先验知识，但在实践中很难估计；例如常用的亚高斯分布族。我们通过引入一个新的广义亚指数分布族来缓解这一问题，该族包含有界分布和高斯分布。我们证明了在该分布族上期望遗憾的一个新下界，该下界由结果未知的协方差矩阵参数化，这是一个比亚高斯矩阵更紧的量。随后，我们构建了一种利用协方差估计的算法，并对遗憾进行了紧的渐近分析。最后，我们将结果应用并推广到稀疏结果分布族，这在许多推荐系统中具有应用价值。

Abstract

We investigate stochastic combinatorial semi-bandits, where the entire joint distribution of outcomes impacts the complexity of the problem instance (unlike in the standard bandits). Typical distributions considered depend on specific parameter values, whose prior knowledge is required in theory but quite difficult to estimate in practice; an example is the commonly assumed sub-Gaussian family. We alleviate this issue by instead considering a new general family of sub-exponential distributions, which contains bounded and Gaussian ones. We prove a new lower bound on the expected regret on this family, that is parameterized by the unknown covariance matrix of outcomes, a tighter quantity than the sub-Gaussian matrix. We then construct an algorithm that uses covariance estimates, and provide a tight asymptotic analysis of the regret. Finally, we apply and extend our results to the family of sparse outcomes, which has applications in many recommender systems.

Keywords: stochastic combinatorial semi-bandits, covariance-adapting algorithm, sub-exponential distributions, regret analysis, sparse rewards, recommender systems, covariance matrix, lower bound

261. ❌ Reachability Constraints in Variational Quantum Circuits: Optimization within Polynomial Group Module

Authors: Yun-Tak Oh, Dongsoo Lee, Jungyoul Park, Kyung Chul Jeong, Panjin Kim Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13735v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

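Scoring rationale: This paper studies reachability constraints in variational quantum circuits, which belongs to quantum computing theory and is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, and their applications). It does not address large-model techniques, training methods, inference optimization, alignment, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper identifies a necessary condition for variational quantum circuits to reach the exact ground state, finds that the module weights of the solution state must be known in advance, and uses the Maximum Cut problem to show that some problems admit a classical surrogate.

Abstract translation

This work identifies a necessary condition for variational quantum approaches to reach the exact ground state. Briefly, the norms of the projections of the input state and the ground state onto each group module must match, which means the module weights of the solution state must be known in advance in order to reach the exact ground state. A typical example is matchgate circuits applied to problems whose solutions are classical bit strings, since all computational basis states share the same module weights. Combined with the known classical simulability of quantum circuits whose observables lie in a small linear subspace, this implies that certain problems admit a classical surrogate for exact solution, with each step taking $O(n^5)$ time. The Maximum Cut problem serves as a concrete illustration.

Abstract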

This work identifies a necessary condition for any variational quantum approach to reach the exact ground state. Briefly, the norms of the projections of the input and the ground state onto each group module must match, implying that module weights of the solution state have to be known in advance in order to reach the exact ground state. An exemplary case is provided by matchgate circuits applied to problems whose solutions are classical bit strings, since all computational basis states share the same module-wise weights. Combined with the known classical simulability of quantum circuits for which observables lie in a small linear subspace, this implies that certain problems admit a classical surrogate for exact solution with each step taking $O(n^5)$ time. The Maximum Cut problem serves as an illustrative example.

Keywords: Variational Quantum Circuits, Reachability Constraints, Ground State, Polynomial Group Module, Matchgate Circuits, Classical Simulability, Maximum Cut Problem, Exact Solution

262. ❌ Physics-Informed Neural Networks for Solving Derivative-Constrained PDEs

Authors: Kentaro Hoshisashi, Carolyn E Phelan, Paolo Barucca Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13723v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

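Scoring rationale: The paper studies Physics-Informed Neural Networks (PINNs) and their extension DC-PINNs, focusing on solving PDEs with derivative constraints, which is an application of AI to scientific computing. All keywords relate to large models, deep learning principles, or specific AI application areas, but the paper does not cover large models, language models, training and alignment, inference optimization, agents, or similar topics. It is only partly related to 'AI for Science' (PDE solving applied to physics and finance), so that keyword is rated 5/10 and all others 0.

!!! tip deepseek-chat TL;DR

The paper proposes the Derivative-Constrained PINNs (DC-PINNs) framework, which embeds nonlinear constraints on states and derivatives and uses self-adaptive loss balancing to solve constrained PDEs, reducing constraint violations and improving physical fidelity on several benchmarks.

Abstract translation

Physics-Informed Neural Networks (PINNs) recast PDE solving as an optimization problem in function space by minimizing a residual-based objective. However, many applications require additional derivative-based relations that are just as important as the governing equations themselves. This paper presents Derivative-Constrained PINNs (DC-PINNs), a general framework that treats constrained PDE solving as an optimization guided by a minimum-objective-function criterion, in which the physics resides in the minimum principle. DC-PINNs embed general nonlinear constraints on state variables and their derivatives (for example bounds, monotonicity, convexity, and incompressibility), computed efficiently via automatic differentiation, and use a self-adaptive loss-balancing mechanism to tune the influence of each objective term, reducing reliance on manual hyperparameters and problem-specific architectures. On benchmarks including heat diffusion with bounds, financial volatility with no-arbitrage constraints, and fluid flow with vortex shedding, DC-PINNs consistently reduce constraint violations and improve physical fidelity compared with baseline PINN variants and representative hard-constraint methods. Explicitly encoding derivative constraints stabilizes training and steers optimization toward physically admissible minima even when the PDE residual alone is small, providing reliable solutions of constrained PDEs grounded in energy minimum principles.

Abstract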

Physics-Informed Neural Networks (PINNs) recast PDE solving as an optimisation problem in function space by minimising a residual-based objective, yet many applications require additional derivative-based relations that are just as fundamental as the governing equations. In this paper, we present Derivative-Constrained PINNs (DC-PINNs), a general framework that treats constrained PDE solving as an optimisation guided by a minimum objective function criterion where the physics resides in the minimum principle. DC-PINNs embed general nonlinear constraints on states and derivatives, e.g., bounds, monotonicity, convexity, incompressibility, computed efficiently via automatic differentiation, and they employ self-adaptive loss balancing to tune the influence of each objective, reducing reliance on manual hyperparameters and problem-specific architectures. DC-PINNs consistently reduce constraint violations and improve physical fidelity versus baseline PINN variants, representative hard-constraint formulations on benchmarks, including heat diffusion with bounds, financial volatilities with arbitrage-free, and fluid flow with vortices shed. Explicitly encoding derivative constraints stabilises training and steers optimisation toward physically admissible minima even when the PDE residual alone is small, providing reliable solutions of constrained PDEs grounded in energy minimum principles.

Keywords: Physics-Informed Neural Networks, PINNs, Derivative-Constrained PDEs, Optimization, Automatic Differentiation, Self-adaptive Loss Balancing, Physical Fidelity, Energy Minimum Principles

263. ❌ VIGILant: an automatic classification pipeline for glitches in the Virgo detector

Authors: Tiago Fernandes, Francesco Di Renzo, Antonio Onofre, Alejandro Torres-Forné, José A. Font Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

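Scoring rationale: The paper studies glitch classification in a gravitational-wave detector using traditional machine learning methods (Decision Tree, Random Forest, XGBoost) and a convolutional neural network (ResNet). It does not involve large language models, novel deep learning techniques, or any of the specific techniques named in the scoring keywords. It is only partly related to 'AI for Science', since it is an AI application in a scientific field but not an application of large models to science, so that keyword is rated 5/10. All other keywords are unrelated and rated 0.

!!! tip deepseek-chat TL;DR

The paper addresses glitch classification in the Virgo gravitational-wave detector by developing the VIGILant automatic classification pipeline. Comparing tree-based models with a ResNet34 model, it finds that ResNet34 reaches an F1 score of 0.9772 and an accuracy of 0.9833 on the test set, and the pipeline has been deployed for daily monitoring.

Abstract translation

Data from gravitational-wave detectors are frequently contaminated by glitches, which complicates the observation and analysis of astrophysical signals. This work introduces VIGILant, a pipeline for automatic classification and visualization of glitches in the Virgo detector. Using a curated dataset of Virgo O3b glitches, we evaluate two machine learning approaches: tree-based models (Decision Tree, Random Forest, and XGBoost) trained on structured Omicron parameters, and convolutional neural networks (ResNet) trained on spectrograms. Although tree-based models offer higher interpretability and fast training, the ResNet34 model performs better on the test set, with an F1 score of 0.9772, an accuracy of 0.9833, and inference times of tens of milliseconds per glitch. The pipeline has run daily at the Virgo site since observing run O4c, giving the Virgo collaboration an interactive dashboard for monitoring glitch populations and detector behavior. It can identify low-confidence predictions and thereby flag unusual glitches that need further attention.

Abstract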

Glitches frequently contaminate data in gravitational-wave detectors, complicating the observation and analysis of astrophysical signals. This work introduces VIGILant, an automatic pipeline for classification and visualization of glitches in the Virgo detector. Using a curated dataset of Virgo O3b glitches, two machine learning approaches are evaluated: tree-based models (Decision Tree, Random Forest and XGBoost) using structured Omicron parameters, and Convolutional Neural Networks (ResNet) trained on spectrogram images. While tree-based models offer higher interpretability and fast training, the ResNet34 model achieved superior performance, reaching a F1 score of 0.9772 and accuracy of 0.9833 in the testing set, with inference times of tens of milliseconds per glitch. The pipeline has been deployed for daily operation at the Virgo site since observing run O4c, providing the Virgo collaboration with an interactive dashboard to monitor glitch populations and detector behavior. This allows to identify low-confidence predictions, highlighting glitches requiring further attention.

Keywords: gravitational-wave detectors, glitch classification, machine learning, convolutional neural networks, ResNet, Virgo detector, automatic pipeline, spectrogram analysis

264. ❌ EMGFlow: Robust and Efficient Surface Electromyography Synthesis via Flow Matching

Authors: Boxuan Jiang, Chenyun Dai, Can Han Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

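Scoring rationale: The paper studies deep-learning-based synthesis of surface electromyography (sEMG) signals, a biomedical AI application. It is partly related to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics' (rated 5/10), because sEMG is a form of biosignal processing and can be viewed as an application of bioinformatics or AI for Science. The paper does not involve large-model techniques such as LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment, RLHF, PEFT, RAG, context extension, KV cache compression, reasoning techniques (chain of thought, System 2 thinking, MCTS), self-correction, agents, tool use, multi-agent systems, quantization, speculative decoding, hallucination mitigation, interpretability, world models, model merging, or in-context learning, so those keywords are rated 0.

!!! tip deepseek-chat TL;DR

The paper proposes EMGFlow, a conditional sEMG generation framework based on Flow Matching that targets data scarcity in sEMG gesture recognition. Experiments show that it outperforms conventional generative methods in synthetic data quality and efficiency.

Abstract translation

Deep-learning-based surface electromyography (sEMG) gesture recognition is often limited by data scarcity and insufficient subject diversity. Generative Adversarial Networks (GANs) and diffusion models have emerged as promising augmentation strategies for synthetic data generation, but they still face challenges in training stability or inference efficiency. To bridge this gap, we propose EMGFlow, a conditional sEMG generation framework. To the best of our knowledge, this is the first study to explore Flow Matching and continuous-time generative modeling in the sEMG domain. To validate EMGFlow on three benchmark sEMG datasets, we adopt a unified evaluation protocol that integrates feature-based fidelity, distributional geometry, and downstream utility. Extensive experiments show that EMGFlow outperforms conventional augmentation methods and GAN baselines, and provides stronger standalone utility than the diffusion baselines considered under the train-on-synthetic test-on-real (TSTR) protocol. In addition, by optimizing the generation dynamics with advanced numerical solvers and targeted time sampling, EMGFlow achieves a better quality-efficiency trade-off. Taken together, these results suggest that Flow Matching is a promising and efficient paradigm for addressing data bottlenecks in myoelectric control systems. Our code is available at: https://github.com/Open-EXG/EMGFlow.

Abstract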

Deep learning-based surface electromyography (sEMG) gesture recognition is frequently bottlenecked by data scarcity and limited subject diversity. While synthetic data generation via Generative Adversarial Networks (GANs) and diffusion models has emerged as a promising augmentation strategy, these approaches often face challenges regarding training stability or inference efficiency. To bridge this gap, we propose EMGFlow, a conditional sEMG generation framework. To the best of our knowledge, this is the first study to investigate the application of Flow Matching (FM) and continuous-time generative modeling in the sEMG domain. To validate EMGFlow across three benchmark sEMG datasets, we employ a unified evaluation protocol integrating feature-based fidelity, distributional geometry, and downstream utility. Extensive evaluations show that EMGFlow outperforms conventional augmentation and GAN baselines, and provides stronger standalone utility than the diffusion baselines considered here under the train-on-synthetic test-on-real (TSTR) protocol. Furthermore, by optimizing generation dynamics through advanced numerical solvers and targeted time sampling, EMGFlow achieves improved quality-efficiency trade-offs. Taken together, these results suggest that Flow Matching is a promising and efficient paradigm for addressing data bottlenecks in myoelectric control systems. Our code is available at: https://github.com/Open-EXG/EMGFlow.

Keywords: surface electromyography, sEMG, Flow Matching, generative modeling, data augmentation, gesture recognition, myoelectric control, synthetic data generation

265. ❌ node2vec or triangle-biased random walks: stationarity, regularity & recurrence

Authors: Luca Avena, Gianmarco Bet, Lars Schroeder, Clara Stegehuis Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

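Scoring rationale: The paper studies the node2vec random walk model from graph theory. It belongs to mathematics and network science and mainly analyzes theoretical properties of the model such as its Markov structure, ergodicity, reversibility, and invariant measure. All scoring keywords relate to large language models, deep learning principles, or applications of AI in science, none of which this paper touches, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the long-run behavior of the node2vec random walk on arbitrary graphs. By lifting the walk to the state spaces of directed edges and directed wedges it obtains two Markovian representations, and it finds sufficient conditions that guarantee ergodicity, reversibility, recurrence, and a characterization of the invariant measure.

Abstract translation

The node2vec random walk is a non-Markovian random walk on the vertex set of a graph, widely used for network embedding and exploration. The model is defined by three parameters that control the probabilities of, respectively, backtracking moves, moves within triangles, and moves to the remaining neighboring nodes. Mathematically, the node2vec random walk is a nontrivial generalization of the non-backtracking random walk and therefore belongs to the class of second-order Markov chains. Despite its widespread use in applications, little is known about its long-run behavior. This paper begins to explore its fundamental properties on arbitrary graphs. To this end, we show how lifting the node2vec random walk to the state spaces of directed edges and directed wedges yields two distinct Markovian representations, which are essential for its asymptotic analysis. Using these representations, we find mild sufficient conditions on the underlying finite or infinite graph that guarantee ergodicity, reversibility, recurrence, and a characterization of the invariant measure. As we discuss, the behavior of the node2vec random walk differs drastically from that of the non-backtracking random walk. The latter simplifies on arbitrary graphs through its natural edge Markovian representation thanks to bistochasticity, whereas the former simplifies on regular graphs through its natural wedge Markovian representation. Remarkably, this representation reveals that a graph is regular if and only if a certain weighted Eulerianity condition holds.

Abstract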

The node2vec random walk is a non-Markovian random walk on the vertex set of a graph, widely used for network embedding and exploration. This random walk model is defined in terms of three parameters which control the probability of, respectively, backtracking moves, moves within triangles, and moves to the remaining neighboring nodes. From a mathematical standpoint, the node2vec random walk is a nontrivial generalization of the non-backtracking random walk and thus belongs to the class of second-order Markov chains. Despite its widespread use in applications, little is known about its long-run behavior. The goal of this paper is to begin exploring its fundamental properties on arbitrary graphs. To this aim, we show how lifting the node2vec random walk to the state spaces of directed edges and directed wedges yields two distinct Markovian representations which are key for its asymptotic analysis. Using these representations, we find mild sufficient conditions on the underlying finite or infinite graph to guarantee ergodicity, reversibility, recurrence and characterization of the invariant measure. As we discuss, the behavior of the node2vec random walk is drastically different compared to the non-backtracking random walk. While the latter simplifies on arbitrary graphs when using its natural edge Markovian representation thanks to bistochasticity, the former simplifies on regular graphs when using its natural wedge Markovian representation. Remarkably, this representation reveals that a graph is regular if and only if a certain weighted Eulerianity condition holds.

Keywords: node2vec, random walk, Markov chain, ergodicity, reversibility, recurrence, invariant measure, graph theory
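
The three-parameter walk described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' code: the weight names `w_back`, `w_tri`, and `w_other` and the adjacency-dict graph format are assumptions made for illustration, and the weights are unnormalized. The next vertex depends on the current directed edge `prev -> cur`, which is why the walk is second-order rather than Markovian on vertices.

```python
import random

def node2vec_step(adj, prev, cur, w_back, w_tri, w_other, rng=random):
    """Sample the next vertex of a walk that just moved along prev -> cur."""
    neighbors = sorted(adj[cur])
    weights = []
    for nxt in neighbors:
        if nxt == prev:
            weights.append(w_back)    # backtracking move
        elif nxt in adj[prev]:
            weights.append(w_tri)     # move inside a triangle prev-cur-nxt
        else:
            weights.append(w_other)   # move to any remaining neighbor
    # Raises ValueError if every weight is zero (e.g. w_back=0 at a dead end).
    return rng.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(adj, start, length, w_back, w_tri, w_other, seed=0):
    """Return a walk with `length` vertices (length >= 2) starting at `start`."""
    rng = random.Random(seed)
    walk = [start, rng.choice(sorted(adj[start]))]  # first step is uniform
    while len(walk) < length:
        walk.append(
            node2vec_step(adj, walk[-2], walk[-1], w_back, w_tri, w_other, rng)
        )
    return walk

# Hypothetical example graph: triangle 0-1-2 with a pendant vertex 3 on vertex 2.
GRAPH = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(node2vec_walk(GRAPH, start=0, length=10, w_back=1.0, w_tri=2.0, w_other=0.5))
```

Setting `w_back=0` with `w_tri == w_other` gives a non-backtracking random walk, the special case the abstract says node2vec generalizes; on graphs with degree-one vertices that setting has no valid move at a dead end.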

266. ❌ Optimization with SpotOptim

Authors: Thomas Bartz-Beielstein Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13672v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

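Scoring rationale: The paper 'Optimization with SpotOptim' focuses on a general surrogate-model-based optimization framework (the spotoptim package) for expensive black-box functions. Its core content covers Kriging surrogates, Expected Improvement, support for multiple variable types, noise-aware evaluation (OCBA), multi-objective extensions, a parallelization strategy, and a restart mechanism. Its worked examples include neural network hyperparameter tuning, and it compares the package with existing optimization frameworks such as BoTorch and Optuna. All scoring keywords relate directly to large language models (LLMs), deep learning principles, or specific applications of AI in science (such as bioinformatics). This paper is about general-purpose optimization algorithms and framework development and does not contribute to large models, deep learning techniques, or their scientific applications, so it is unrelated to every keyword and all are rated 0.

!!! tip deepseek-chat TL;DR

The paper presents spotoptim, a Python package for surrogate-model-based optimization of expensive black-box functions. It provides a complete optimization framework including Kriging models, Expected Improvement, support for multiple variable types, noise handling, and multi-objective extensions, and demonstrates it on examples such as neural network hyperparameter tuning.

Abstract translation

The spotoptim package implements surrogate-model-based optimization of expensive black-box functions in Python. Building on two decades of Sequential Parameter Optimization (SPO) methodology, it provides a Kriging-based optimization loop with the Expected Improvement acquisition function, supports continuous, integer, and categorical variables, offers noise-aware evaluation via Optimal Computing Budget Allocation (OCBA), and includes multi-objective extensions. Its steady-state parallelization strategy overlaps surrogate search with objective evaluation on multi-core hardware, and a success-rate-based restart mechanism detects stagnation while preserving the best solution found. The package returns scipy-compatible OptimizeResult objects and accepts any scikit-learn-compatible surrogate model. Built-in TensorBoard logging provides real-time monitoring of convergence and surrogate quality. This report describes the architecture and module structure of spotoptim, provides worked examples including neural network hyperparameter tuning, and compares the framework with BoTorch, Optuna, Ray Tune, BOHB, SMAC, and Hyperopt. The package is open-source.

Abstract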

The spotoptim package implements surrogate-model-based optimization of expensive black-box functions in Python. Building on two decades of Sequential Parameter Optimization (SPO) methodology, it provides a Kriging-based optimization loop with Expected Improvement, support for continuous, integer, and categorical variables, noise-aware evaluation via Optimal Computing Budget Allocation (OCBA), and multi-objective extensions. A steady-state parallelization strategy overlaps surrogate search with objective evaluation on multi-core hardware, and a success-rate-based restart mechanism detects stagnation while preserving the best solution found. The package returns scipy-compatible OptimizeResult objects and accepts any scikit-learn-compatible surrogate model. Built-in TensorBoard logging provides real-time monitoring of convergence and surrogate quality. This report describes the architecture and module structure of spotoptim, provides worked examples including neural network hyperparameter tuning, and compares the framework with BoTorch, Optuna, Ray Tune, BOHB, SMAC, and Hyperopt. The package is open-source.

Keywords: surrogate-model-based optimization, expensive black-box functions, Sequential Parameter Optimization (SPO), Kriging, Expected Improvement, Optimal Computing Budget Allocation (OCBA), hyperparameter tuning, optimization framework
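
The Expected Improvement criterion named in the abstract above has a standard closed form when the surrogate prediction at a candidate is Gaussian with mean $\mu$ and standard deviation $\sigma$. The sketch below implements the textbook formula for minimization, $\mathrm{EI} = (f_{\text{best}} - \mu)\,\Phi(z) + \sigma\,\varphi(z)$ with $z = (f_{\text{best}} - \mu)/\sigma$. It does not use or reproduce the spotoptim API; the function name and the candidate values are hypothetical.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Expected Improvement (minimization) for a Gaussian surrogate prediction."""
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return (f_best - mu) * cdf + sigma * pdf

# Hypothetical (mean, std) predictions at three candidate points.
candidates = [(0.9, 0.05), (1.2, 0.8), (1.0, 0.0)]
best = max(candidates, key=lambda c: expected_improvement(c[0], c[1], f_best=1.0))
print(best)
```

In this example the candidate with the worse predicted mean but large uncertainty wins, which shows how the criterion trades off exploitation against exploration.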

267. ❌ A Bayesian Framework for Uncertainty-Aware Explanations in Power Quality Disturbance Classification

Authors: Yinsong Chen, Samson S. Yu, Kashem M. Muttaqi Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13658v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

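Scoring rationale: The paper focuses on explainable AI (XAI) methods for power quality disturbance classification and proposes a Bayesian framework that models explanation uncertainty. This is highly relevant to the keyword 'Mechanistic Interpretability OR Explainable AI' (10/10), since it is the paper's core technical contribution. The paper applies deep learning to power systems, a science and engineering domain, which is partly related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5/10), although it is not core bioinformatics or cheminformatics. All other keywords concern large-model (LLM) techniques (such as pre-training, alignment, and inference acceleration) or specific methods (such as MoE, RAG, and CoT). This paper studies the explainability of conventional deep learning classifiers and does not involve any large-model technique or any of those methods, so they are rated 0.

!!! tip deepseek-chat TL;DR

Conventional explainable AI methods for power quality disturbance classification ignore uncertainty. The paper proposes a Bayesian explanation framework that models explanation uncertainty by generating a relevance attribution distribution for each instance, improving the transparency and reliability of the classifier.

Abstract translation

Advanced deep learning methods have shown remarkable success in power quality disturbance (PQD) classification. To improve model transparency, explainable AI (XAI) techniques have been developed to provide instance-specific interpretations of classifier decisions. However, conventional XAI methods produce only deterministic explanations and ignore uncertainty, which limits reliability in safety-critical applications. This paper proposes a Bayesian explanation framework that models explanation uncertainty by generating a relevance attribution distribution for each instance. The method lets experts select explanations by confidence percentile, tailoring interpretability to specific disturbance types. Extensive experiments on synthetic and real-world power quality datasets show that the proposed framework improves the transparency and reliability of PQD classifiers through uncertainty-aware explanations.

Abstract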

Advanced deep learning methods have shown remarkable success in power quality disturbance (PQD) classification. To enhance model transparency, explainable AI (XAI) techniques have been developed to provide instance-specific interpretations of classifier decisions. However, conventional XAI methods yield deterministic explanations, overlooking uncertainty and limiting reliability in safety-critical applications. This paper proposes a Bayesian explanation framework that models explanation uncertainty by generating a relevance attribution distribution for each instance. This method allows experts to select explanations based on confidence percentiles, thereby tailoring interpretability according to specific disturbance types. Extensive experiments on synthetic and real-world power quality datasets demonstrate that the proposed framework improves the transparency and reliability of PQD classifiers through uncertainty-aware explanations.

Keywords: Explainable AI, Bayesian framework, Uncertainty-aware explanations, Power quality disturbance classification, Deep learning, Model transparency, Relevance attribution distribution
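
The idea of selecting an explanation by confidence percentile can be illustrated with a small sketch. It is only one plausible reading of the abstract above, not the paper's method: it assumes that several attribution vectors for the same instance are already available (for example from repeated stochastic forward passes), and the function name and sample values are hypothetical.

```python
import statistics

def attribution_percentile(samples, pct):
    """Per-feature percentile of sampled attribution vectors.

    samples: list of equal-length attribution vectors, one per posterior draw
             (at least two draws are assumed).
    pct:     integer percentile between 1 and 99.
    """
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    per_feature = zip(*samples)  # transpose: one tuple of draws per feature
    return [
        statistics.quantiles(draws, n=100, method="inclusive")[pct - 1]
        for draws in per_feature
    ]

# Hypothetical attribution draws for an instance with two input features.
draws = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0], [3.0, 1.0], [4.0, 1.0]]
print(attribution_percentile(draws, 50))  # median attribution per feature
print(attribution_percentile(draws, 90))  # a more optimistic upper percentile
```

Comparing a low and a high percentile for the same feature gives a rough sense of how stable its attribution is across draws.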

268. ❌ Self-Organizing Maps with Optimized Latent Positions

Authors: Seiki Ubukata, Akira Notsu, Katsuhiro Honda Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13622v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

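Scoring rationale: The paper studies the classical Self-Organizing Map (SOM) method, which belongs to traditional unsupervised learning, vector quantization, and topographic mapping. The proposed SOM-OLP method focuses on the optimization objective, computational efficiency, and scalability, and is unrelated to every scoring keyword (all of which center on large models, deep learning principles, and their applications). The paper does not involve large models, deep learning, AI for Science, or related techniques.

!!! tip deepseek-chat TL;DR

The paper proposes SOM-OLP, an objective-based self-organizing map method. By introducing continuous latent positions and constructing a separable surrogate local cost, it addresses the trade-off in existing SOM methods between computational efficiency and a well-defined optimization objective, and experiments show good neighborhood preservation, quantization performance, and scalability.

Abstract translation

Self-Organizing Maps (SOM) are a classical method for unsupervised learning, vector quantization, and topographic mapping of high-dimensional data. However, existing SOM formulations often trade off computational efficiency against a clearly defined optimization objective. Objective-based variants such as Soft Topographic Vector Quantization (STVQ) provide a principled formulation, but their neighborhood-coupled computations become expensive as the number of latent nodes grows. This paper proposes Self-Organizing Maps with Optimized Latent Positions (SOM-OLP), an objective-based topographic mapping method that introduces a continuous latent position for each data point. Starting from the neighborhood distortion of STVQ, we construct a separable surrogate local cost based on its local quadratic structure and formulate an entropy-regularized objective from it. This yields a simple block coordinate descent scheme in which the updates for assignment probabilities, latent positions, and reference vectors have closed forms, while guaranteeing monotonic non-increase of the objective and retaining per-iteration complexity that is linear in the numbers of data points and latent nodes. Experiments on a synthetic saddle manifold, scalability studies on the Digits and MNIST datasets, and tests on 16 benchmark datasets show that SOM-OLP is competitive in neighborhood preservation and quantization performance, scales well to large numbers of latent nodes and large datasets, and achieves the best average rank among the compared methods on the benchmark datasets.

Abstract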

Self-Organizing Maps (SOM) are a classical method for unsupervised learning, vector quantization, and topographic mapping of high-dimensional data. However, existing SOM formulations often involve a trade-off between computational efficiency and a clearly defined optimization objective. Objective-based variants such as Soft Topographic Vector Quantization (STVQ) provide a principled formulation, but their neighborhood-coupled computations become expensive as the number of latent nodes increases. In this paper, we propose Self-Organizing Maps with Optimized Latent Positions (SOM-OLP), an objective-based topographic mapping method that introduces a continuous latent position for each data point. Starting from the neighborhood distortion of STVQ, we construct a separable surrogate local cost based on its local quadratic structure and formulate an entropy-regularized objective based on it. This yields a simple block coordinate descent scheme with closed-form updates for assignment probabilities, latent positions, and reference vectors, while guaranteeing monotonic non-increase of the objective and retaining linear per-iteration complexity in the numbers of data points and latent nodes. Experiments on a synthetic saddle manifold, scalability studies on the Digits and MNIST datasets, and 16 benchmark datasets show that SOM-OLP achieves competitive neighborhood preservation and quantization performance, favorable scalability for large numbers of latent nodes and large datasets, and the best average rank among the compared methods on the benchmark datasets.

Keywords: Self-Organizing Maps, unsupervised learning, vector quantization, topographic mapping, optimization objective, computational efficiency, scalability, latent positions
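The abstract above mentions closed-form updates for assignment probabilities under an entropy-regularized objective. The sketch below illustrates only the generic textbook fact behind such updates: minimizing an expected cost plus a negative-entropy term over the probability simplex gives a softmax of the negative costs. The function names, the `temperature` parameter, and the example costs are assumptions made for illustration; the paper's actual surrogate cost and update rules are not reproduced here.

```python
import math


def entropy_regularized_assignment(costs, temperature):
    """Minimize sum(p*c) + temperature * sum(p*log p) over the probability simplex.

    The minimizer has the closed form p_k proportional to exp(-c_k / temperature).
    """
    shift = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - shift) / temperature) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def objective(probs, costs, temperature):
    """Expected cost plus temperature times the negative entropy of probs."""
    expected_cost = sum(p * c for p, c in zip(probs, costs))
    neg_entropy = sum(p * math.log(p) for p in probs if p > 0.0)
    return expected_cost + temperature * neg_entropy


# Hypothetical local costs of one data point with respect to three latent nodes.
costs = [0.2, 1.0, 3.0]
p = entropy_regularized_assignment(costs, temperature=0.5)
print(p, objective(p, costs, 0.5))
```

A lower `temperature` concentrates the assignment on the cheapest node, while a higher one spreads it more evenly.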

269. ❌ Irregularly Sampled Time Series Interpolation for Binary Evolution Simulations Using Dynamic Time Warping

Authors: Ugur Demir, Philipp M. Srivastava, Aggelos Katsaggelos, Vicky Kalogera, Santiago L. Tapia, Manuel Ballester, Shamal Lalvani, Patrick Koller, Jeff J. Andrews, Seth Gossage, Max M. Briel, Elizabeth Teng Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13604v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

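Scoring rationale: The paper addresses time-series interpolation for binary stellar evolution simulations in astrophysics, using Dynamic Time Warping (DTW) to solve the track alignment problem. It is unrelated to nearly all of the keywords, which cover large models, deep learning principles, training methods, inference optimization, alignment, and agents. The only plausibly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work applies computational methods to a scientific domain (astrophysics). However, it relies on the classical DTW algorithm rather than deep learning or large models, so it receives 5 points (partial relevance).

!!! tip deepseek-chat TL;DR

The paper proposes a new Dynamic Time Warping-based method for joint alignment and iterative track averaging. It resolves the track misalignment caused by stellar interactions in binary evolution simulations and thereby yields more accurate binary population samples for astrophysical research.

Abstract (Chinese translation)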

双星演化模拟的计算成本高昂。恒星种群合成从根本上依赖于这些详细的演化模型。生成数千个此类模型需要数百CPU小时，但恒星轨迹插值为显著降低这一计算成本提供了一种途径。尽管单星轨迹插值较为直接，但双星系统中的恒星相互作用为双星演化引入了显著的复杂性，使得传统的单星轨迹插值方法不再适用。与单星相比，双星轨迹带来了根本不同的挑战：单星具有相对直接的演化阶段，可通过不同的物理性质进行识别；而双星系统则因相互作用的复杂性而变得棘手，这些相互作用可能剧烈改变演化轨迹，并引入难以通过标准插值捕捉的不连续性。在本研究中，我们提出了一种基于动态时间规整（Dynamic Time Warping）的新方法，用于轨迹对齐和迭代轨迹平均，以解决相邻轨迹间的错位问题。我们的方法同时计算所有物理参数的单一共享规整路径，将其置于一致的时间网格上，从而保持参数间的因果关系。我们证明，这种联合对齐策略在插值轨迹中保持了关键物理关系（如斯特藩-玻尔兹曼定律）。通过对多种双星构型的全面评估，我们证明适当的时间对齐对于轨迹插值方法至关重要。所提出的方法始终优于现有方法，并能为天体物理研究高效生成更准确的双星种群样本。

Abstract

Binary stellar evolution simulations are computationally expensive. Stellar population synthesis relies on these detailed evolution models at a fundamental level. Producing thousands of such models requires hundreds of CPU hours, but stellar track interpolation provides one approach to significantly reduce this computational cost. Although single-star track interpolation is straightforward, stellar interactions in binary systems introduce significant complexity to binary evolution, making traditional single-track interpolation methods inapplicable. Binary tracks present fundamentally different challenges compared to single stars, which possess relatively straightforward evolutionary phases identifiable through distinct physical properties. Binary systems are complicated by mutual interactions that can dramatically alter evolutionary trajectories and introduce discontinuities difficult to capture through standard interpolation. In this work, we introduce a novel approach for track alignment and iterative track averaging based on Dynamic Time Warping to address misalignments between neighboring tracks. Our method computes a single shared warping path across all physical parameters simultaneously, placing them on a consistent temporal grid that preserves the causal relationships between parameters. We demonstrate that this joint-alignment strategy maintains key physical relationships such as the Stefan-Boltzmann law in the interpolated tracks. Our comprehensive evaluation across multiple binary configurations demonstrates that proper temporal alignment is crucial for track interpolation methods. The proposed method consistently outperforms existing approaches and enables the efficient generation of more accurate binary population samples for astrophysical studies.

Keywords: binary stellar evolution, time series interpolation, dynamic time warping, track alignment, stellar population synthesis, computational astrophysics, simulation acceleration
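The abstract above describes computing a single warping path shared by all physical parameters. A minimal sketch of that idea, assuming textbook DTW with a Euclidean distance taken jointly over every parameter of a sample, is given below. The function names, the two-parameter samples, and the simple pairwise blend in `average_along_path` are illustrative assumptions and do not reproduce the paper's iterative averaging procedure.

```python
import math


def joint_dtw_path(track_a, track_b):
    """Align two multivariate tracks with one shared warping path.

    Each track is a list of samples; each sample is a tuple of parameters.
    Returns (total_cost, path), where path is a list of (i, j) index pairs.
    """
    n, m = len(track_a), len(track_b)
    inf = math.inf
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # One distance over all parameters, so every parameter shares the path.
            d = math.dist(track_a[i - 1], track_b[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the end to recover the optimal path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return acc[n][m], path


def average_along_path(track_a, track_b, path, w=0.5):
    """Blend the two tracks sample by sample on the shared, aligned grid."""
    return [
        tuple(w * x + (1 - w) * y for x, y in zip(track_a[i], track_b[j]))
        for i, j in path
    ]


# Hypothetical two-parameter tracks; track_b lingers longer in its first phase.
track_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
track_b = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
cost, path = joint_dtw_path(track_a, track_b)
blended = average_along_path(track_a, track_b, path)
print(cost, path, blended)
```

Because one path is used for every parameter, the blended samples always combine values that belong to the same aligned instant in both tracks, which is the property the abstract credits with preserving relationships between parameters.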

270. ❌ Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Authors: Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, Xuanjing Huang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13602v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

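Scoring rationale: The paper's core topic is reward hacking in RLHF/RLAIF alignment paradigms, so it is highly relevant to LLMs, alignment, and RLHF/RLAIF/DPO (10 points each). It discusses systemic vulnerabilities that emerge as models scale, which relates to Scaling Laws (5 points), and it touches on supervised fine-tuning (SFT) as a foundation for alignment (5 points). Reward hacking can bias self-improvement (5 points) and affect agentic behavior (5 points). It is related to hallucination mitigation (8 points), since reward hacking includes hallucinated justification, and analyzing its mechanisms calls for interpretability (5 points). Other keywords such as MoE, SLMs, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

The paper systematically surveys reward hacking in the alignment of large language models, proposes the Proxy Compression Hypothesis as a framework for explaining its mechanism, and organizes detection and mitigation strategies.

Abstract (Chinese translation)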

基于人类反馈的强化学习（RLHF）及其相关对齐范式已成为引导大语言模型（LLMs）与多模态大语言模型（MLLMs）符合人类偏好行为的核心方法。然而，这些方法引入了一种系统性漏洞：奖励破解，即模型利用习得奖励信号中的缺陷来最大化代理目标，却未真正实现任务意图。随着模型规模扩大与优化强度提升，此类利用行为表现为冗长偏好、谄媚倾向、幻觉合理化、基准过拟合，以及在多模态场景下的感知-推理脱节与评估器操纵。最新证据进一步表明，看似良性的捷径行为可能泛化为更广泛的对齐失范形式，包括欺骗与对监督机制的策略性博弈。本综述提出“代理压缩假说”（Proxy Compression Hypothesis, PCH）作为理解奖励破解的统一框架。我们将奖励破解形式化地定义为：针对高维人类目标进行压缩奖励表征优化时，表达能力强的策略所涌现的必然结果。在此视角下，奖励破解源于目标压缩、优化放大以及评估器-策略协同适应三者间的相互作用。这一观点统一了RLHF、RLAIF与RLVR范式中的实证现象，并解释了局部捷径学习如何泛化为更广泛的对齐失范形式，包括欺骗与对监督机制的策略性操纵。我们进一步依据干预压缩、放大或协同适应动态的维度，系统梳理了检测与缓解策略。通过将奖励破解界定为规模化背景下基于代理的对齐方法的结构性失稳，本文强调了可扩展监督、多模态 grounding 与智能体自主性等领域面临的开放挑战。

Abstract

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception–reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator–policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.

Keywords: Reward Hacking, RLHF, Alignment, Large Language Models, Proxy Compression Hypothesis, Misalignment, Scalable Oversight, Multimodal LLMs

271. ❌ Data-driven Learning of Probabilistic Model of Binary Droplet Collision for Spray Simulation

Authors: Weiming Xu, Tao Yang, Peng Zhang Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

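Scoring rationale: The paper uses LightGBM, a machine learning method, to develop a probabilistic model of binary droplet collisions, which is an application of AI to science (specifically fluid mechanics and spray simulation). It is therefore partially relevant only to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) and unrelated to all other keywords, which concern large models, deep learning principles, training methods, inference optimization, agents, and similar topics (0 points).

!!! tip deepseek-chat TL;DR

Because traditional deterministic models cannot adequately represent the transitional and stochastic behavior of binary droplet collisions, the study develops the first probabilistic, high-dimensional droplet collision model derived from experimental data. A LightGBM classifier reaches 99.2% classification accuracy and is then translated into a probabilistic form suitable for spray simulation.

Abstract (Chinese translation)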

二元液滴碰撞在密集喷雾中普遍存在。传统确定性模型无法充分表征二元液滴碰撞的过渡性与随机性行为。为弥补这一不足，本研究采用机器学习方法——轻量梯度提升机（LightGBM）——开发了一种概率模型。该模型基于包含33,540组实验案例的综合性数据集进行训练，这些数据涵盖了韦伯数、奥内佐格数、碰撞参数、尺寸比和环境压力广泛变化范围内的八种碰撞状态。所构建的机器学习分类器能以99.2%的准确率捕捉高度非线性的状态边界，并在过渡区域保持敏感性。为便于在喷雾模拟中实施，该模型被转化为概率形式——多项逻辑回归模型，其保留了93.2%的准确率，并能映射连续的状态间过渡过程。随后通过偏置骰子采样机制将这些概率转化为确定但具有随机性的结果。本研究提出了首个基于实验数据推导的概率性高维液滴碰撞模型，为喷雾模拟提供了物理一致、全面且用户友好的解决方案。

Abstract

Binary droplet collisions are ubiquitous in dense sprays. Traditional deterministic models cannot adequately represent transitional and stochastic behaviors of binary droplet collision. To bridge this gap, we developed a probabilistic model by using a machine learning approach, the Light Gradient-Boosting Machine (LightGBM). The model was trained on a comprehensive dataset of 33,540 experimental cases covering eight collision regimes across broad ranges of Weber number, Ohnesorge number, impact parameter, size ratio, and ambient pressure. The resulting machine learning classifier captures highly nonlinear regime boundaries with 99.2% accuracy and retains sensitivity in transitional regions. To facilitate its implementation in spray simulation, the model was translated into a probabilistic form, a multinomial logistic regression, which preserves 93.2% accuracy and maps continuous inter-regime transitions. A biased-dice sampling mechanism then converts these probabilities into definite yet stochastic outcomes. This work presents the first probabilistic, high-dimensional droplet collision model derived from experimental data, offering a physically consistent, comprehensive, and user-friendly solution for spray simulation.

Keywords: binary droplet collision, probabilistic model, spray simulation, LightGBM, machine learning, regime classification, multinomial logistic regression, experimental data
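The abstract above describes turning multinomial logistic regression probabilities into a definite outcome through a biased-dice sampling mechanism. The sketch below shows one common way to do this, assuming a softmax over per-regime scores followed by inverse-CDF sampling. The logits, the number of regimes, and the function names are hypothetical; the paper's fitted coefficients and regime labels are not reproduced.

```python
import math
import random


def softmax(logits):
    """Convert per-regime scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def roll_biased_die(probs, rng):
    """Sample an index i with probability probs[i] via the cumulative distribution."""
    u = rng.random()
    cumulative = 0.0
    for idx, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return idx
    return len(probs) - 1  # guard against floating-point round-off


# Hypothetical scores for three collision regimes at one set of collision conditions.
probs = softmax([2.0, 1.0, 0.1])
outcome = roll_biased_die(probs, random.Random(42))
print(probs, outcome)
```

Each call returns one definite regime, while repeated calls under the same conditions reproduce the predicted regime frequencies, which suits transitional regions where several outcomes are plausible.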

272. ❌ Parameter-efficient Quantum Multi-task Learning

Authors: Hevish Cowlessur, Chandra Thapa, Tansu Alpcan, Seyit Camtepe Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13560v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

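Scoring rationale: The paper studies a parameter-efficient multi-task learning framework in quantum machine learning and is unrelated to most keywords. The only highly relevant keyword is 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), because the paper centers on parameter-efficient learning and proposes a quantum prediction head that reduces task-specific parameters. 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points) is partially relevant because the paper includes applications in scientific domains such as medical imaging, though this is not its core. The remaining keywords concern large models, alignment, reasoning, and similar topics that the paper does not address.

!!! tip deepseek-chat TL;DR

The paper proposes a parameter-efficient quantum multi-task learning framework that replaces conventional linear heads with a fully quantum prediction head, matching or exceeding classical baselines on several benchmarks while substantially reducing the number of parameters.

Abstract (Chinese translation)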

多任务学习（MTL）通过共享表征联合学习相关任务，从而提升泛化能力和数据效率。在广泛使用的硬参数共享设置中，共享主干网络与任务特定的预测头相结合。然而，任务特定参数可能随任务数量快速增长。因此，设计既能保持任务特异性又能提升参数效率的多任务头，仍是一个关键挑战。在量子机器学习（QML）中，变分量子电路（VQCs）提供了一种紧凑机制，可将经典数据映射到位于高维希尔伯特空间中的量子态，从而在受限参数预算内实现富有表达力的表征。我们提出一种参数高效的量子多任务学习（QMTL）框架，该框架在混合架构中用全量子预测头取代了传统的任务特定线性头。该模型包含一个具有共享的、任务无关的量子编码阶段的VQC，其后接轻量级的任务特定ansatz模块，这些模块能够在保持紧凑参数化的同时实现局部任务适应。在一个共享表征维度随任务数量增长的、受控且容量匹配的设定下，我们的参数缩放分析表明，标准经典头的参数呈二次增长，而所提出的量子头的参数成本仅呈线性增长。我们在涵盖自然语言处理、医学成像和多模态讽刺检测的三个多任务基准上评估了QMTL，其性能达到甚至在某些情况下超越了经典硬参数共享基线，同时始终以显著更少的头部参数优于现有的混合量子MTL模型。我们进一步证明了QMTL在含噪声模拟器和真实量子硬件上的可执行性，展现了其可行性。

Abstract

Multi-task learning (MTL) improves generalization and data efficiency by jointly learning related tasks through shared representations. In the widely used hard-parameter-sharing setting, a shared backbone is combined with task-specific prediction heads. However, task-specific parameters can grow rapidly with the number of tasks. Therefore, designing multi-task heads that preserve task specialization while improving parameter efficiency remains a key challenge. In Quantum Machine Learning (QML), variational quantum circuits (VQCs) provide a compact mechanism for mapping classical data to quantum states residing in high-dimensional Hilbert spaces, enabling expressive representations within constrained parameter budgets. We propose a parameter-efficient quantum multi-task learning (QMTL) framework that replaces conventional task-specific linear heads with a fully quantum prediction head in a hybrid architecture. The model consists of a VQC with a shared, task-independent quantum encoding stage, followed by lightweight task-specific ansatz blocks enabling localized task adaptation while maintaining compact parameterization. Under a controlled and capacity-matched formulation where the shared representation dimension grows with the number of tasks, our parameter-scaling analysis demonstrates that a standard classical head exhibits quadratic growth, whereas the proposed quantum head parameter cost scales linearly. We evaluate QMTL on three multi-task benchmarks spanning natural language processing, medical imaging, and multimodal sarcasm detection, where we achieve performance comparable to, and in some cases exceeding, classical hard-parameter-sharing baselines while consistently outperforming existing hybrid quantum MTL models with substantially fewer head parameters. We further demonstrate QMTL’s executability on noisy simulators and real quantum hardware, illustrating its feasibility.

Keywords: Quantum Machine Learning, Multi-task Learning, Parameter-efficient, Variational Quantum Circuits, Hybrid Architecture, Task-specific Adaptation, Model Compression, Quantum Hardware
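The abstract above contrasts quadratic growth of classical head parameters with linear growth for the quantum head when the shared representation dimension grows with the number of tasks. The counting sketch below shows how such a contrast can arise under two stated assumptions: the shared dimension is `dim_per_task * num_tasks`, with one dense linear head per task, and each task-specific quantum block has a fixed parameter count. All names and numbers are hypothetical and are not taken from the paper.

```python
def classical_head_params(num_tasks, dim_per_task, outputs_per_task):
    """Total parameters of one dense linear head (weights + biases) per task."""
    shared_dim = dim_per_task * num_tasks  # assumed: representation grows with task count
    per_head = shared_dim * outputs_per_task + outputs_per_task
    return num_tasks * per_head  # grows like num_tasks ** 2


def quantum_head_params(num_tasks, params_per_block, shared_params=0):
    """Total parameters if every task adds one fixed-size ansatz block (assumed)."""
    return shared_params + num_tasks * params_per_block  # grows like num_tasks


for tasks in (2, 4, 8):
    print(tasks, classical_head_params(tasks, 4, 3), quantum_head_params(tasks, 12))
```

Under these assumptions, doubling the number of tasks roughly quadruples the classical count and only doubles the quantum one.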

273. ❌ Learning Inference Concurrency in DynamicGate MLP Structural and Mathematical Justification

Authors: Yongil Choi Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

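Scoring rationale: The paper studies the DynamicGate MLP structure, a neural network architecture that allows learning and inference to run concurrently and achieves online adaptation by separating routing parameters from representation parameters. Although it touches on online learning and on-device learning systems, the keywords all target large language models (LLMs) or applications of deep learning to science, whereas this paper concerns a general neural network architecture. It does not involve specific techniques such as LLMs, MoE, quantization, inference acceleration, or alignment, nor is it applied to scientific fields such as bioinformatics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents the DynamicGate MLP neural network structure, which separates routing parameters from representation parameters so that learning and inference can run concurrently, providing a theoretical basis for online adaptive and on-device learning systems.

Abstract (Chinese translation)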

传统神经网络严格区分学习与推理阶段，因为若在推理过程中更新参数，会导致输出不稳定，甚至使推理函数本身无法明确定义[1, 2, 3]。本文证明，DynamicGate MLP（多层感知机）在结构上允许学习与推理并发执行[4, 5]。其核心思想是将路由（门控）参数与表征（预测）参数分离，使得门控机制能够在线自适应调整，同时保持推理稳定性；或仅选择性地在非活跃子空间内更新权重[4, 5, 6, 7]。我们通过数学形式化提出了实现并发的充分条件，并证明即使在异步或部分参数更新的情况下，每个时间步的推理输出始终可被解释为某个有效模型快照的前向计算结果[8, 9, 10]。这表明DynamicGate MLP能够为在线自适应学习与端侧学习系统提供实用基础[11, 12]。

Abstract

Conventional neural networks strictly separate learning and inference because if parameters are updated during inference, outputs become unstable and even the inference function itself is not well defined [1, 2, 3]. This paper shows that DynamicGate MLP structurally permits learning inference concurrency [4, 5]. The key idea is to separate routing (gating) parameters from representation (prediction) parameters, so that the gate can be adapted online while inference stability is preserved, or weights can be selectively updated only within the inactive subspace [4, 5, 6, 7]. We mathematically formalize sufficient conditions for concurrency and show that even under asynchronous or partial updates, the inference output at each time step can always be interpreted as a forward computation of a valid model snapshot [8, 9, 10]. This suggests that DynamicGate MLP can serve as a practical foundation for online adaptive and on device learning systems [11, 12].

Keywords: DynamicGate MLP, learning inference concurrency, online adaptive learning, on-device learning, routing parameters, representation parameters, asynchronous updates, model snapshot
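The abstract above states that weights can be selectively updated only within the inactive subspace while inference stays stable. The toy sketch below illustrates that one property, assuming a single hidden layer with a fixed hard 0/1 gate per hidden unit. The layer shape, the fixed gate, and all names are assumptions made for illustration; the paper's gate is presumably learned and may depend on the input, and its formal conditions are not reproduced.

```python
def relu(z):
    return max(0.0, z)


def gated_forward(x, hidden_weights, output_weights, gate):
    """Compute y = sum_j gate[j] * relu(w_j . x) * v_j with a hard 0/1 gate."""
    y = 0.0
    for w_j, v_j, g_j in zip(hidden_weights, output_weights, gate):
        if g_j:  # gated-off units contribute nothing to the output
            y += relu(sum(wi * xi for wi, xi in zip(w_j, x))) * v_j
    return y


def update_inactive(hidden_weights, gate, delta):
    """Return new weights in which only gated-off units are shifted by delta."""
    return [
        list(w_j) if g_j else [wi + delta for wi in w_j]
        for w_j, g_j in zip(hidden_weights, gate)
    ]


x = [1.0, 2.0]
W = [[0.5, -0.1], [0.3, 0.4], [-0.2, 0.9]]  # hypothetical representation weights
v = [1.0, -2.0, 0.5]
gate = [1, 0, 1]  # the second hidden unit is inactive

before = gated_forward(x, W, v, gate)
W_new = update_inactive(W, gate, delta=0.25)  # a "learning" step on the inactive unit
after = gated_forward(x, W_new, v, gate)
print(before, after)
```

Since the updated unit is masked out, the forward pass before and after the update computes the same value, so an inference request served during the update still corresponds to a valid model snapshot in this toy setting.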

274. ❌ Cross-Layer Co-Optimized LSTM Accelerator for Real-Time Gait Analysis

Authors: Mohammad Hasan Ahmadilivani, Levent Aksoy, Mohammad Eslami, Jaan Raik, Alar Kuusik Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13543v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

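Scoring rationale: The paper focuses on the design of a hardware accelerator for LSTM neural networks for real-time gait analysis, which falls under edge computing and medical AI applications. It is unrelated to most keywords, which mainly concern large language model techniques, training methods, and inference optimization. It is partially relevant to 'Quantization OR Model Compression OR Low-bit Weights' (5 points), because it describes hardware-aware bit-width optimization to reduce hardware complexity, and to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points), because gait analysis is a bioinformatics/medical AI application. No other keyword is covered, so the rest score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a cross-layer co-optimized LSTM accelerator ASIC design for real-time gait analysis. Through hardware-aware bit-width optimization and design space exploration, it achieves high-accuracy detection in a 65 nm process and runs 4.05x faster than the application requires.

Abstract (Chinese translation)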

长短期记忆（Long Short-Term Memory, LSTM）神经网络已广泛应用于对实时性要求严格且需边缘计算能力的医疗健康领域。步态分析——通过检测异常步伐以防止患者跌倒——是此类应用中的一个重要课题。鉴于在性能、功耗和面积方面极为严格的设计要求，专用集成电路（Application-Specific Integrated Circuit, ASIC）能够高效实时地利用LSTM进行步态分析，并实现高精度。据我们所知，本研究首次提出了一种面向ASIC设计、跨层协同优化的LSTM加速器，用于实时步态分析。我们从软件层至版图设计层进行了全面的设计空间探索：在软件层面实施硬件感知量化以优化位宽，降低硬件复杂度；在寄存器传输级探索多种设计方案；并通过生成不同版图布局，以在硬件复杂度和精度之间寻求高效的LSTM加速器实现方案。物理综合结果表明，采用65纳米工艺，针对最高精度优化的加速器版图芯片面积为0.325平方毫米，而针对硬件复杂度优化、精度略低的替代设计面积减少了15.4%。此外，所设计的加速器能够以比给定应用要求快4.05倍的速度实现精确的步态异常检测。

Abstract

Long Short-Term Memory (LSTM) neural networks have penetrated healthcare applications where real-time requirements and edge computing capabilities are essential. Gait analysis that detects abnormal steps to prevent patients from falling is a prominent problem for such applications. Given the extremely stringent design requirements in performance, power dissipation, and area, an Application-Specific Integrated Circuit (ASIC) enables an efficient real-time exploitation of LSTMs for gait analysis, achieving high accuracy. To the best of our knowledge, this work presents the first cross-layer co-optimized LSTM accelerator for real-time gait analysis, targeting an ASIC design. We conduct a comprehensive design space exploration from software down to layout design. We carry out a bit-width optimization at the software level with hardware-aware quantization to reduce the hardware complexity, explore various designs at the register-transfer level, and generate alternative layouts to find efficient realizations of the LSTM accelerator in terms of hardware complexity and accuracy. The physical synthesis results show that, using the 65 nm technology, the die size of the accelerator’s layout optimized for the highest accuracy is 0.325 mm^2, while the alternative design optimized for hardware complexity with a slightly lower accuracy occupies 15.4% smaller area. Moreover, the designed accelerators achieve accurate gait abnormality detection 4.05x faster than the given application requirement.

Keywords: LSTM accelerator, gait analysis, ASIC design, hardware-aware quantization, real-time processing, cross-layer co-optimization, edge computing, healthcare applications
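The abstract above mentions bit-width optimization at the software level with hardware-aware quantization. The sketch below shows the general idea, assuming signed fixed-point quantization with saturation and a search for the narrowest word length that keeps the worst-case rounding error under a tolerance. The example weights, the tolerance, and the function names are hypothetical; the paper's actual search is driven by task accuracy and hardware cost, which are not modeled here.

```python
def quantize_fixed_point(x, total_bits, frac_bits):
    """Round x to signed fixed point with frac_bits fractional bits, saturating on overflow."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(x * scale)))  # Python rounds halves to even
    return q / scale


def max_abs_error(values, total_bits, frac_bits):
    """Worst-case absolute quantization error over the given values."""
    return max(abs(v - quantize_fixed_point(v, total_bits, frac_bits)) for v in values)


def smallest_bit_width(values, int_bits, tolerance, max_bits=16):
    """Smallest total width (int_bits includes the sign bit) that meets the tolerance."""
    for total_bits in range(int_bits + 1, max_bits + 1):
        if max_abs_error(values, total_bits, total_bits - int_bits) <= tolerance:
            return total_bits
    return None  # no width up to max_bits is accurate enough


weights = [0.3, -0.75, 0.5]  # hypothetical LSTM weights in [-1, 1)
width = smallest_bit_width(weights, int_bits=1, tolerance=0.01)
print(width, [quantize_fixed_point(w, width, width - 1) for w in weights])
```

Narrower words shrink multipliers and memories in hardware, which is why a search like this trades a bounded loss of precision for lower hardware complexity.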

275. ❌ Robust Low-Rank Tensor Completion based on M-product with Weighted Correlated Total Variation and Sparse Regularization

Authors: Biswarup Karmakar, Ratikanta Behera Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13525v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

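Scoring rationale: The paper focuses on mathematical optimization for low-rank tensor completion, proposing an M-product-based method with weighted correlated total variation and sparse regularization to handle missing values, outliers, and noise in high-dimensional tensor data. Its content belongs to traditional numerical computation, optimization algorithms, and signal processing, and it is unrelated to all of the listed keywords on large models, deep learning, and AI applications. It does not address language models, model training, inference optimization, AI agents, or AI for science.

!!! tip deepseek-chat TL;DR

The paper proposes an M-product-based method with weighted correlated total variation and sparse regularization for robust low-rank tensor completion, which outperforms established benchmark methods on image completion, denoising, and background subtraction tasks.

Abstract (Chinese translation)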

鲁棒低秩张量补全问题旨在解决现实应用中普遍存在的、具有缺失条目、异常值和稀疏噪声的受损高维张量数据恢复挑战。现有方法因其对统一正则化方案的依赖而遇到根本性局限，特别是张量核范数和 $\ell_1$ 范数正则化方法，这些方法对所有奇异值和稀疏分量不加区分地施加同等收缩，从而损害了关键张量结构的保留。所提出的张量加权相关全变分（TWCTV）正则化器通过 $M$-积框架解决了这些不足，该框架结合了梯度张量上的加权 Schatten-$p$ 范数以增强低秩性与平滑性，以及用于噪声抑制的加权稀疏分量。所提出的加权方案自适应地降低阈值水平，以同时保留主导奇异值和稀疏分量，从而改善了恢复信号中关键结构元素和细微细节的重建。通过系统化的算法设计，我们引入了一种增强的交替方向乘子法（ADMM），该方法兼具计算效率与理论依据，其收敛性质在 $M$-积框架内得到了全面分析。在图像补全、去噪和背景减除任务上进行的大量数值评估验证了该方法相对于现有基准方法的优越性能。

Abstract

The robust low-rank tensor completion problem addresses the challenge of recovering corrupted high-dimensional tensor data with missing entries, outliers, and sparse noise commonly found in real-world applications. Existing methodologies have encountered fundamental limitations due to their reliance on uniform regularization schemes, particularly the tensor nuclear norm and $\ell_1$ norm regularization approaches, which indiscriminately apply equal shrinkage to all singular values and sparse components, thereby compromising the preservation of critical tensor structures. The proposed tensor weighted correlated total variation (TWCTV) regularizer addresses these shortcomings through an $M$-product framework that combines a weighted Schatten-$p$ norm on gradient tensors for low-rankness with smoothness enforcement and weighted sparse components for noise suppression. The proposed weighting scheme adaptively reduces the thresholding level to preserve both dominant singular values and sparse components, thus improving the reconstruction of critical structural elements and nuanced details in the recovered signal. Through a systematic algorithmic approach, we introduce an enhanced alternating direction method of multipliers (ADMM) that offers both computational efficiency and theoretical substantiation, with convergence properties comprehensively analyzed within the $M$-product framework. Comprehensive numerical evaluations across image completion, denoising, and background subtraction tasks validate the superior performance of this approach relative to established benchmark methods.

Keywords: Robust Low-Rank Tensor Completion, M-product, Weighted Correlated Total Variation, Sparse Regularization, Alternating Direction Method of Multipliers, Image Completion, Denoising, Background Subtraction

276. ❌ LEGO-MOF: Equivariant Latent Manipulation for Editable, Generative, and Optimizable MOF Design

Authors: Chaoran Zhang, Guangyao Li, Dongxu Ji Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13520v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

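Scoring rationale: The paper focuses on deep learning for metal-organic framework (MOF) materials design. Its core contributions are LinkerVAE (a variational autoencoder) and a test-time optimization strategy for continuous structural editing and property optimization. The content is entirely unrelated to the vast majority of the keywords (LLM techniques, training methods, inference optimization, agents, and so on), because those keywords refer specifically to large-model techniques in natural language processing. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is a concrete application of AI for Science to materials science (which can be regarded as a field related to cheminformatics), so it receives 10 points (highly relevant).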

!!! tip deepseek-chat TL;DR

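This study addresses the difficulty of exploring the metal-organic framework (MOF) design space by proposing a generative framework built on an SE(3)-equivariant latent space, which enables continuous editing and optimization of MOF structures and thereby significantly improves CO2 capture performance.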

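Abstract translation

Metal-organic frameworks (MOFs) are highly promising for carbon capture, but their vast design space remains hard to explore effectively. Existing deep generative models can design MOFs de novo, but they act mainly as feed-forward structure generators. These methods rely heavily on predefined building-block libraries and non-differentiable post-optimization, which in essence severs the information flow needed for continuous structural editing. This paper proposes a target-driven generative framework centered on continuous structural manipulation. Its key component, LinkerVAE, maps discrete 3D chemical graphs into a continuous, SE(3)-equivariant latent space. This smooth manifold enables geometry-aware structural manipulation, including implicit chemical style transfer and zero-shot isoreticular expansion. On this basis, we introduce a test-time optimization strategy that uses an accurate surrogate model to continuously optimize the graph structures of existing MOFs in latent space, steering them toward target properties. The method systematically improves carbon capture performance, raising pure CO2 uptake by an average of 147.5% in relative terms while strictly preserving structural validity. Combined with a latent diffusion model and rigid-body assembly for full MOF construction, the framework establishes a scalable, fully differentiable route to the automated discovery, targeted optimization, and editing of functional materials.

Abstract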

Metal-organic frameworks (MOFs) are highly promising for carbon capture, yet navigating their vast design space remains challenging. Recent deep generative models enable de novo MOF design but primarily act as feed-forward structure generators. By heavily relying on predefined building block libraries and non-differentiable post-optimization, they fundamentally sever the information flow required for continuous structural editing. Here, we propose a target-driven generative framework focused on continuous structural manipulation. At its core is LinkerVAE, which maps discrete 3D chemical graphs into a continuous, SE(3)-equivariant latent space. This smooth manifold unlocks geometry-aware manipulations, including implicit chemical style transfer and zero-shot isoreticular expansion. Building upon this, we introduce a test-time optimization (TTO) strategy, utilizing an accurate surrogate model to continuously optimize the latent graphs of existing MOFs toward desired properties. This approach systematically enhances carbon capture performance, achieving a striking average relative boost of 147.5% in pure CO2 uptake while strictly preserving structural validity. Integrated with a latent diffusion model and rigid-body assembly for full MOF construction, our framework establishes a scalable, fully differentiable pathway for both the automated discovery, targeted optimization and editing of functional materials.

Keywords: Metal-organic frameworks, Generative model, SE(3)-equivariant latent space, Continuous structural manipulation, Test-time optimization, Carbon capture, LinkerVAE, Latent diffusion model

277. ❌ Joint Representation Learning and Clustering via Gradient-Based Manifold Optimization

Authors: Sida Liu, Yangzi Guo, Mingyuan Wang Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13484v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

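Scoring rationale: The paper studies the classical machine learning problem of jointly optimizing clustering and dimensionality reduction, using manifold optimization, with experiments on simulated data and the MNIST image dataset. All keywords concern large language models, innovations in deep learning techniques, or AI applications in scientific domains, none of which the paper touches on, so every keyword receives a relevance of 0.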

!!! tip deepseek-chat TL;DR

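The paper proposes a joint dimensionality reduction and clustering framework based on gradient manifold optimization, which outperforms existing clustering algorithms on simulated data and the MNIST dataset.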

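Abstract translation

Clustering and dimensionality reduction have long been core topics in machine learning and computer vision. Clustering high-dimensional data has long been challenging because of the curse of dimensionality. A more promising direction is therefore to learn dimensionality reduction and clustering jointly. This work proposes a manifold learning framework that learns dimensionality reduction and clustering simultaneously. The framework jointly learns the parameters of a dimensionality reduction technique (such as a linear projection or a neural network) and clusters the data based on the resulting features (for example, under a Gaussian mixture model). It searches over a manifold for the dimensionality reduction parameters and the optimal cluster assignment using gradient manifold optimization. We illustrate the framework with a Gaussian mixture model as a simple and efficient example, in a process that somewhat resembles unsupervised linear discriminant analysis. We apply the method to unsupervised training on simulated data and to experiments on a benchmark image dataset (MNIST). The results show that the algorithm outperforms clustering algorithms commonly found in the literature.

Abstract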

Clustering and dimensionality reduction have been crucial topics in machine learning and computer vision. Clustering high-dimensional data has been challenging for a long time due to the curse of dimensionality. For that reason, a more promising direction is the joint learning of dimension reduction and clustering. In this work, we propose a Manifold Learning Framework that learns dimensionality reduction and clustering simultaneously. The proposed framework is able to jointly learn the parameters of a dimension reduction technique (e.g. linear projection or a neural network) and cluster the data based on the resulting features (e.g. under a Gaussian Mixture Model framework). The framework searches for the dimension reduction parameters and the optimal clusters by traversing a manifold, using Gradient Manifold Optimization. The proposed framework is exemplified with a Gaussian Mixture Model as one simple but efficient example, in a process that is somewhat similar to unsupervised Linear Discriminant Analysis (LDA). We apply the proposed method to the unsupervised training of simulated data as well as a benchmark image dataset (i.e. MNIST). The experimental results indicate that our algorithm has better performance than popular clustering algorithms from the literature.

Keywords: clustering, dimensionality reduction, manifold learning, gradient manifold optimization, Gaussian Mixture Model, unsupervised learning, joint learning, MNIST dataset

278. ❌ Computational framework for multistep metabolic pathway design

Authors: Peter Zhiping Zhang, Jeffrey D. Varner Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13471v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

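Scoring rationale: The paper develops a computational framework that uses deep learning for metabolic pathway design, which falls under bioinformatics/computational biology. It explicitly uses deep learning (neural networks) for reaction classification and pathway ranking, which is directly and highly relevant to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics', so it receives 10 points. However, the core of the paper is a domain-specific deep learning application; it does not involve large language models (LLMs), architectural innovations (such as MoE), training and alignment techniques (such as RLHF or SFT), inference optimization, agent systems, or general AI principles. All other keywords have no direct connection to the paper and are scored 0.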

!!! tip deepseek-chat TL;DR

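The study develops a computational framework that combines deep learning with the traditional retrobiosynthesis workflow to improve the design of multistep metabolic pathways, and computationally reproduces some natural and non-natural pathways.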

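Abstract translation

In silico tools are important for generating new hypotheses and exploring alternatives in de novo metabolic pathway design. Although many computational frameworks for retrobiosynthesis have been proposed, successful cases of algorithm-guided xenobiotic biochemical retrosynthesis remain rare in the literature. Deep learning has markedly improved the quality of synthesis and retrosynthesis in organic chemistry. Inspired by this progress, we explore combining deep learning of biochemical transformations with the traditional retrobiosynthesis workflow to improve in silico design of synthetic metabolic pathways. To build the computational biosynthetic pathway design framework, we assembled metabolic reaction and enzymatic reaction template data from public databases. Using a data augmentation procedure adapted from the literature, we enriched the assembled reaction dataset with artificial metabolic reactions generated from enzymatic reaction templates. We trained two neural-network-based pathway ranking models as binary classifiers that distinguish assembled reactions from artificially generated ones; each model outputs a scalar that quantifies the plausibility of a one-step or two-step pathway. Combining the two models with enzymatic reaction templates, we built a multistep retrobiosynthesis pipeline and validated it by computationally reproducing some natural and non-natural pathways.

Abstract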

In silico tools are important for generating novel hypotheses and exploring alternatives in de novo metabolic pathway design. However, while many computational frameworks have been proposed for retrobiosynthesis, few successful examples of algorithm-guided xenobiotic biochemical retrosynthesis have been reported in the literature. Deep learning has improved the quality of synthesis and retrosynthesis in organic chemistry applications. Inspired by this progress, we explored combining deep learning of biochemical transformations with the traditional retrobiosynthetic workflow to improve in silico synthetic metabolic pathway designs. To develop our computational biosynthetic pathway design framework, we assembled metabolic reaction and enzymatic template data from public databases. A data augmentation procedure, adapted from literature, was carried out to enrich the assembled reaction dataset with artificial metabolic reactions generated by enzymatic reaction templates. Two neural network-based pathway ranking models were trained as binary classifiers to distinguish assembled reactions from artificial counterparts; each model output a scalar quantifying the plausibility of a 1-step or 2-step pathway. Combining these two models with enzymatic templates, we built a multistep retrobiosynthesis pipeline and validated it by reproducing some natural and non-natural pathways computationally.

Keywords: metabolic pathway design, retrobiosynthesis, deep learning, neural network, biochemical transformations, in silico, computational framework, data augmentation

279. ❌ Universality of Gaussian-Mixture Reverse Kernels in Conditional Diffusion

Authors: Nafiz Ishtiaque, Syed Arefinul Haque, Kazi Ashraful Alam, Fatima Jahara Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13470v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

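Scoring rationale: The paper studies theoretical properties of conditional diffusion models, specifically the universality of Gaussian-mixture reverse kernels, and belongs to the theoretical analysis of generative models. All scoring keywords target large language models (LLMs) and related techniques (training, alignment, inference, applications, and so on), whereas this paper does not involve language models or natural language processing at all and concentrates on mathematical proofs for diffusion models, so it is unrelated to every keyword.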

!!! tip deepseek-chat TL;DR

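The paper proves that conditional diffusion models with Gaussian-mixture reverse kernels and ReLU-network logits can approximate regular target distributions arbitrarily well in conditional KL divergence, and shows that the neural reverse-kernel class is dense under exact terminal matching.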

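Abstract translation

We prove that a conditional diffusion model whose reverse kernels are finite Gaussian mixtures with ReLU-network logits can approximate suitably regular target distributions arbitrarily well in context-averaged conditional KL divergence, up to a single irreducible terminal mismatch term that typically vanishes as the diffusion horizon grows. Via a path-space decomposition, the output error reduces to the sum of this terminal mismatch and the per-step reverse-kernel errors; assuming each reverse kernel factors through a finite-dimensional feature map, each step becomes a static conditional density approximation problem, which is solved by combining Norets' Gaussian-mixture theory with quantitative ReLU bounds. Under exact terminal matching, the resulting neural reverse-kernel class is dense in the sense of conditional KL divergence.

Abstract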

We prove that conditional diffusion models whose reverse kernels are finite Gaussian mixtures with ReLU-network logits can approximate suitably regular target distributions arbitrarily well in context-averaged conditional KL divergence, up to an irreducible terminal mismatch that typically vanishes with increasing diffusion horizon. A path-space decomposition reduces the output error to this mismatch plus per-step reverse-kernel errors; assuming each reverse kernel factors through a finite-dimensional feature map, each step becomes a static conditional density approximation problem, solved by composing Norets’ Gaussian-mixture theory with quantitative ReLU bounds. Under exact terminal matching the resulting neural reverse-kernel class is dense in conditional KL.
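
For orientation, a reverse kernel of the kind the abstract describes (a finite Gaussian mixture whose mixing weights come from network logits) can be written as follows. The symbols $K$, $\ell_k$, $\mu_k$, and $\Sigma_k$ are notation assumed here for illustration; the abstract does not say how the component means and covariances are parameterized.

$$
p_\theta(x_{t-1} \mid x_t, c) = \sum_{k=1}^{K} \pi_k(x_t, c)\, \mathcal{N}\!\left(x_{t-1};\, \mu_k, \Sigma_k\right),
\qquad
\pi_k(x_t, c) = \frac{\exp \ell_k(x_t, c)}{\sum_{j=1}^{K} \exp \ell_j(x_t, c)}
$$

Here $c$ is the conditioning context and $\ell_k$ are the logits produced by a ReLU network, so the softmax makes the mixture weights nonnegative and sum to one.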

Keywords: conditional diffusion models, Gaussian-mixture reverse kernels, ReLU-network logits, KL divergence, universality, neural reverse-kernel, density approximation, theoretical analysis

280. ❌ Adaptive Unknown Fault Detection and Few-Shot Continual Learning for Condition Monitoring in Ultrasonic Metal Welding

Authors: Ahmadreza Eslaminia, Kuan-Chieh Lu, Klara Nahrstedt, Chenhui Shao Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13465v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

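Scoring rationale: The paper focuses on adaptive condition monitoring in ultrasonic metal welding (UMW) and proposes a method for unknown fault detection and few-shot continual learning. Its core techniques are hidden-layer representations of a multilayer perceptron (MLP), a statistical thresholding strategy, a cosine similarity transformation with a clustering algorithm, and a continual learning procedure that selectively updates the final layers of the network. All keywords except the last one specifically target large language models (LLMs) and related techniques (MoE, scaling laws, training methods, inference optimization, agents, tool use, and so on). The paper does not involve LLMs, foundation models, or any natural language processing technique; it is an application of conventional machine learning/deep learning to industrial manufacturing (specifically, welding process monitoring). Therefore 'AI for Science OR Bioinformatics OR Cheminformatics' receives 5 points (the paper applies AI to manufacturing engineering, which can be broadly viewed as a subfield of AI for Science, although it is neither bioinformatics nor cheminformatics), and all other keywords receive 0. The paper's novelty lies in adaptive learning for industrial process monitoring, not in large-model techniques.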

!!! tip deepseek-chat TL;DR

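The paper proposes an adaptive condition monitoring method for ultrasonic metal welding that, through unknown fault detection and few-shot continual learning, detects unknown faults with 96% accuracy and reaches 98% classification accuracy after learning a new fault type from only five labeled samples.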

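Abstract translation

Ultrasonic metal welding (UMW) is widely used in industry, but it is sensitive to tool wear, surface contamination, and material variability, which can cause unexpected process faults and substandard weld quality. Conventional monitoring systems usually rely on supervised learning models that assume all fault types are known in advance, which limits their ability to handle previously unseen process faults. To address this challenge, this paper proposes an adaptive condition monitoring method that enables unknown fault detection and few-shot continual learning in UMW. Unknown faults are detected by analyzing the hidden-layer representations of a multilayer perceptron and applying a statistical thresholding strategy. Once an unknown fault is detected, samples from the unknown fault type are integrated into the existing model through a continual learning procedure that selectively updates only the last few layers of the network, so the model can recognize new fault types while retaining its knowledge of existing classes. To speed up labeling, a cosine similarity transformation combined with a clustering algorithm groups similar unknown samples, reducing manual labeling effort. Experiments on a multi-sensor UMW dataset show that the method detects unknown fault conditions with 96% accuracy while maintaining reliable classification of known classes. After a new fault type is incorporated using only five labeled samples, the updated model achieves 98% test classification accuracy. These results indicate that the method enables adaptive monitoring with minimal retraining cost and time. It offers a scalable solution for continual learning in condition monitoring, suits settings where new process conditions may keep emerging over time, and can be extended to other manufacturing processes.

Abstract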

Ultrasonic metal welding (UMW) is widely used in industrial applications but is sensitive to tool wear, surface contamination, and material variability, which can lead to unexpected process faults and unsatisfactory weld quality. Conventional monitoring systems typically rely on supervised learning models that assume all fault types are known in advance, limiting their ability to handle previously unseen process faults. To address this challenge, this paper proposes an adaptive condition monitoring approach that enables unknown fault detection and few-shot continual learning for UMW. Unknown faults are detected by analyzing hidden-layer representations of a multilayer perceptron and leveraging a statistical thresholding strategy. Once detected, the samples from unknown fault types are incorporated into the existing model through a continual learning procedure that selectively updates only the final layers of the network, which enables the model to recognize new fault types while preserving knowledge of existing classes. To accelerate the labeling process, cosine similarity transformation combined with a clustering algorithm groups similar unknown samples, thereby reducing manual labeling effort. Experimental results using a multi-sensor UMW dataset demonstrate that the proposed method achieves 96% accuracy in detecting unseen fault conditions while maintaining reliable classification of known classes. After incorporating a new fault type using only five labeled samples, the updated model achieves 98% testing classification accuracy. These results demonstrate that the proposed approach enables adaptive monitoring with minimal retraining cost and time. The proposed approach provides a scalable solution for continual learning in condition monitoring where new process conditions may constantly emerge over time and is extensible to other manufacturing processes.
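
The detection step (hidden-layer representations plus a statistical threshold) can be sketched roughly as below. This is one common realization rather than the authors' procedure: the abstract does not name the statistic, so the nearest-centroid distance, the `mean + k * std` rule, the value `k = 3.0`, and the toy 2-D features are all assumptions made for illustration.

```python
import math
import statistics


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_threshold(features_by_class, k=3.0):
    """Return class centroids and a mean + k * std threshold on training distances."""
    centroids = {c: centroid(vs) for c, vs in features_by_class.items()}
    dists = [
        distance(v, centroids[c])
        for c, vs in features_by_class.items()
        for v in vs
    ]
    threshold = statistics.mean(dists) + k * statistics.pstdev(dists)
    return centroids, threshold


def is_unknown(feature, centroids, threshold):
    """Flag a sample whose nearest known-class centroid is farther than the threshold."""
    nearest = min(distance(feature, c) for c in centroids.values())
    return nearest > threshold


# Hypothetical 2-D hidden-layer features for two known weld conditions.
known = {
    "normal": [[0.0, 0.2], [0.3, 0.0], [0.0, -0.1], [-0.2, 0.0]],
    "tool_wear": [[5.0, 5.2], [5.3, 5.0], [5.0, 4.9], [4.8, 5.0]],
}
centroids, threshold = fit_threshold(known)

print(is_unknown([0.05, 0.05], centroids, threshold))  # sample close to "normal"
print(is_unknown([2.5, -3.0], centroids, threshold))   # sample far from both classes
```

In a real pipeline the feature vectors would come from a hidden layer of the trained multilayer perceptron, and samples flagged as unknown would then go to the clustering and labeling stage.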

Keywords: Ultrasonic Metal Welding, Condition Monitoring, Unknown Fault Detection, Few-Shot Learning, Continual Learning, Adaptive Monitoring, Multi-sensor Data, Manufacturing Process

281. ❌ Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

Authors: Aadyot Bhatnagar, Peter Mørch Groth, Ali Madani Journal/Source: arxiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13175v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

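Scoring rationale: The paper's core topic is multi-objective alignment of large language models (LLMs) in protein engineering. It directly involves the keywords covering LLMs, alignment, an extension of DPO, post-training, and AI for science, which are scored 10; other keywords such as MoE, quantization, and inference acceleration are not covered in the paper and are scored 0.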

!!! tip deepseek-chat TL;DR

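The paper proposes STOMP, a new offline reinforcement learning algorithm that uses smooth Tchebysheff scalarization to address the limitations of multi-objective alignment for large language models, and validates on protein engineering tasks that it outperforms existing methods.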

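Abstract translation

Large language models can be aligned with human preferences through offline reinforcement learning on small labeled datasets. Although single-objective alignment is well studied, many practical applications require optimizing several conflicting reward objectives at once, for example balancing catalytic activity and specificity in protein engineering, or helpfulness and harmlessness in chatbots. Prior work has relied mainly on linear reward scalarization, which has been proven unable to recover non-convex regions of the Pareto front. Rather than scalarizing the rewards directly, this paper frames multi-objective reinforcement learning itself as an optimization problem that can be scalarized with smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. Based on this formulation, we propose Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline reinforcement learning algorithm. By standardizing the individual rewards according to their observed distributions, the algorithm extends direct preference optimization to the multi-objective setting in a principled way. We validate STOMP empirically on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory protein fitness datasets. Compared with state-of-the-art baselines, STOMP achieves the highest hypervolume in eight of nine settings according to both offline off-policy and generative evaluations. This shows that STOMP is a powerful and robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.

Abstract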

Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.
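
The contrast between linear and smooth Tchebysheff scalarization can be seen in a toy example. The sketch below uses the generic smooth Tchebysheff form on losses to be minimized, `mu * log(sum_i exp(w_i * (f_i - z_i) / mu))`; it is not the STOMP objective, which works with standardized rewards in a preference-optimization setting that the abstract does not spell out. The candidate points, weights, ideal point, and `mu` are hypothetical.

```python
import math


def linear_scalarization(losses, weights):
    """Weighted sum of the objectives."""
    return sum(w * f for f, w in zip(losses, weights))


def smooth_tchebysheff(losses, weights, ideal, mu=0.05):
    """mu * log(sum_i exp(w_i * (f_i - z_i) / mu)), a smooth surrogate of max_i."""
    terms = [w * (f - z) / mu for f, w, z in zip(losses, weights, ideal)]
    m = max(terms)  # shift by the max for a numerically stable log-sum-exp
    return mu * (m + math.log(sum(math.exp(t - m) for t in terms)))


# Hypothetical candidates with two losses each (lower is better). The third
# point sits in a non-convex part of the front: no nonnegative weighted sum
# ranks it first, because one of the endpoints always scores lower.
candidates = [(0.0, 1.0), (1.0, 0.0), (0.6, 0.6)]
weights = (0.5, 0.5)
ideal = (0.0, 0.0)  # assumed ideal point: best achievable value per objective

best_linear = min(candidates, key=lambda p: linear_scalarization(p, weights))
best_smooth = min(candidates, key=lambda p: smooth_tchebysheff(p, weights, ideal))

print("linear picks:", best_linear)
print("smooth Tchebysheff picks:", best_smooth)
```

With equal weights the weighted sum selects an extreme trade-off, while the smooth Tchebysheff value is lowest at the balanced point. As `mu` shrinks, the smooth form approaches the classical max-based Tchebysheff scalarization.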

Keywords: offline reinforcement learning, multi-objective alignment, large language models, protein engineering, direct preference optimization, Tchebysheff scalarization, autoregressive protein language models, Pareto front

282. ❌ Configuration interaction extension of AGP for incorporating inter-geminal correlations

Authors: Airi Kawasaki, Fei Gao, Gustavo E. Scuseria Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.14115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

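Scoring rationale: The paper focuses on a wave function method in quantum chemistry (AGP-CI) and belongs to computational chemistry; it has no direct connection to any of the keywords about large models, deep learning, or AI techniques. The only possibly relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves computational simulation of molecules (H2O, N2) and falls within computational chemistry/cheminformatics. However, the paper uses conventional quantum chemistry methods rather than AI or machine learning, so it receives 5 points (some connection, but not central).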

!!! tip deepseek-chat TL;DR

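The study proposes a wave function method based on antisymmetrized geminal power configuration interaction (AGP-CI), which extends the AGP framework with a configuration interaction expansion to incorporate inter-geminal correlations, and validates its high accuracy and its advantage over the linear-combination-of-AGPs method on the Hubbard model and the H2O and N2 molecules.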

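Abstract translation

This paper proposes a class of antisymmetrized geminal power configuration interaction (AGP-CI) wave functions, which incorporate inter-geminal correlations by introducing a configuration interaction expansion into the AGP framework. To make these wave functions computationally tractable, we evaluate them by rewriting the AGP-CI ansatz as a linear combination of AGPs (LC-AGP), so that overlaps and Hamiltonian matrix elements can be computed with standard AGP methods. Inspired by border-rank decompositions, we further reorganize the ansatz into a compact linear combination of AGPs that depends on a small deformation parameter $\tau$, which controls how closely the truncated expansion approximates the full AGP-CI state. Benchmarks on the Hubbard model and on the small molecules H$_2$O and N$_2$ show that the proposed wave functions consistently achieve high accuracy and outperform the LC-AGP method, especially for systems with more electrons and in strongly correlated regimes.

Abstract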

In this paper, we develop a class of antisymmetrized geminal power configuration interaction (AGP-CI) wave functions that extend the AGP framework by incorporating inter-geminal correlations through a CI expansion. To make these wave functions computationally tractable, we evaluate them by rewriting the AGP-CI ansatz as a linear combination of AGPs (LC-AGP), for which overlaps and Hamiltonian matrix elements can be computed with standard AGP machinery. Motivated by border-rank decompositions, we further reorganize this ansatz into a compact linear combination of AGPs depending on a small deformation parameter $\tau$, which controls how closely the truncated expansion approximates the full AGP-CI state. Benchmark applications to the Hubbard model and to the small molecules H$_2$O and N$_2$ demonstrate that the proposed wave functions achieve consistently high accuracy and outperform the LC-AGP, particularly for systems with more electrons and in strongly correlated regimes.

Keywords: AGP-CI, antisymmetrized geminal power, configuration interaction, inter-geminal correlations, Hubbard model, wave function, quantum chemistry, strongly correlated systems

283. ❌ Critical point search and linear response theory for computing electronic excitation energies of molecular systems. Part II. CASSCF

Authors: Laura Grazioli, Yukuan Hu, Tommaso Nottoli, Filippo Lipparini, Eric Cancès Journal/Source: arxiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

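Scoring rationale: The paper focuses on the CASSCF method in quantum chemistry for computing electronic excitation energies of molecular systems; it is theoretical and algorithmic research in computational chemistry. All keywords concern large models, deep learning, or the principles and applications of AI techniques, none of which the paper addresses. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper belongs to scientific computing, but it relies on traditional quantum chemistry theory rather than AI or machine learning methods, so it receives 5 points (some relevance). All other keywords are entirely unrelated to the paper's content and receive 0 points.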

!!! tip deepseek-chat TL;DR

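The paper extends the Kähler manifold formalism to CASSCF theory, establishes a geometric connection from the time-dependent CASSCF equations to state-specific and linear response methods for excited states, and develops a robust state-specific method that relies only on first-order derivatives, which it applies to molecular systems including water, formaldehyde, and ethylene.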

Abstract

The computation of excited states within the Complete Active Space Self-Consistent Field (CASSCF) framework remains a significant challenge in quantum chemistry, both theoretically and algorithmically. In this work, we extend the Kähler manifold formalism introduced in Part I of this series to the CASSCF theory, and draw a geometrical connection from the time-dependent CASSCF equations to state-specific and linear response methodologies for excited states. This is achieved by first investigating the underlying CASSCF manifold and identifying its Kähler structure, which is complicated by the nontrivial coupling of CI and orbital degrees of freedom. Building on these theoretical findings, we derive the CASSCF linear response equations in a straightforward manner, and develop a robust state-specific method that relies solely on first-order derivatives of the CASSCF energy functional. Numerical results on representative molecular systems (water, formaldehyde, and ethylene) demonstrate the effectiveness of the proposed state-specific method, while revealing the difficulty of reliable identification of excited states due to nonlinearity induced by the CASSCF theory.

Keywords: CASSCF, excited states, Kähler manifold, linear response theory, state-specific method, quantum chemistry, electronic excitation energies, molecular systems

284. ❌ Scalable framework for quantum transport across large physical networks

Authors: Adam Burgess, Nicholas Werren, Erik M. Gauger Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13704v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

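Scoring rationale: The paper focuses on computational modelling methods in quantum physics (the variational polaron framework and a partitioning scheme) for simulating quantum energy transport systems. All scoring keywords concern large language models (LLMs) and deep learning techniques, which the paper does not address at all, so every keyword receives a relevance of 0.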

!!! tip deepseek-chat TL;DR

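The paper proposes an efficient partitioning scheme that resolves the scalability problem of the variational polaron framework when simulating large quantum energy transport systems, enabling it to handle systems with hundreds to thousands of sites.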

Abstract

Accurately modelling many-body quantum transport systems poses a challenge both conceptually and computationally due to the growth of the Hilbert space and the multi-scale nature of the geometries and couplings present in most naturally occurring networks. A compounding complexity of such systems is that the environment typically plays a key role in the transport dynamics. Utilising variational unitary transformations that displace environmental degrees of freedom allows for the deployment of a second-order master equation capable of capturing the dynamics of intermediate and strongly coupled systems, which are ubiquitous in microscopic energy transport systems. However, direct implementations of this approach suffer from fundamental scalability issues due to the complexity of the self-consistent equations required to solve for the variational parameters. Here, we present an efficient partitioning scheme that leverages the inherent multi-scale nature of natural energy transport networks. This enables scaling of the variational polaron framework to quantum energy transport systems comprising hundreds to thousands of sites. Our work unlocks the physically motivated exploration of large transport networks, for example, those present within light-harvesting complexes and exciton transport in disordered semiconductors.

Keywords: quantum transport, variational polaron framework, scalability, partitioning scheme, many-body systems, energy transport networks, master equation, light-harvesting complexes

285. ❌ Ion-Specific Anomalous Water Diffusion in Aqueous Electrolytes: A Machine-Learned Many-Body Force Field Study with MACE

Authors: Massimo Ciacchi, Ilnur Saitov, Nico Di Fonte, Isabella Daidone, Carlo Pierleoni Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

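Scoring rationale: The paper studies the ion-specific anomaly of water diffusion in electrolyte solutions, running molecular dynamics simulations with a machine-learned force field based on the MACE equivariant graph neural network. Its core is the application of machine learning to molecular simulation and chemical physics, which falls under "AI for Science" and is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10 points). However, the paper does not involve large language models (LLMs), model training techniques (such as MoE, Scaling Laws, Pre-training), inference optimization (such as Quantization, Speculative Decoding), alignment techniques (such as RLHF, Alignment), agents (LLM Agents), or any other large-model technology, so all other keywords receive 0 points.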

!!! tip deepseek-chat TL;DR

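The paper uses a machine-learned force field (MACE framework) to study the ion-specific anomaly of water diffusion in electrolyte solutions, finds that water diffusion is suppressed in NaCl solutions and enhanced in CsI solutions, and explains the effect through an analysis of the microscopic mechanism.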

Abstract

The dynamics of water in electrolyte solutions exhibits a striking, ion-specific anomaly: the diffusion coefficient of water is enhanced relative to the neat liquid in chaotropic CsI solutions, yet suppressed in kosmotropic NaCl solutions. This phenomenon, long challenging for classical force-field-based molecular dynamics, is studied here using classical molecular dynamics simulations with a many-body machine-learned force field (MLFF) trained within the MACE equivariant graph neural network framework. The force field is trained on energies, forces, and stresses computed at the density functional theory level with the revPBE-D3 exchange–correlation functional, which provides a reliable balance between accuracy and computational efficiency for aqueous systems. Simulations of NaCl and CsI aqueous solutions at ambient conditions over a concentration range of 0.89–3.56 mol/kg reproduce the experimentally observed anomalous diffusion and show a quantitative improvement over previous results obtained with the DeePMD framework, trained on the same theory, particularly for NaCl solutions. This improvement is traced to a stronger Na$^{+}$–water interaction in the first hydration shell and the non-negligible retarding contribution of the second hydration shell of Na$^{+}$. For CsI solutions, the water acceleration is shown to be primarily driven by the anion I$^{-}$, whose diffuse and weakly structured hydration shell facilitates rapid water exchange with the bulk. These results are rationalised through a shell-decomposition analysis of time-dependent water diffusivities and ion–oxygen potentials of mean force, providing a coherent microscopic picture of the acceleration–retardation mechanism in the studied aqueous electrolytes.

Keywords: machine-learned force field, MACE, molecular dynamics, aqueous electrolytes, water diffusion, ion-specific anomaly, hydration shells, density functional theory
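The abstract compares water diffusion coefficients but does not say how they are extracted from the trajectories. A common route, shown here only as a minimal sketch and not as the authors' procedure, is the Einstein relation $D = \mathrm{MSD}(t) / (2 d\, t)$ in the long-time limit, with $d = 3$ spatial dimensions. The function name `diffusion_coefficient` and the synthetic data are hypothetical.

```python
def diffusion_coefficient(times, msd, dim=3):
    """Estimate D from the least-squares slope of MSD versus time.

    Einstein relation: MSD(t) ~ 2 * dim * D * t at long times,
    so D = slope / (2 * dim). Units follow the inputs.
    """
    n = len(times)
    t_mean = sum(times) / n
    m_mean = sum(msd) / n
    num = sum((t - t_mean) * (m - m_mean) for t, m in zip(times, msd))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den / (2 * dim)


if __name__ == "__main__":
    # Synthetic MSD with a known D = 0.23 and a constant short-time offset.
    true_d = 0.23
    times = [1.0, 2.0, 3.0, 4.0, 5.0]
    msd = [6 * true_d * t + 0.5 for t in times]
    print(diffusion_coefficient(times, msd))
```

Fitting the slope rather than dividing a single MSD value by time keeps the constant short-time offset out of the estimate.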

286. ❌ Free energy differences and coexistence of clathrate structures II and H via lattice-switch Monte Carlo

Authors: Olivia S. Moro, Nigel B. Wilding, Vincent Ballenegger Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13249v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

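Scoring rationale: The paper is a molecular simulation study of gas hydrates (clathrate hydrates), specifically the use of Monte Carlo methods to compute free energy differences and phase coexistence parameters. Its content lies entirely within computational chemistry and statistical physics and focuses on developing and applying simulation methods for physical systems. All scoring keywords concern large language models, deep learning, and AI techniques, whereas this paper studies molecular simulation methods for physicochemical systems; the two fields are entirely different, with no overlap in techniques, methods, or concepts. Every keyword therefore receives a relevance of 0.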

!!! tip deepseek-chat TL;DR

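The paper develops a simulation method based on lattice-switch Monte Carlo to compute free energy differences between gas hydrate structures of different stoichiometry and to determine their phase coexistence parameters; applied to argon and methane systems, it yields coexistence pressures in good agreement with experimental data.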

Abstract

We introduce a simulation technique to compute the free energy difference between two hydrate structures of different stoichiometry connected to a reservoir of gas molecules at a prescribed pressure. The method permits the determination of coexistence parameters for the system when the two hydrate structures have the same number of water molecules $N_w$. The approach is based on performing isobaric Lattice Switch Monte Carlo simulations to measure free energy differences between the hydrate structures when they are either fully occupied by gas molecules, or fully empty. This measurement is combined with thermodynamic integration within an ensemble in which the number of guest molecules $N_g$ can fluctuate under the control of a chemical potential $\mu_g$. We analyze the properties of the resulting constant-$N_w,\mu_g,P,T$ ensemble and show how it can be used to calculate coexistence points via a thermodynamic cycle. Applying the method to argon and methane structures, we find coexistence pressures that are in good agreement overall with the available experimental data.

Keywords: free energy difference, clathrate hydrates, lattice-switch Monte Carlo, coexistence parameters, thermodynamic integration, argon, methane, gas hydrates
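The thermodynamic cycle mentioned in the abstract can be sketched with standard statistical mechanics. This is one plausible reading, not the authors' exact formulation. The symbols $\Phi_X$, $G_X^{\mathrm{empty}}$, and the lower integration limit are notation introduced here for illustration, with $X$ labelling the hydrate structure (II or H). In a constant-$N_w,\mu_g,P,T$ ensemble the thermodynamic potential obeys $\partial \Phi_X / \partial \mu_g = -\langle N_g \rangle_X$, and the structure is empty as $\mu_g \to -\infty$:

```latex
\Phi_X(N_w,\mu_g,P,T)
  = G_X^{\mathrm{empty}}(N_w,P,T)
  - \int_{-\infty}^{\mu_g} \langle N_g \rangle_X(\mu')\,\mathrm{d}\mu',
\qquad X \in \{\mathrm{II},\mathrm{H}\}

\Delta\Phi(\mu_g)
  = \Delta G^{\mathrm{empty}}
  - \int_{-\infty}^{\mu_g}
    \left[ \langle N_g \rangle_{\mathrm{H}}(\mu') - \langle N_g \rangle_{\mathrm{II}}(\mu') \right]
    \mathrm{d}\mu'
```

Under these assumptions $\Phi_X = \mu_w N_w$, so for equal $N_w$ the condition $\Delta\Phi = 0$ is equivalent to equal water chemical potential in the two structures. That is consistent with the abstract's requirement that both structures have the same number of water molecules. A lattice-switch measurement would supply the difference at one end of the cycle (empty or fully occupied), and the integral would carry it to the gas chemical potential of interest.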

287. ❌ Excited-State Quantum Chemistry on Qumode-Based Processors via Variational Quantum Deflation

Authors: Marlon F. Jost, Sijia S. Dong Journal/Source: arXiv Published: 2026-04-15 arXiv link: http://arxiv.org/abs/2604.13457v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

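Scoring rationale: The paper focuses on applying quantum computing to quantum chemistry, in particular a qumode-based variational quantum algorithm for computing electronic and vibrational excited-state energies. Its content is unrelated to deep learning or large-model technology. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection to the paper's scientific computing application, but the paper uses quantum computing methods rather than AI methods, so the relevance is low. The other keywords concern large models, deep learning, training techniques, inference optimization, agent systems, and so on, and have no direct relation to the paper's quantum chemistry topic.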

!!! tip deepseek-chat TL;DR

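The paper proposes a qumode-based variational quantum deflation framework (QumVQD) for computing molecular electronic and vibrational excited-state energies on bosonic quantum processors, achieving lower computational overhead and stronger error resilience than qubit-based algorithms.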

Abstract

Variational quantum algorithms on bosonic quantum processors are an emerging paradigm for quantum chemistry calculations, exploiting the natural alignment between molecular structure and harmonic oscillator-based hardware. We introduce the qumode-based variational quantum deflation framework (QumVQD) for finding both electronic and vibrational excited state energies on qumode-based architectures. For electronic structure, we incorporated particle number conservation constraints via Fock basis Hamming weight filtering. This symmetry enforcement achieves a significant reduction in computational overhead, scaling the Hilbert space dimension as $O\left(\binom{M}{n_e}\right)$ rather than $O(2^M)$ for $M$ spin orbitals and $n_e$ electrons. We validate the approach through electronic structure calculations on H$_2$, achieving agreement with full configuration interaction (FCI) using the STO-3G basis within chemical accuracy across potential energy surfaces. Extending to vibrational structure, we combine QumVQD with Hamiltonian fragmentation based on Bogoliubov transforms, computing CO$_2$ and H$_2$S vibrational eigenstates to spectroscopic accuracy with entangling gate counts 1–2 orders of magnitude lower than analogous qubit-based algorithms. We performed noise characterization using amplitude-damping models and gate-fidelity analysis, which demonstrates enhanced error resilience due to reduced circuit depth compared to qubit-based algorithms. Together, these results highlight the potential of bosonic quantum devices for advancing computational chemistry, particularly in areas where qubit-based devices struggle.

Keywords: variational quantum algorithms, quantum chemistry, excited-state energies, qumode-based processors, bosonic quantum devices, vibrational structure, Hamiltonian fragmentation, error resilience
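To make the dimension reduction in the abstract concrete, the short sketch below compares the full Fock-space dimension $2^M$ with the fixed-particle-number dimension $\binom{M}{n_e}$. The function name `hilbert_space_dims` is hypothetical. The first case corresponds to H$_2$ in the STO-3G basis (4 spin orbitals, 2 electrons); the larger cases are illustrative sizes and are not taken from the paper.

```python
from math import comb


def hilbert_space_dims(num_spin_orbitals, num_electrons):
    """Return (full Fock-space dimension, fixed-particle-number dimension)."""
    full = 2 ** num_spin_orbitals
    restricted = comb(num_spin_orbitals, num_electrons)
    return full, restricted


if __name__ == "__main__":
    # (M, n_e) pairs: H2/STO-3G first, then two illustrative half-filled cases.
    for m, n_e in [(4, 2), (12, 6), (20, 10)]:
        full, restricted = hilbert_space_dims(m, n_e)
        print(f"M={m:2d} n_e={n_e:2d}  2^M={full:8d}  C(M,n_e)={restricted:7d}")
```

The ratio between the two dimensions grows with system size, which is why restricting to states of the correct Hamming weight reduces overhead.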

288. ❌ Rare Event Analysis via Stochastic Optimal Control

Authors: Yuanqi Du, Jiajun He, Dinghuai Zhang, Eric Vanden-Eijnden, Carles Domingo-Enrich Journal/Source: arXiv Published: 2026-04-14 arXiv link: http://arxiv.org/abs/2604.13213v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

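Scoring rationale: The paper focuses on the computational analysis of rare events in physical systems (such as biomolecular conformational changes, phase transitions, and chemical reactions) and proposes a framework based on stochastic optimal control (SOC) to estimate the committor function and efficiently sample reactive paths. Of the 27 keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" has some connection to the paper (5 points), because the paper involves simulating biomolecules and chemical reactions, which falls under applying AI to science, although it does not explicitly use large models or deep learning. The other 26 keywords concern large models, deep learning principles, or specific methods (such as MoE, RLHF, RAG), whereas the paper's core is stochastic optimal control, transition path theory, and physical-system simulation, which are unrelated to these techniques, so they receive 0 points. The weighted total is computed as 5.0 (AI for Science relevance) × 1.0 (weight) = 5.0 points, far below the dynamic passing score of 26.6 points, indicating that the paper is a poor match for the large-model and deep-learning topics under review.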
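The arithmetic in this rationale can be reproduced with a minimal sketch. It assumes the score is a sum of weight × relevance per keyword, capped at the report's configured maximum of 15 per keyword, and that the passing score follows the report's configured formula 5.0 + 0.8 × total weight with 27 keywords of weight 1.0. The function names are hypothetical, and the report does not spell out the exact aggregation rule. With the weight of 1.0 quoted in the rationale the total is 5.0; with the weight of 0.0 listed in the score table it is 0.0.

```python
def weighted_score(relevance, weights, max_per_keyword=15.0):
    """Sum weight * relevance over keywords, capping each keyword's contribution."""
    total = 0.0
    for keyword, rel in relevance.items():
        total += min(weights.get(keyword, 0.0) * rel, max_per_keyword)
    return total


def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing threshold: base + factor * total keyword weight."""
    return base + factor * total_weight


if __name__ == "__main__":
    keyword = "AI for Science OR Bioinformatics OR Cheminformatics"
    relevance = {keyword: 5.0}
    print(weighted_score(relevance, {keyword: 1.0}))  # weight quoted in the rationale
    print(weighted_score(relevance, {keyword: 0.0}))  # weight listed in the score table
    print(round(passing_score(27 * 1.0), 1))          # threshold for 27 unit-weight keywords
```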

!!! tip deepseek-chat TL;DR

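The paper proposes a framework based on stochastic optimal control for estimating the committor function of rare events in physical systems and efficiently sampling reactive paths; on benchmark systems it obtains more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.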

Abstract

Rare events such as conformational changes in biomolecules, phase transitions, and chemical reactions are central to the behavior of many physical systems, yet they are extremely difficult to study computationally because unbiased simulations seldom produce them. Transition Path Theory (TPT) provides a rigorous statistical framework for analyzing such events: it characterizes the ensemble of reactive trajectories between two designated metastable states (reactant and product), and its central object, the committor function, which gives the probability that the system will next reach the product rather than the reactant, encodes all essential kinetic and thermodynamic information. We introduce a framework that casts committor estimation as a stochastic optimal control (SOC) problem. In this formulation the committor defines a feedback control, proportional to the gradient of its logarithm, that actively steers trajectories toward the reactive region, thereby enabling efficient sampling of reactive paths. To solve the resulting hitting-time control problem we develop two complementary objectives: a direct backpropagation loss and a principled off-policy Value Matching loss, for which we establish first-order optimality guarantees. We further address metastability, which can trap controlled trajectories in intermediate basins, by introducing an alternative sampling process that preserves the reactive current while lowering effective energy barriers. On benchmark systems, the framework yields markedly more accurate committor estimates, reaction rates, and equilibrium constants than existing methods.

Keywords: rare events, stochastic optimal control, committor estimation, transition path theory, reactive trajectories, biomolecular conformational changes, chemical reactions, sampling efficiency
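The committor and the gradient of its logarithm can be illustrated in a case where both are known in closed form. The sketch below is not the paper's SOC method. It assumes one-dimensional overdamped Langevin dynamics in a hypothetical double-well potential $V(x) = (x^2 - 1)^2$, with reactant state at $x \le -1$ and product state at $x \ge 1$. For this setting the committor is $q(x) = \int_{-1}^{x} e^{\beta V(y)}\,\mathrm{d}y \big/ \int_{-1}^{1} e^{\beta V(y)}\,\mathrm{d}y$. The proportionality constant between the control and $\nabla \log q$ depends on the noise convention and is left out.

```python
import math


def potential(x):
    """Hypothetical double-well potential with minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def committor_1d(beta, a=-1.0, b=1.0, n=2000):
    """Committor q(x) on [a, b] for 1D overdamped Langevin dynamics.

    q(x) = int_a^x exp(beta V) dy / int_a^b exp(beta V) dy (trapezoid rule).
    Returns (xs, q, dq), where dq is the derivative q'(x) on the same grid.
    """
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    w = [math.exp(beta * potential(x)) for x in xs]
    cum = [0.0]
    for i in range(1, n + 1):
        cum.append(cum[-1] + 0.5 * h * (w[i - 1] + w[i]))
    total = cum[-1]
    q = [c / total for c in cum]
    dq = [wi / total for wi in w]
    return xs, q, dq


def grad_log_committor(q, dq, i):
    """d/dx log q at grid index i; it diverges at the reactant boundary where q = 0."""
    return dq[i] / q[i]


if __name__ == "__main__":
    xs, q, dq = committor_1d(beta=3.0)
    mid = len(xs) // 2
    print(xs[mid], q[mid], grad_log_committor(q, dq, mid))
```

The gradient of $\log q$ is positive everywhere in the transition region and grows without bound near the reactant boundary, so a control proportional to it pushes trajectories away from the reactant and toward the product.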

Token usage statistics

Total: 902,316 tokens (input 618,280 / output 284,036)

Model	Input	Output	Total
deepseek-chat	513,442	284,036	797,478
glm-4.7	104,838	0	104,838