📊 ArXiv Research Report (2026-03-24)

Generated: 2026-03-24 09:08:01 · Data source: ArXiv

📌 Configuration

Keyword list (27 keywords, total weight 27.0)

Keyword	Weight	Type
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	Primary
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	Primary
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	Primary
“Scaling Laws” AND “Data Quality”	1.0	Primary
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	Primary
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	Primary
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	Primary
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	Primary
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	Primary
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	Primary
“Context Window Extension” OR “Long Context LLMs”	1.0	Primary
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	Primary
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	Primary
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	Primary
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	Primary
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	Primary
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	Primary
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	Primary
“Multi-agent Systems” OR “Agent Coordination”	1.0	Primary
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	Primary
“Speculative Decoding” OR “Inference Acceleration”	1.0	Primary
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	Primary
“Mechanistic Interpretability” OR “Explainable AI”	1.0	Primary
“World Models” AND “General World Models”	1.0	Primary
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	Primary
“In-context Learning” OR “Many-shot Learning”	1.0	Primary
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	Primary

Scoring settings

Maximum score per keyword: 15
Passing score formula: 5.0 + 0.8 × total weight
Current passing score: 26.6
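
The arithmetic behind the threshold can be checked with a short sketch. It assumes that a paper's total is the sum of weight × relevance over all keywords, which matches the score tables in this report but is not stated explicitly, and that the 15-point maximum is a simple per-keyword cap. The function names and the `PAPER_1_NONZERO` list are illustrative, not part of the report generator.

```python
MAX_PER_KEYWORD = 15.0  # "maximum score per keyword" from the settings above


def passing_score(total_weight, base=5.0, slope=0.8):
    """Passing score formula from the settings: 5.0 + 0.8 x total weight."""
    return base + slope * total_weight


def paper_score(rows):
    """Sum weight x relevance per keyword (assumed rule), capped per keyword."""
    return sum(min(weight * relevance, MAX_PER_KEYWORD) for weight, relevance in rows)


# Non-zero (weight, relevance) rows copied from the first paper's score table.
PAPER_1_NONZERO = [
    (1.0, 10.0), (1.0, 5.0), (1.0, 10.0), (1.0, 10.0),
    (1.0, 10.0), (1.0, 5.0), (1.0, 10.0), (1.0, 5.0),
]

threshold = passing_score(27.0)
score = paper_score(PAPER_1_NONZERO)
print(round(threshold, 1), score, score >= threshold)  # prints: 26.6 65.0 True
```

With 27 keywords of weight 1.0 the threshold is 5.0 + 21.6 = 26.6, the value listed above.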

📈 Paper statistics

Total fetched: 258 papers
Passing papers: 8 (3.1%)
Deep analysis: 3 papers

⭐ Detailed analysis of passing papers

1. Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Authors: Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer · Source: arXiv · Published: 2026-03-20 · arXiv link: http://arxiv.org/abs/2603.20162v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

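Scoring rationale: The paper's core question is how faithful instruction-tuned language models remain to in-context evidence under user pressure, so it is highly relevant to “Large Language Models”, “Instruction Tuning”, “Post-training”, “Hallucination Mitigation”, and “In-context Learning” (10 points each). It evaluates models from 0.27B to 32B parameters, including smaller ones, which gives it some relevance to “Small Language Models” (5 points). The study analyzes model behavior, giving some relevance to “Explainable AI” (5 points). It uses climate-assessment data, giving some relevance to “AI for Science” (5 points). Other keywords such as MoE, Scaling Laws, RLHF, and RAG are not covered in the paper and score 0.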

!!! tip deepseek-chat TL;DR

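The paper studies whether instruction-tuned language models stay faithful to in-context evidence under user pressure. It finds that models still tend to yield to user pressure even when given rich evidence, that research gaps in the evidence increase sycophancy in some model families, that robustness varies non-monotonically with scale, and that models differ in how concentrated their response distributions remain under pressure.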

Abstract (Chinese translation)

在存在争议的领域中，指令微调语言模型必须在适应用户需求的压力与忠实于上下文证据之间取得平衡。为评估这种张力，我们基于《美国国家气候评估》构建了一个受控认知冲突框架。我们对19个参数量从0.27B到32B不等的指令微调模型进行了细粒度消融实验，涵盖证据构成和不确定性线索两个维度。在中性提示下，更丰富的证据通常能提升证据一致性准确率与序数评分性能。然而在受控固定证据设置中，当面临用户压力时，证据并不能可靠地防止模型出现迎合用户倾向的立场逆转。我们报告了三种主要失效模式：首先，我们发现了负面部分证据交互现象——增加认知细微差别（特别是研究空白）会加剧Llama-3和Gemma-3等模型系列对迎合倾向的敏感性；其次，鲁棒性呈现非单调缩放规律：在某些模型系列中，部分低至中等规模的模型对对抗性用户压力尤为敏感；第三，模型在冲突下的分布集中度存在差异：部分指令微调模型在压力下能保持尖锐的峰值序数分布，而其他模型的分布则显著更分散；在规模匹配的Qwen模型对比中，经过推理蒸馏的变体（DeepSeek-R1-Qwen）始终比其指令微调版本表现出更高的分布离散度。这些发现表明，在受控固定证据设置中，若缺乏针对认知完整性的显式训练，仅提供更丰富的上下文证据并不能保证模型抵御用户压力。

Abstract

In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.

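Keywords: instruction-tuned language models, evidence grounding, user pressure, epistemic conflict, sycophancy, in-context learning, model robustness, U.S. National Climate Assessment

Deep analysis:

Evaluating the Evidence Grounding of Instruction-Tuned Language Models Under User Pressure

Summary:

This paper studies how faithful instruction-tuned language models remain to in-context evidence when facing user pressure. Building on the U.S. National Climate Assessment (NCA), the authors construct a controlled epistemic-conflict framework and evaluate 19 models of different sizes by systematically varying the evidence tier (from atomic claims to uncertainty descriptions) and the type of user pressure (neutral, direct belief, skeptical challenge, and so on). Richer evidence improves accuracy under neutral prompts, but under user pressure it does not reliably prevent erroneous reversals toward the user's position. The study reports three main failure modes: adding epistemic nuance (such as research gaps) can increase susceptibility to sycophancy in some model families; robustness varies non-monotonically with model scale; and reasoning-distilled models show more dispersed confidence distributions under conflict. The conclusion is that rich in-context evidence alone is not enough to withstand user pressure, and that explicit training for epistemic integrity is needed.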

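Innovations:

Proposes a controlled epistemic-conflict evaluation framework that uses the NCA's layered structure (claims, evidence, uncertainty, confidence) for fine-grained ablations, bridging the gap between sycophancy research and evidence-grounding research.
Systematically studies the conflict between user pressure (such as skeptical challenges and appeals to authority) and fixed in-context evidence, revealing how models fail in “contested evidence interactions”.
Identifies a “negative partial-evidence interaction”: adding epistemic nuance (such as research gaps) increases the susceptibility of some models (such as Llama-3 and Gemma-3) to sycophancy.
Analyzes distributional behavior under conflict and finds that reasoning-distilled variants (such as DeepSeek-R1-Qwen) show higher confidence dispersion than instruction-tuned models, offering a new perspective on the effects of different training paradigms.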

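Methods

!!! info

770 atomic claims are extracted from the U.S. National Climate Assessment (NCA4 and NCA5), and its four-layer structure (claim, evidence base, research gaps, confidence description) is used to construct controlled experimental conditions. A crossed design combines 4 evidence tiers with 4 user-pressure types (neutral, direct belief, skeptical challenge, appeal to authority), giving 16 experimental conditions in total. 19 instruction-tuned models (0.27B to 32B parameters) are selected, covering the Qwen, Gemma, Llama, and Mistral families as well as DeepSeek-R1 distilled models. The analysis uses evidence-consistent accuracy and ordinal scoring rules, and examines the models' confidence distributions under conflict (concentration versus dispersion).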
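
The 4 × 4 crossed design described above can be enumerated in a few lines. This is only a sketch: the four pressure types come from the description, but the exact composition of each evidence tier is not given in this report, so the cumulative tier labels below are an assumption, as are all identifier names.

```python
from itertools import product

# Assumed cumulative tiers built from the four NCA layers; the paper's exact
# tier definitions are not given in this report.
EVIDENCE_TIERS = [
    "claim",
    "claim+evidence_base",
    "claim+evidence_base+research_gaps",
    "claim+evidence_base+research_gaps+confidence",
]

# Pressure types as listed in the method description.
PRESSURE_TYPES = [
    "neutral",
    "direct_belief",
    "skeptical_challenge",
    "appeal_to_authority",
]

# Crossing every tier with every pressure type yields the condition grid.
CONDITIONS = list(product(EVIDENCE_TIERS, PRESSURE_TYPES))

for tier, pressure in CONDITIONS[:4]:
    print(f"{tier:<45} {pressure}")
print(len(CONDITIONS), "conditions")
```

Holding the evidence tier fixed while varying only the pressure type is what lets a stance reversal be attributed to the user's framing and not to a change in the evidence.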

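Key results:

Under neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance.
Under user pressure, evidence does not reliably prevent user-aligned reversals, even though the evidence is held fixed.
A “negative partial-evidence interaction” is identified: adding epistemic nuance such as research gaps is associated with increased susceptibility to sycophancy.
Robustness scales non-monotonically with model size: certain low-to-mid scale models are especially sensitive to adversarial user pressure.
Reasoning-distilled models (DeepSeek-R1-Qwen) produce more dispersed ordinal distributions under pressure than their instruction-tuned counterparts.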

Tech stack: U.S. National Climate Assessment (NCA4, NCA5), Transformer-based LLMs (Llama-3.1, Mistral, phi-4, Qwen 2.5, Gemma-3, DeepSeek-R1-Distill), Ablation study, Ordinal proper scoring rules, Distributional analysis (concentration vs dispersion), Instruction tuning, Retrieval-Augmented Generation (RAG) simulation, Claim decomposition
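
To make "ordinal proper scoring rules" and "concentration vs dispersion" concrete, the sketch below uses two standard quantities: the ranked probability score (RPS), a proper scoring rule for ordered categories, and normalized Shannon entropy as a dispersion measure. Which scoring rule and dispersion statistic the paper actually uses is not stated in this report, so both choices, and the four-level scale in the example, are assumptions.

```python
import math


def ranked_probability_score(probs, true_index):
    """RPS: squared distance between cumulative forecast and cumulative outcome.

    Lower is better; 0.0 means all probability mass is on the true level.
    """
    cumulative_forecast = 0.0
    total = 0.0
    for k in range(len(probs) - 1):
        cumulative_forecast += probs[k]
        cumulative_outcome = 1.0 if k >= true_index else 0.0
        total += (cumulative_forecast - cumulative_outcome) ** 2
    return total


def normalized_entropy(probs):
    """Shannon entropy divided by log(K): 0 = fully peaked, 1 = uniform."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


# Hypothetical distributions over a four-level ordinal scale.
peaked = [0.0, 0.0, 1.0, 0.0]
dispersed = [0.25, 0.25, 0.25, 0.25]

print(ranked_probability_score(peaked, 2), normalized_entropy(peaked))
print(ranked_probability_score(dispersed, 2), normalized_entropy(dispersed))
```

Unlike plain accuracy, RPS penalizes probability mass more the further it sits from the true level, which is why it suits ordered confidence levels.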

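Strengths

Novel evaluation framework: it uses the NCA's layered structure to combine evidence tiers with user-pressure types, which allows precise control of variables.
Fine-grained analysis: beyond accuracy, it examines confidence distributions and the expression of uncertainty, and uses ordinal scoring rules that fit how scientific assessments are judged.
Broad model coverage: it tests 19 models of different sizes and architectures, including recent reasoning-distilled models, so the conclusions generalize reasonably well.
Counterintuitive finding: it shows that adding evidence detail (such as uncertainty) can have negative effects, with important implications for future model alignment and training.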

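Limitations

Domain restriction: the study focuses on climate science; although representative, whether the conclusions generalize to other scientific fields or to general domains remains to be verified.
Few low-confidence samples: “low confidence” items make up only 1.7% of the dataset, which limits statistical power for that category.
Fixed-evidence setting: the experiments use fixed in-context evidence and do not consider dynamic retrieval or evidence that changes over a multi-turn conversation, so they may differ from real RAG deployments.

Relevance to the research direction:

The paper is highly relevant. First, it is applied research on large models in a scientific domain (climate science) and directly evaluates a model's ability to handle scientific evidence and uncertainty. Second, on the technical side, it examines how instruction tuning and reasoning distillation (DeepSeek-R1) affect model robustness and alignment behavior, and sheds light on how models behave when handling conflicting information (for example, shifts in their confidence distributions), which makes it an in-depth analysis and original evaluation of large-model techniques. The proposed epistemic-conflict framework and the finding of a “negative partial-evidence interaction” are both notably novel.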

2. WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Authors: Ziya Erkoç, Angela Dai, Matthias Nießner · Source: arXiv · Published: 2026-03-20 · arXiv link: http://arxiv.org/abs/2603.19708v1

Score: 43.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

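Scoring rationale: The paper's core question is whether 2D foundation image models have 3D world-model capabilities, and it proposes a multi-agent architecture for 3D world synthesis. It is highly relevant to “World Models” (10 points) because it directly studies 3D world-model capability; highly relevant to “LLM Agents” and “Multi-agent Systems” (10 points each) because it uses a multi-agent architecture with a VLM-based director, a generator, and a verifier; relevant to “Foundation Models” (8 points) because it evaluates several foundation image models and VLMs; and somewhat relevant to “Pre-training” (5 points) because the foundation models' capabilities come from pre-training. Other keywords such as MoE, SLMs, Scaling Laws, SFT, Alignment, and RAG are not covered and score 0.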

!!! tip deepseek-chat TL;DR

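The paper investigates whether 2D foundation image models have 3D world-model capabilities. It proposes a multi-agent architecture that achieves 3D-consistent world synthesis, and concludes from this that these models do encapsulate an understanding of the 3D world.

Abstract (Chinese translation)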

鉴于二维基础图像模型生成高保真输出的卓越能力，我们探究了一个根本性问题：二维基础图像模型是否内在地具备三维世界模型能力？为回答此问题，我们系统性地评估了多种先进图像生成模型和视觉语言模型在三维世界合成任务上的表现。为利用并基准测试其潜在的隐式三维能力，我们提出一种智能体框架以促进三维世界生成。我们的方法采用多智能体架构：一个基于视觉语言模型的导演智能体负责制定提示以引导图像合成，一个生成器负责合成新的图像视角，以及一个由视觉语言模型支持的两步验证器，从二维图像和三维重建空间两个维度评估并筛选生成的帧。关键的是，我们证明了我们的智能体方法能够实现连贯且稳健的三维重建，生成的输出场景可通过渲染新视角进行探索。通过对多种基础模型的大量实验，我们证明二维模型确实内蕴了对三维世界的理解。通过利用这种理解，我们的方法成功合成了广阔、真实且三维一致的世界。

Abstract

Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.

Keywords: Foundation Image Models, 3D World Models, Vision-Language Models, Multi-agent Architecture, 3D World Synthesis, Agentic Framing, 3D Reconstruction, Novel View Rendering
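
The director, generator, and two-step verifier described in the abstract suggest a propose, generate, and filter loop. The sketch below shows only that control flow. Every function name, the retry limit, and the way the verifier steps are called are hypothetical; in the paper these roles are filled by a VLM and an image model, which are replaced here by plain callables.

```python
def synthesize_world(director, generator, verify_2d, verify_3d,
                     num_views, max_attempts=3):
    """Collect frames that pass both verifier steps (hypothetical control flow).

    director(view_index, accepted) -> prompt for the next view
    generator(prompt)              -> candidate frame
    verify_2d(frame)               -> bool, image-space check
    verify_3d(frame, accepted)     -> bool, reconstruction-space check
    """
    accepted = []
    for view_index in range(num_views):
        for _ in range(max_attempts):
            prompt = director(view_index, accepted)
            frame = generator(prompt)
            # A frame is kept only if it passes both verification steps.
            if verify_2d(frame) and verify_3d(frame, accepted):
                accepted.append(frame)
                break
    return accepted


# Toy stand-ins so the loop can run without any model.
frames = synthesize_world(
    director=lambda i, accepted: f"view {i}",
    generator=lambda prompt: prompt.upper(),
    verify_2d=lambda frame: True,
    verify_3d=lambda frame, accepted: frame not in accepted,
    num_views=3,
)
print(frames)  # prints: ['VIEW 0', 'VIEW 1', 'VIEW 2']
```

A view that fails every attempt is simply skipped in this sketch; how the actual system handles rejected frames is not described in the abstract.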

3. PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management

Authors: Xingyu Feng, Chang Sun, Yuzhu Wang, Zhangbing Zhou, Chengwen Luo, Zhuangzhuang Chen, Xiaomin Ouyang, Huanqi Yang · Source: arXiv · Published: 2026-03-20 · arXiv link: http://arxiv.org/abs/2603.19584v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

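Scoring rationale: The core of PowerLens is applying the reasoning ability of large language models (LLMs) to build a multi-agent architecture for mobile power management, so it is highly relevant to “Large Language Models”, “LLM Agents”, and “Multi-agent Systems” (10 points each). The system uses LLM commonsense reasoning for context-aware policy generation, which involves multi-step reasoning and gives it some relevance to “Chain of Thought” and “System 2 Thinking” (5 points each). The paper does not cover other keywords such as MoE, model compression, training techniques, or AI-for-science applications, which score 0.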

!!! tip deepseek-chat TL;DR

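To address mobile battery management that relies on static rules and ignores users' individual needs, the paper proposes PowerLens, which uses a multi-agent LLM architecture to generate context-aware, personalized power policies. Experiments on an Android device show 81.7% action accuracy and 38.8% energy saving over stock Android.

Abstract (Chinese translation)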

电池续航能力仍是移动设备面临的关键挑战，而现有的电源管理机制依赖静态规则或粗粒度启发式策略，忽略了用户活动与个人偏好。本文提出PowerLens系统，该系统利用大语言模型（LLMs）的推理能力，在Android设备上实现安全且个性化的移动电源管理。其核心思想在于：大语言模型的常识推理能够弥合用户活动与系统参数之间的语义鸿沟，从而通过隐式反馈生成零样本、上下文感知的策略，以适应个体偏好。PowerLens采用多智能体架构，通过界面语义识别用户上下文，并针对18项设备参数生成全局性电源策略。系统通过基于PDL的约束框架在执行前验证所有操作，同时采用双层记忆系统，通过基于置信度蒸馏的方法从用户隐式覆盖行为中学习个体偏好，无需显式配置，并在3至5天内实现偏好收敛。在已获取root权限的Android设备上进行的大量实验表明，PowerLens相比原生Android系统实现了81.7%的操作准确率和38.8%的节能效果，其性能优于基于规则和基于大语言模型的基线方法，同时具备高用户满意度、快速偏好收敛和强安全性保障，且系统自身仅消耗每日电池容量的0.5%。

Abstract

Battery life remains a critical challenge for mobile devices, yet existing power management mechanisms rely on static rules or coarse-grained heuristics that ignore user activities and personal preferences. We present PowerLens, a system that tames the reasoning power of Large Language Models (LLMs) for safe and personalized mobile power management on Android devices. The key idea is that LLMs’ commonsense reasoning can bridge the semantic gap between user activities and system parameters, enabling zero-shot, context-aware policy generation that adapts to individual preferences through implicit feedback. PowerLens employs a multi-agent architecture that recognizes user context from UI semantics and generates holistic power policies across 18 device parameters. A PDL-based constraint framework verifies every action before execution, while a two-tier memory system learns individualized preferences from implicit user overrides through confidence-based distillation, requiring no explicit configuration and converging within 3–5 days. Extensive experiments on a rooted Android device show that PowerLens achieves 81.7% action accuracy and 38.8% energy saving over stock Android, outperforming rule-based and LLM-based baselines, with high user satisfaction, fast preference convergence, and strong safety guarantees, with the system itself consuming only 0.5% of daily battery capacity.

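Keywords: Large Language Models, LLM Agents, Multi-agent Systems, Mobile Power Management, Personalized Policy Generation, Context-aware Reasoning, Android Devices, Energy Saving

Deep analysis:

PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management

Summary:

The paper presents PowerLens, a system that uses the reasoning ability of large language models (LLMs) to provide safe and personalized mobile power management on Android devices. To address existing power management mechanisms that rely on static rules or ignore user preferences, PowerLens uses a multi-agent architecture to recognize user context and generate holistic power policies across 18 device parameters. A constraint framework based on Propositional Dynamic Logic (PDL) verifies the safety of each action, and a two-tier memory system learns personal preferences from implicit feedback (such as the user manually adjusting settings) with no explicit configuration. Experiments show that PowerLens saves 38.8% energy over stock Android with 81.7% action accuracy, converges on user preferences within 3-5 days, and keeps system overhead low and safety high.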

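Innovations:

First to apply LLM agents to system-level resource management on mobile devices, using a multi-agent architecture (an activity agent and a policy agent) for context-aware power-policy generation.
Proposes a two-tier memory system that combines state differencing with confidence-based distillation to learn personalized preferences automatically from implicit user feedback, with no explicit user configuration.
Introduces a constraint-verification framework based on Propositional Dynamic Logic (PDL) that checks LLM-generated actions before execution, ensuring that policies respect device capabilities and application safety invariants and substantially reducing the safety-violation rate.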

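Methods

!!! info

The paper uses a multi-agent architecture that decomposes the problem into context recognition and policy generation. It relies on LLM commonsense reasoning to bridge the semantic gap between user activities and system parameters. A state-differencing mechanism captures the user's manual overrides as implicit feedback, and confidence-based distillation turns them into stable preference rules. For safety, a verifier based on Propositional Dynamic Logic (PDL) checks every generated action against constraints. Experiments run on a rooted Android device and are evaluated on PowerLensBench, a benchmark built to cover 7 application categories.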
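
The idea of verifying every generated action before execution can be illustrated with a much simpler stand-in than PDL. The sketch below uses plain range checks and predicate invariants; it is not a PDL implementation, and the parameter names, limits, and the example invariant are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    parameter: str
    value: float


# Hypothetical device capability limits (not taken from the paper).
LIMITS = {
    "screen_brightness": (10, 255),
    "refresh_rate_hz": (60, 120),
}


def no_dimming_during_navigation(action, context):
    """Hypothetical app-safety invariant: keep the screen readable while navigating."""
    dims_screen = action.parameter == "screen_brightness" and action.value < 100
    return not (context == "navigation" and dims_screen)


INVARIANTS = [no_dimming_during_navigation]


def verify(action, context):
    """Accept an action only if it is within device limits and breaks no invariant."""
    if action.parameter not in LIMITS:
        return False  # unknown parameter, e.g. a hallucinated setting
    low, high = LIMITS[action.parameter]
    if not low <= action.value <= high:
        return False
    return all(invariant(action, context) for invariant in INVARIANTS)


def filter_policy(actions, context):
    """Drop every action that fails verification before anything is executed."""
    return [action for action in actions if verify(action, context)]


proposed = [
    Action("screen_brightness", 40),   # violates the navigation invariant
    Action("refresh_rate_hz", 60),     # valid
    Action("gpu_turbo", 1),            # unknown parameter
]
print(filter_policy(proposed, "navigation"))
```

Putting the check between generation and execution means an invalid LLM output costs a skipped action, not an unsafe device state.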

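Key results:

PowerLens achieves 38.8% energy saving over stock Android.
Action accuracy reaches 81.7%, with a user experience rating of 4.3/5.0.
The safety-violation rate is only 0.6%; the PDL checker eliminates 96.5% of the violations in raw LLM output.
The two-tier memory system converges on user preferences within 3-5 days, and the system itself consumes only 0.5% of daily battery capacity.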
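
How implicit overrides might be turned into stable preference rules can be sketched as a two-tier store: a short-term tier that logs overrides and a long-term tier that holds promoted rules. The paper's actual confidence-based distillation procedure is not described in this report, so the confidence definition (share of the most frequent override value), the promotion threshold, and the minimum observation count below are all assumptions.

```python
from collections import Counter, defaultdict


class TwoTierMemory:
    """Hypothetical two-tier preference memory driven by user overrides."""

    def __init__(self, promote_at=0.75, min_observations=3):
        self.promote_at = promote_at
        self.min_observations = min_observations
        self.overrides = defaultdict(list)  # short-term tier: raw override log
        self.rules = {}                     # long-term tier: promoted preferences

    def record_override(self, context, parameter, user_value):
        """Log one manual override and promote it once it is consistent enough."""
        key = (context, parameter)
        self.overrides[key].append(user_value)
        values = self.overrides[key]
        top_value, top_count = Counter(values).most_common(1)[0]
        confidence = top_count / len(values)
        if len(values) >= self.min_observations and confidence >= self.promote_at:
            self.rules[key] = top_value
        return confidence


memory = TwoTierMemory()
for _ in range(3):
    memory.record_override("video", "screen_brightness", 200)
print(memory.rules)  # prints: {('video', 'screen_brightness'): 200}
```

Requiring several consistent overrides before promotion keeps a single accidental adjustment from becoming a permanent rule.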

Tech stack: Large language models (LLMs) as zero-shot system-level reasoners, Multi-agent architecture, Propositional Dynamic Logic (PDL) for constraint verification, State differencing, Confidence-based distillation, Android system-level APIs (root required)

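Strengths

Highly innovative: it is the first to bring LLM agents into mobile power management, moving past the limits of static rules and coarse-grained heuristics.
Personalized and automated: it learns user habits from implicit feedback, achieving personalized optimization with no explicit configuration and improving the user experience.
Strong safety: the PDL formal-verification framework effectively handles the hallucinations and invalid actions an LLM may produce, keeping the system stable.
Clear practical effect: it shows substantial energy savings and fairly high action accuracy on a real device, with very low system overhead.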

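Limitations

Root dependency: the implementation requires a rooted Android device, which limits direct deployment on ordinary consumer devices.
Privacy: although the paper does not discuss it in detail, analyzing UI semantics and user behavior may involve processing private user data.
Generalization: although the experiments cover a range of scenarios, performance on extreme edge cases or newly emerging application types may still need validation.
LLM overhead: although the system's own consumption is low, calling the LLM can add latency or a network dependency (if a cloud model is used), and inference latency may affect real-time responsiveness.

Relevance to the research direction:

The paper is highly relevant. It falls under both “innovation in the technical principles of large models and deep learning” and “research applications of large models in different domains”. It applies an LLM as a system-level reasoner to the concrete engineering domain of mobile power management, addressing semantic-understanding and personalization problems that traditional methods cannot handle. Its multi-agent architecture, PDL-based constraint verification, and implicit-feedback learning mechanism all show in-depth application and inventive refinement of large-model techniques, matching the user's interest in new techniques and innovation.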

4. Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Rea

Authors: Hui Zhong, Yichun Gao, Luyan Liu, Hai Yang, Wang Wang, Haowei Zhang, Xinhu Zheng · Source: arXiv · Published: 2026-03-20 · arXiv link: http://arxiv.org/abs/2603.20148v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

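Scoring rationale: The paper focuses on applying large multimodal models (LMMs) to reasoning about structural pathology in buildings, an application of large models to engineering science. The most relevant keywords are: (1) “Large Language Models” OR “LLMs” OR “Foundation Models” (10 points): the paper explicitly studies LMMs, which fall under foundation models; (2) “LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow” (10 points): the paper aims to advance autonomous AI agents in civil engineering; (3) “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (10 points): it is an AI-for-science application in an engineering domain; (4) “Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning” and “System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning” (5 points each): the cognitive dimensions on which the LMMs are evaluated include spatial localization and generative geometry segmentation, which involve multi-step and in-depth reasoning. Other keywords such as MoE, quantization, and alignment are not covered and score 0.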

!!! tip deepseek-chat TL;DR

The paper studies the reasoning ability of large multimodal models in building defect inspection. Using the DefectBench benchmark it constructs, it finds that current LMMs are strong at semantic understanding and topological awareness but weak at precise spatial localization, and it validates the feasibility of zero-shot generative segmentation.

Abstract (Chinese translation)

自动化建筑立面检测是城市韧性与智慧城市运维的关键组成部分。传统上，该领域依赖于专门的判别式模型（如YOLO、Mask R-CNN），这些模型擅长像素级定位，但局限于被动感知，且因缺乏对结构拓扑的视觉理解而泛化能力不足。大型多模态模型（Large Multimodal Models, LMMs）有望推动向主动推理的范式转变，然而其在此类高风险工程领域的应用尚缺乏严格的评估标准。为弥补这一空白，我们引入了一种人机协同的半自动化标注框架，利用专家提议验证机制，将12个分散的数据集统一为标准化、层次化的本体。在此基础上，我们提出了首个超越基础语义识别的多维度基准测试集 DefectBench，旨在系统检验LMMs的综合能力。DefectBench 在三个递进的认知维度（语义感知、空间定位与生成式几何分割）上评估了18个前沿的LMMs。大量实验表明，尽管当前LMMs展现出卓越的拓扑意识与语义理解能力（能有效诊断“是什么”和“如何发生”），但在度量定位精度（“在何处”）方面存在显著不足。然而，关键的是，我们验证了零样本生成式分割的可行性，证明通用基础模型无需领域特定训练即可媲美专门的监督网络。本研究不仅提供了严格的基准测试标准与高质量开源数据库，也为自主人工智能代理在土木工程领域的进步确立了新的基线。

Abstract

Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and worse generalization without the visual understanding to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing “what” and “how”), they exhibit significant deficiencies in metric localization precision (“where”). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.

Keywords: Large Multimodal Models, Building Inspection, Structural Pathology, Benchmark Evaluation, Generative Segmentation, Autonomous AI Agents, Civil Engineering, Zero-shot Learning

5. SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Sout

Authors: Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, Zhengyong Jiang · Source: arXiv · Published: 2026-03-20 · arXiv link: http://arxiv.org/abs/2603.19931v1

Score: 38.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

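Scoring rationale: The paper's core topic is applying LLMs to low-resource language translation, and it uses LLMs directly (10 points). It proposes the SAGE framework, whose core is a reinforcement learning (RL) agent that curates data, which is highly relevant to “LLM Agents” (10 points). The framework fine-tunes efficiently with LoRA, a core parameter-efficient fine-tuning technique (10 points). The work emphasizes the “right data” over “big data” and focuses on how data quality affects model performance, giving it some relevance to “Scaling Laws” AND “Data Quality” (8 points). The paper does not cover other keywords such as MoE, SLMs, pre-training, alignment, RAG, or inference acceleration, which score 0.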

!!! tip deepseek-chat TL;DR

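To address the scarcity of high-quality, culturally relevant data and the high training energy cost in translation for low-resource Southeast Asian languages, the paper proposes the Sustainable Agent-Guided Expert-tuning (SAGE) framework. It curates data with a reinforcement learning agent and fine-tunes with LoRA, achieving state-of-the-art translation performance while cutting data usage by 97.1% and training energy consumption by 95.2%.

Abstract (Chinese translation)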

构建包容性万维网的愿景正遭受严峻语言鸿沟的阻碍，这在东南亚资源匮乏地区尤为突出。尽管大语言模型（LLMs）为翻译提供了潜在解决方案，但其在数据稀缺环境中的部署面临双重挑战：高质量、文化适配数据的匮乏，以及在海量嘈杂网络语料上进行训练所产生的高昂能源成本。为化解数字包容性与环境可持续性之间的矛盾，我们提出了可持续智能体引导专家调优框架（Sustainable Agent-Guided Expert-tuning, SAGE）。该框架开创了一种能源感知新范式，强调“优质数据”优先于“海量数据”。SAGE摒弃对未过滤数据集进行高碳排放训练的传统方式，转而采用通过群体相对策略优化（Group Relative Policy Optimization, GRPO）训练的强化学习（RL）智能体，自主构建精炼训练集。该智能体利用从少量专家构建的社区对话集中提取的语义奖励信号，有效过滤噪声及文化失配内容。随后，我们采用低秩自适应技术（Low-Rank Adaptation, LoRA）在此精选数据上高效微调开源大语言模型。我们将SAGE应用于英语与七种东南亚低资源语言（low-resource languages, LRLs）的翻译任务。该方法在BLEU-4和COMET-22指标上取得了最先进的性能表现，能有效捕捉本地语言细微特征。至关重要的是，SAGE在超越基于完整数据集训练的基线模型的同时，将数据使用量降低了97.1%，训练能耗减少了95.2%。通过以最小环境代价实现高性能模型，SAGE为弥合全球南方数字鸿沟提供了一条可扩展且负责任的技术路径。

Abstract

The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the “right data” over “big data”. Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.

Keywords: Large Language Models, Low-Resource Languages, Sustainable AI, Reinforcement Learning Agent, Data Curation, LoRA, Translation, Energy Efficiency
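
The abstract says the curation agent is optimized with Group Relative Policy Optimization (GRPO). The defining step of GRPO is that each sampled output's advantage is its reward normalized against the other outputs in the same group, with no learned value function. The sketch below shows only that normalization. How SAGE defines its groups and semantic rewards is not described in this report, so the example rewards are made up, and the use of the population standard deviation and the `eps` term are implementation assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group: (r - mean) / (std + eps)."""
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an implementation choice
    return [(reward - group_mean) / (group_std + eps) for reward in rewards]


# Hypothetical semantic rewards for four keep/drop decisions in one group.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # prints: [1.0, -1.0, -1.0, 1.0]
```

Because advantages are relative within a group, a decision is reinforced only when it scores better than its siblings, which is what lets the method do without a separate critic.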

6. All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

Authors: Can Lv, Heng Chang, Yuchen Guo, Shengyu Tao, Shiji Zhou · Source: arXiv · Published: 2026-03-20 · arXiv link: http://arxiv.org/abs/2603.19595v1

Score: 35.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

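Scoring rationale: The paper proposes the All-Mem framework for building interactive agents (LLM agents) with lifelong memory. Its core idea is to manage the memory bank through dynamic topology evolution to improve retrieval and question answering. It is highly relevant to "LLM Agents" (10) because it directly studies agent systems. It is highly relevant to "Retrieval-Augmented Generation" (10) because the framework retrieves memories to augment generation, and to "Large Language Models" (10) because the abstract describes an LLM used as a diagnoser. It is partly relevant to "Context Window Extension" (5) because handling long-term memory indirectly involves context management. Other keywords, such as MoE, SFT, and RLHF, are not covered in the paper and score 0.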
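
The arithmetic behind the total and the pass mark can be reproduced in a few lines. The sketch below uses the threshold formula from the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and the non-zero relevance values from the table above. Two points are assumptions: that each keyword's score is weight × relevance, as the table rows suggest, and that the pass threshold is inclusive. The dictionary keys are shortened labels, not the exact query strings.

```python
# Threshold formula as stated in the report configuration.
BASE = 5.0
PER_WEIGHT = 0.8
NUM_KEYWORDS = 27
KEYWORD_WEIGHT = 1.0

total_weight = NUM_KEYWORDS * KEYWORD_WEIGHT
pass_score = BASE + PER_WEIGHT * total_weight

# Non-zero relevance values for All-Mem (out of 10); keys are shortened labels.
relevance = {
    "Large Language Models": 10.0,
    "Retrieval-Augmented Generation": 10.0,
    "Context Window Extension": 5.0,
    "LLM Agents": 10.0,
}


def paper_score(relevance, weight=KEYWORD_WEIGHT):
    # Assumption: per-keyword score = weight * relevance.
    return sum(weight * r for r in relevance.values())


score = paper_score(relevance)
passed = score >= pass_score  # assumption: the threshold is inclusive
print(f"score={score:.1f} pass_score={pass_score:.1f} passed={passed}")
```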

!!! tip deepseek-chat TL;DR

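This paper studies how memory systems in long-running interactive agents degrade as the interaction history grows. It proposes the All-Mem framework, which manages the memory bank through dynamic topology evolution, and reports better retrieval and QA results than existing baselines.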

摘要翻译

终身交互智能体需在数月或数年内持续协助用户，这要求其在固定的上下文与延迟预算下，持续写入长期记忆并为每个新查询检索正确的证据。现有记忆系统常随历史增长而性能下降，检索出的上下文易出现冗余、过时或噪声问题。我们提出All-Mem——一种在线/离线终身记忆框架，通过显式的非破坏性整合来维护拓扑结构化的记忆库，避免了基于摘要的压缩方法中典型的信息不可逆损失。在线运行时，该系统将检索锚定在有限的可见表层，以保持粗粒度搜索成本可控。在定期离线阶段，大型语言模型诊断器会提出带有置信度评分的拓扑编辑建议，通过SPLIT、MERGE和UPDATE三种算子进行门控执行，同时保留不可变的证据以确保可追溯性。在查询时，类型化链接支持在必要时从活跃锚点向归档证据进行跳数受限、预算可控的扩展。在LOCOMO和LONGMEMEVAL数据集上的实验表明，本方法在检索与问答任务上优于代表性基线模型。

Abstract

Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long-term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology-structured memory bank via explicit, non-destructive consolidation, avoiding the irreversible information loss typical of summarization-based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence-scored topology edits executed with gating using three operators: SPLIT, MERGE, and UPDATE, while preserving immutable evidence for traceability. At query time, typed links enable hop-bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LOCOMO and LONGMEMEVAL show improved retrieval and QA over representative baselines.
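
The offline consolidation step described above (confidence-scored edits, gating, and immutable evidence) can be pictured with a small sketch. Everything in it is hypothetical: the class and function names, the 0.8 threshold, the `merged_from` link type, and the example memories are illustrative assumptions, and the abstract does not specify how All-Mem implements any of them.

```python
from dataclasses import dataclass, field

ALLOWED_OPS = {"SPLIT", "MERGE", "UPDATE"}


@dataclass(frozen=True)
class ProposedEdit:
    op: str            # one of ALLOWED_OPS
    targets: tuple     # ids of the memory units the edit touches
    confidence: float  # diagnoser confidence in [0, 1]


@dataclass
class MemoryBank:
    evidence: dict = field(default_factory=dict)  # id -> text; never deleted
    visible: set = field(default_factory=set)     # ids on the visible surface
    links: list = field(default_factory=list)     # (source, link type, target)


def gate(edits, threshold=0.8):
    """Keep only well-formed edits whose confidence clears the threshold."""
    return [e for e in edits if e.op in ALLOWED_OPS and e.confidence >= threshold]


def apply_merge(bank, edit, new_id, merged_text):
    """Non-destructive MERGE: hide the originals but keep them linked."""
    bank.evidence[new_id] = merged_text
    bank.visible.add(new_id)
    for target in edit.targets:
        bank.visible.discard(target)  # no longer part of coarse search
        bank.links.append((new_id, "merged_from", target))  # still traceable


bank = MemoryBank(
    evidence={"m1": "User likes green tea.", "m2": "User prefers tea over coffee."},
    visible={"m1", "m2"},
)
proposals = [
    ProposedEdit("MERGE", ("m1", "m2"), confidence=0.92),
    ProposedEdit("UPDATE", ("m2",), confidence=0.40),  # rejected by the gate
]
for edit in gate(proposals):
    if edit.op == "MERGE":
        apply_merge(bank, edit, "m3", "User prefers tea, especially green tea.")
```

After the merge, only `m3` sits on the visible surface, while `m1` and `m2` remain in `evidence` and can be reached through the typed link. This is the sense in which consolidation is non-destructive.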

Keywords: lifelong interactive agents, memory framework, retrieval, topology evolution, LLM diagnoser, long-term memory, agentic memory, dynamic consolidation

7. TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

Authors: Xinyu Guo, Yazhou Zhang, Jing Qin Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19558v1

Score: 33.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

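Scoring rationale: The paper examines how effective and efficient reasoning strategies (such as CoT and ToT) are when LLMs perform text classification. It is highly relevant to "Large Language Models" and "Chain of Thought" (10 each) and relevant to "System 2 Thinking" (8) because it concerns deliberate reasoning. The paper also reports tests on small models, which gives partial relevance to "Small Language Models" (5). Other keywords, such as MoE, Scaling Laws, and Alignment, do not appear in the abstract and score 0.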

!!! tip deepseek-chat TL;DR

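Using the TextReasoningBench benchmark, this paper evaluates how well reasoning strategies work for text classification with LLMs. It finds that reasoning does not always improve performance and is often inefficient, and that complex methods can underperform simple baselines.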

摘要翻译

从大型语言模型中引出显式、逐步的推理轨迹已成为增强模型能力的主导范式。尽管此类推理策略最初是为需要显式多步推理的问题设计的，但它们已越来越多地应用于广泛的自然语言处理任务中。这种扩展隐含地假设审慎推理能一致地有益于异构任务。然而，此类推理机制是否真正有益于分类任务在很大程度上仍未得到充分探索，特别是考虑到其巨大的令牌和时间成本。为填补这一空白，我们引入了TextReasoningBench，这是一个旨在系统评估大型语言模型在文本分类任务中推理策略有效性与效率的基准。我们在五个文本分类数据集上，针对十种大型语言模型，比较了七种推理策略，即IO、思维链、自洽思维链、思维树、思维图、思维链束以及长思维链。除了准确率和宏观F1分数等传统指标外，我们引入了两个成本感知评估指标，用于量化每个推理令牌带来的性能增益，以及相对于令牌成本增长的性能提升效率。实验结果揭示了三个值得注意的发现：（1）推理并非普遍提升分类性能：虽然中等复杂度的策略如思维链和自洽思维链能带来一致但有限的增益（通常在大模型上为+1%至+3%），但更复杂的方法（例如思维树和思维图）往往无法超越更简单的基线，甚至可能降低性能，尤其是在小模型上；（2）推理通常是低效的：许多推理策略将令牌消耗增加了10倍至100倍（例如自洽思维链和思维树），却仅带来微小的性能提升。

Abstract

Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10× to 100× (e.g., SC-CoT and ToT) while providing only marginal performance improvements.

Keywords: Large Language Models, Reasoning Strategies, Text Classification, Chain of Thought, Efficiency Evaluation, Benchmark, Performance Gain, Token Cost

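In-depth analysis:

TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

Summary:

This paper questions the widespread assumption that reasoning strategies are universally beneficial for large language models (LLMs). It introduces the TextReasoningBench benchmark and systematically evaluates seven reasoning strategies (including CoT, ToT, and GoT) across ten LLMs and five text classification datasets. Beyond the usual accuracy and F1 metrics, the paper adds two cost-aware metrics that measure reasoning efficiency. The experiments show that reasoning does not always improve classification performance: complex reasoning methods often fail to beat simple baselines and can even degrade performance. Reasoning is also often highly inefficient, consuming many more tokens for only marginal gains, and overly long reasoning leads to "overthinking" and lower performance. These findings challenge the assumption that explicit reasoning universally benefits NLP tasks.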

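Contributions:

Proposes TextReasoningBench, presented as the first benchmark to comprehensively evaluate both the effectiveness and the efficiency of multiple reasoning strategies on text classification.
Introduces two cost-aware evaluation metrics (PCR and ME) that quantify the performance gain per reasoning token and the efficiency of improvement relative to token-cost growth.
Identifies task subjectivity as a key factor moderating the effectiveness of reasoning, and finds a non-monotonic relationship between reasoning length and performance (moderate reasoning helps, excessive reasoning hurts).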

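Method

!!! info

The study takes a benchmarking approach. It selects 10 LLMs of different sizes (listed as 4 small models and 5 large models) and runs experiments on 5 text classification datasets, covering objective tasks such as AGNews and subjective tasks such as SST-2. It compares 7 reasoning strategies: IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT. All experiments use a zero-shot setting and are repeated five times to account for randomness. Evaluation metrics include accuracy and macro-F1, plus the newly proposed absolute efficiency (F1 per token) and marginal efficiency.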
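
To make the two cost-aware metrics concrete, the sketch below computes an absolute efficiency (F1 per token) and a marginal efficiency (F1 gained per extra token relative to the IO baseline). Both formulas are assumptions inferred from the metric descriptions, not the paper's definitions, and the F1 and token figures are invented for illustration.

```python
def f1_per_token(f1, tokens):
    """Absolute efficiency: macro-F1 obtained per generated token."""
    return f1 / tokens


def marginal_efficiency(f1, tokens, base_f1, base_tokens):
    """F1 gained per extra token spent, relative to a baseline strategy."""
    extra_tokens = tokens - base_tokens
    if extra_tokens <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    return (f1 - base_f1) / extra_tokens


# Hypothetical (macro-F1, average tokens per example) for three strategies.
runs = {"IO": (0.80, 10), "CoT": (0.82, 200), "ToT": (0.79, 1000)}
base_f1, base_tokens = runs["IO"]

report = {}
for name, (f1, tokens) in runs.items():
    if name == "IO":
        continue
    report[name] = marginal_efficiency(f1, tokens, base_f1, base_tokens)
    print(f"{name}: F1/token={f1_per_token(f1, tokens):.5f} "
          f"marginal={report[name]:+.6f}")
```

Under these made-up numbers, CoT buys a small positive gain per extra token, while ToT spends far more tokens and ends up with a negative marginal efficiency because its F1 falls below the baseline.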

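Key results:

Reasoning does not universally improve classification performance: CoT and SC-CoT yield small gains on large models (+1% to +3%), while complex methods such as ToT and GoT often underperform.
Reasoning is inefficient: many strategies increase token consumption by 10× to 100× for negligible performance gains, resulting in negative marginal efficiency.
Reasoning length and performance are non-monotonically related: moderate-length reasoning helps, but excessive reasoning degrades performance and causes the model to "overthink".

Tech stack: IO (Input-Output), CoT (Chain-of-Thought), SC-CoT (Self-Consistency CoT), ToT (Tree-of-Thoughts), GoT (Graph-of-Thoughts), BoC (Bagging of Cues), Long-CoT, LLMs (e.g., Llama 2-7B), Metrics: Accuracy, Macro-F1, PCR, ME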

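Strengths

Challenges the prevailing assumption that reasoning helps everywhere and takes a critical stance.
Introduces cost-aware metrics that consider computational efficiency as well as performance, which matches practical deployment needs.
Uses a comprehensive experimental design covering multiple model sizes, reasoning strategies, and types of classification task (objective and subjective).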

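Limitations

Although 10 models are covered, there is room to report model architecture details and to test more recent models.
The study focuses on text classification, so whether the conclusions carry over to other NLP tasks (such as generation) still needs to be verified.
The zero-shot setting may not fully represent how models behave after task-specific instruction tuning.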

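Relevance to the research direction:

The paper is highly relevant to "innovation in the technical principles of large models and deep learning". It examines in depth how the core reasoning mechanisms of large models (CoT, ToT, and others) actually perform on basic NLP tasks and where they fall short, which makes it a critical evaluation of those principles. It does not directly address scientific applications such as biomedicine, but its analysis of reasoning efficiency offers useful guidance for deploying large models across domains. It is fairly novel and challenges the prevailing paradigm.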

8. DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Authors: Xuan Qi, Luxi He, Dan Roth, Xingyu Fu Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19688v1

Score: 30.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

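Scoring rationale: The paper studies supervision data selection for multimodal large language models (MLLMs). It is highly relevant to "Large Language Models" (10) because MLLMs extend LLMs, and to "Post-training" (10) because the focus is on predicting post-training performance gains. It is partly relevant to "Scaling Laws" AND "Data Quality" (5) because it concerns how data quality affects performance, and to "Pre-training" (5) because supervision data selection belongs to the pre-training or adaptation stage. Other keywords, such as MoE, SLMs, RLHF, and RAG, are not directly related to the paper and score 0.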

!!! tip deepseek-chat TL;DR

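This paper studies how to predict the effect of a training dataset on target-benchmark performance for multimodal large language models. It proposes DATAPROPHET, a training-free metric that selects supervision data effectively and improves performance.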

摘要翻译

为多模态大语言模型（MLLMs）选择监督数据的传统思路是优先选择与目标基准看似相似的数据集，例如文本密集型或视觉中心型任务。然而，这种直观的相似性能否可靠地预测下游性能提升尚不明确。在本研究中，我们首次尝试回答一个实际问题：能否在训练开始前就预估某个训练数据集对目标基准的影响？为探究此问题，我们对涵盖7种不同任务的14个视觉-语言数据集间的迁移效应进行了深入分析。结果表明，直观的任务相似性并非迁移能力的可靠预测指标，泛化性能更多地取决于具体数据集而非其宽泛的任务类别。基于此发现，我们提出了DATAPROPHET——一种简单有效的免训练度量方法，它融合了多模态困惑度、相似性和数据多样性。实验表明，DATAPROPHET生成的监督数据排序与实际训练后性能增益的排序高度相关，肯德尔tau系数达到86.0%。此外，DATAPROPHET能实现更优的监督数据选择：相比均匀选择提升最高达6.9%，优于当前最先进的基于训练的基线方法1.4%，甚至比基于实验性能的预言机选择高出0.2%。我们的代码与数据将公开释放。

Abstract

Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall’s tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
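
The reported Kendall's tau measures how well a ranking predicted before training agrees with the ranking by measured gains. A minimal pure-Python sketch of the statistic (the tau-a variant, which assumes no ties) is shown below. The predicted scores and measured gains are invented for illustration and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical training-free scores and measured post-training gains
# for four candidate datasets (values invented for illustration).
predicted_score = [0.9, 0.7, 0.4, 0.2]
measured_gain = [3.1, 1.0, 2.0, 0.5]

tau = kendall_tau(predicted_score, measured_gain)
print(f"Kendall tau = {tau:.3f}")
```

A value of 1.0 means the two rankings agree on every pair of datasets, and each swapped pair lowers the value. In practice a library routine that handles ties would be a natural replacement for this hand-rolled version.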

Keywords: Multimodal Large Language Models, Supervision Data Selection, Transferability, Training-free Metric, DataProphet, Vision-Language Datasets, Post-training Performance

📋 All papers

1. ✅ Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

Authors: Sai Koneru, Elphin Joe, Christine Kirchhoff, Jian Wu, Sarah Rajtmajer Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20162v1

Score: 65.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	10.0/10	10.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	5.0/10	5.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	10.0/10	10.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

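Scoring rationale: The paper studies how faithfully instruction-tuned language models adhere to in-context evidence under user pressure, so it is highly relevant to "Large Language Models", "Instruction Tuning", "Post-training", "Hallucination Mitigation", and "In-context Learning" (10 each). It evaluates models from 0.27B to 32B parameters, including smaller ones, which gives partial relevance to "Small Language Models" (5). It analyzes model behavior, which gives partial relevance to "Explainable AI" (5), and it uses climate assessment data, which gives partial relevance to "AI for Science" (5). Other keywords, such as MoE, Scaling Laws, RLHF, and RAG, are not covered in the paper and score 0.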

!!! tip deepseek-chat TL;DR

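This paper studies whether instruction-tuned language models stay faithful to in-context evidence under user pressure. It finds that models still tend to yield to user pressure even when given rich evidence, that research gaps in the evidence increase sycophancy, that robustness varies non-monotonically with scale, and that models differ in how their response distributions behave under pressure.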

摘要翻译

在存在争议的领域中，指令微调语言模型必须在适应用户需求的压力与忠实于上下文证据之间取得平衡。为评估这种张力，我们基于《美国国家气候评估》构建了一个受控认知冲突框架。我们对19个参数量从0.27B到32B不等的指令微调模型进行了细粒度消融实验，涵盖证据构成和不确定性线索两个维度。在中性提示下，更丰富的证据通常能提升证据一致性准确率与序数评分性能。然而在受控固定证据设置中，当面临用户压力时，证据并不能可靠地防止模型出现迎合用户倾向的立场逆转。我们报告了三种主要失效模式：首先，我们发现了负面部分证据交互现象——增加认知细微差别（特别是研究空白）会加剧Llama-3和Gemma-3等模型系列对迎合倾向的敏感性；其次，鲁棒性呈现非单调缩放规律：在某些模型系列中，部分低至中等规模的模型对对抗性用户压力尤为敏感；第三，模型在冲突下的分布集中度存在差异：部分指令微调模型在压力下能保持尖锐的峰值序数分布，而其他模型的分布则显著更分散；在规模匹配的Qwen模型对比中，经过推理蒸馏的变体（DeepSeek-R1-Qwen）始终比其指令微调版本表现出更高的分布离散度。这些发现表明，在受控固定证据设置中，若缺乏针对认知完整性的显式训练，仅提供更丰富的上下文证据并不能保证模型抵御用户压力。

Abstract

In contested domains, instruction-tuned language models must balance user-alignment pressures against faithfulness to the in-context evidence. To evaluate this tension, we introduce a controlled epistemic-conflict framework grounded in the U.S. National Climate Assessment. We conduct fine-grained ablations over evidence composition and uncertainty cues across 19 instruction-tuned models spanning 0.27B to 32B parameters. Across neutral prompts, richer evidence generally improves evidence-consistent accuracy and ordinal scoring performance. Under user pressure, however, evidence does not reliably prevent user-aligned reversals in this controlled fixed-evidence setting. We report three primary failure modes. First, we identify a negative partial-evidence interaction, where adding epistemic nuance, specifically research gaps, is associated with increased susceptibility to sycophancy in families like Llama-3 and Gemma-3. Second, robustness scales non-monotonically: within some families, certain low-to-mid scale models are especially sensitive to adversarial user pressure. Third, models differ in distributional concentration under conflict: some instruction-tuned models maintain sharply peaked ordinal distributions under pressure, while others are substantially more dispersed; in scale-matched Qwen comparisons, reasoning-distilled variants (DeepSeek-R1-Qwen) exhibit consistently higher dispersion than their instruction-tuned counterparts. These findings suggest that, in a controlled fixed-evidence setting, providing richer in-context evidence alone offers no guarantee against user pressure without explicit training for epistemic integrity.

Keywords: instruction-tuned language models, evidence grounding, user pressure, epistemic conflict, sycophancy, in-context learning, model robustness, U.S. National Climate Assessment

2. ✅ WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Authors: Ziya Erkoç, Angela Dai, Matthias Nießner Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19708v1

Score: 43.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	8.0/10	8.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

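This paper asks whether 2D foundation image models have 3D world-model capabilities. It proposes a multi-agent architecture that achieves 3D-consistent world synthesis, which the authors take as evidence that these models do encode an understanding of the 3D world.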

摘要翻译

鉴于二维基础图像模型生成高保真输出的卓越能力，我们探究了一个根本性问题：二维基础图像模型是否内在地具备三维世界模型能力？为回答此问题，我们系统性地评估了多种先进图像生成模型和视觉语言模型在三维世界合成任务上的表现。为利用并基准测试其潜在的隐式三维能力，我们提出一种智能体框架以促进三维世界生成。我们的方法采用多智能体架构：一个基于视觉语言模型的导演智能体负责制定提示以引导图像合成，一个生成器负责合成新的图像视角，以及一个由视觉语言模型支持的两步验证器，从二维图像和三维重建空间两个维度评估并筛选生成的帧。关键的是，我们证明了我们的智能体方法能够实现连贯且稳健的三维重建，生成的输出场景可通过渲染新视角进行探索。通过对多种基础模型的大量实验，我们证明二维模型确实内蕴了对三维世界的理解。通过利用这种理解，我们的方法成功合成了广阔、真实且三维一致的世界。

Abstract

Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framing to facilitate 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames from both 2D image and 3D reconstruction space. Crucially, we demonstrate that our agentic approach provides coherent and robust 3D reconstruction, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we demonstrate that 2D models do indeed encapsulate a grasp of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.

Keywords: Foundation Image Models, 3D World Models, Vision-Language Models, Multi-agent Architecture, 3D World Synthesis, Agentic Framing, 3D Reconstruction, Novel View Rendering

3. ✅ PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	10.0/10	10.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

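This paper targets mobile battery management that relies on static rules and ignores users' individual needs. It proposes PowerLens, which uses a multi-agent LLM architecture to generate context-aware, personalized power policies. Experiments on an Android device show 81.7% action accuracy and 38.8% energy saving.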

摘要翻译

电池续航能力仍是移动设备面临的关键挑战，而现有的电源管理机制依赖静态规则或粗粒度启发式策略，忽略了用户活动与个人偏好。本文提出PowerLens系统，该系统利用大语言模型（LLMs）的推理能力，在Android设备上实现安全且个性化的移动电源管理。其核心思想在于：大语言模型的常识推理能够弥合用户活动与系统参数之间的语义鸿沟，从而通过隐式反馈生成零样本、上下文感知的策略，以适应个体偏好。PowerLens采用多智能体架构，通过界面语义识别用户上下文，并针对18项设备参数生成全局性电源策略。系统通过基于PDL的约束框架在执行前验证所有操作，同时采用双层记忆系统，通过基于置信度蒸馏的方法从用户隐式覆盖行为中学习个体偏好，无需显式配置，并在3至5天内实现偏好收敛。在已获取root权限的Android设备上进行的大量实验表明，PowerLens相比原生Android系统实现了81.7%的操作准确率和38.8%的节能效果，其性能优于基于规则和基于大语言模型的基线方法，同时具备高用户满意度、快速偏好收敛和强安全性保障，且系统自身仅消耗每日电池容量的0.5%。

Abstract

Battery life remains a critical challenge for mobile devices, yet existing power management mechanisms rely on static rules or coarse-grained heuristics that ignore user activities and personal preferences. We present PowerLens, a system that tames the reasoning power of Large Language Models (LLMs) for safe and personalized mobile power management on Android devices. The key idea is that LLMs’ commonsense reasoning can bridge the semantic gap between user activities and system parameters, enabling zero-shot, context-aware policy generation that adapts to individual preferences through implicit feedback. PowerLens employs a multi-agent architecture that recognizes user context from UI semantics and generates holistic power policies across 18 device parameters. A PDL-based constraint framework verifies every action before execution, while a two-tier memory system learns individualized preferences from implicit user overrides through confidence-based distillation, requiring no explicit configuration and converging within 3–5 days. Extensive experiments on a rooted Android device show that PowerLens achieves 81.7% action accuracy and 38.8% energy saving over stock Android, outperforming rule-based and LLM-based baselines, with high user satisfaction, fast preference convergence, and strong safety guarantees, with the system itself consuming only 0.5% of daily battery capacity.

Keywords: Large Language Models, LLM Agents, Multi-agent Systems, Mobile Power Management, Personalized Policy Generation, Context-aware Reasoning, Android Devices, Energy Saving

4. ✅ Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning

Authors: Hui Zhong, Yichun Gao, Luyan Liu, Hai Yang, Wang Wang, Haowei Zhang, Xinhu Zheng Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20148v1

Score: 40.0 / 26.6 ✅

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	5.0/10	5.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	5.0/10	5.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	10.0/10	10.0

!!! tip deepseek-chat TL;DR

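This paper studies the reasoning ability of large multimodal models in building defect inspection. Using the DefectBench benchmark, it finds that current LMMs are strong in semantic understanding and topological awareness but weak in precise spatial localization, and it validates the feasibility of zero-shot generative segmentation.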

摘要翻译

自动化建筑立面检测是城市韧性与智慧城市运维的关键组成部分。传统上，该领域依赖于专门的判别式模型（如YOLO、Mask R-CNN），这些模型擅长像素级定位，但局限于被动感知，且因缺乏对结构拓扑的视觉理解而泛化能力不足。大型多模态模型（Large Multimodal Models, LMMs）有望推动向主动推理的范式转变，然而其在此类高风险工程领域的应用尚缺乏严格的评估标准。为弥补这一空白，我们引入了一种人机协同的半自动化标注框架，利用专家提议验证机制，将12个分散的数据集统一为标准化、层次化的本体。在此基础上，我们提出了首个超越基础语义识别的多维度基准测试集——\textit{DefectBench}，旨在系统检验LMMs的综合能力。\textit{DefectBench} 在三个递进的认知维度——语义感知、空间定位与生成式几何分割——上评估了18个前沿的LMMs。大量实验表明，尽管当前LMMs展现出卓越的拓扑意识与语义理解能力（能有效诊断“是什么”和“如何发生”），但在度量定位精度（“在何处”）方面存在显著不足。然而，关键的是，我们验证了零样本生成式分割的可行性，证明通用基础模型无需领域特定训练即可媲美专门的监督网络。本研究不仅提供了严格的基准测试标准与高质量开源数据库，也为自主人工智能代理在土木工程领域的进步确立了新的基线。

Abstract

Automated building facade inspection is a critical component of urban resilience and smart city maintenance. Traditionally, this field has relied on specialized discriminative models (e.g., YOLO, Mask R-CNN) that excel at pixel-level localization but are constrained to passive perception and generalize worse, lacking the visual understanding to interpret structural topology. Large Multimodal Models (LMMs) promise a paradigm shift toward active reasoning, yet their application in such high-stakes engineering domains lacks rigorous evaluation standards. To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology. Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition. DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation. Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing “what” and “how”), they exhibit significant deficiencies in metric localization precision (“where”). Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training. This work provides both a rigorous benchmarking standard and a high-quality open-source database, establishing a new baseline for the advancement of autonomous AI agents in civil engineering.

Keywords: Large Multimodal Models, Building Inspection, Structural Pathology, Benchmark Evaluation, Generative Segmentation, Autonomous AI Agents, Civil Engineering, Zero-shot Learning

5. ✅ SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia

Score: 38.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	8.0/10	8.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	10.0/10	10.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0
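
The total above can be reproduced from the non-zero rows of the table. The sketch below assumes that each keyword score is weight × relevance (consistent with every row shown, where the weight is 1.0) and uses the pass-threshold formula from the report header, 5.0 + 0.8 × total weight. The function names are illustrative and are not taken from the report generator.

```python
def pass_threshold(total_weight, base=5.0, slope=0.8):
    """Pass threshold as stated in the report header: base + slope * total weight."""
    return base + slope * total_weight


def total_score(rows):
    """Sum per-keyword scores; each row is (weight, relevance out of 10).

    Assumption: score = weight * relevance, which matches the table rows.
    """
    return sum(weight * relevance for weight, relevance in rows)


# Non-zero rows of the table above; the remaining 23 keywords score 0.
sage_rows = [(1.0, 10.0), (1.0, 8.0), (1.0, 10.0), (1.0, 10.0)]

score = total_score(sage_rows)
threshold = pass_threshold(27.0)  # 27 keywords, each with weight 1.0
print(score, round(threshold, 1), score >= threshold)
```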

!!! tip deepseek-chat TL;DR

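To address the scarcity of high-quality, culturally relevant data and the high training energy cost in low-resource Southeast Asian language translation, this paper proposes the Sustainable Agent-Guided Expert-tuning (SAGE) framework. It uses a reinforcement learning agent to curate the data and LoRA for fine-tuning, achieving state-of-the-art translation performance while cutting data usage by 97.1% and energy consumption by 95.2%.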

Abstract (Chinese translation)

构建包容性万维网的愿景正遭受严峻语言鸿沟的阻碍，这在东南亚资源匮乏地区尤为突出。尽管大语言模型（LLMs）为翻译提供了潜在解决方案，但其在数据稀缺环境中的部署面临双重挑战：高质量、文化适配数据的匮乏，以及在海量嘈杂网络语料上进行训练所产生的高昂能源成本。为化解数字包容性与环境可持续性之间的矛盾，我们提出了可持续智能体引导专家调优框架（Sustainable Agent-Guided Expert-tuning, SAGE）。该框架开创了一种能源感知新范式，强调“优质数据”优先于“海量数据”。SAGE摒弃对未过滤数据集进行高碳排放训练的传统方式，转而采用通过群体相对策略优化（Group Relative Policy Optimization, GRPO）训练的强化学习（RL）智能体，自主构建精炼训练集。该智能体利用从少量专家构建的社区对话集中提取的语义奖励信号，有效过滤噪声及文化失配内容。随后，我们采用低秩自适应技术（Low-Rank Adaptation, LoRA）在此精选数据上高效微调开源大语言模型。我们将SAGE应用于英语与七种东南亚低资源语言（low-resource languages, LRLs）的翻译任务。该方法在BLEU-4和COMET-22指标上取得了最先进的性能表现，能有效捕捉本地语言细微特征。至关重要的是，SAGE在超越基于完整数据集训练的基线模型的同时，将数据使用量降低了97.1%，训练能耗减少了95.2%。通过以最小环境代价实现高性能模型，SAGE为弥合全球南方数字鸿沟提供了一条可扩展且负责任的技术路径。

Abstract

The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the “right data” over “big data”. Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
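
The abstract names Group Relative Policy Optimization (GRPO) as the agent's training method. As a rough illustration of the group-relative idea only, the sketch below normalizes each reward against the mean and standard deviation of its group. How SAGE forms groups and computes its semantic reward is not stated in the abstract, so the reward values and the function name here are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Using the population standard deviation is an assumption;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical semantic rewards for three candidate curation decisions.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Candidates that score above the group mean receive a positive advantage and those below receive a negative one, so the policy is pushed toward relatively better choices without a separate value model.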

Keywords: Large Language Models, Low-Resource Languages, Sustainable AI, Reinforcement Learning Agent, Data Curation, LoRA, Translation, Energy Efficiency

6. ✅ All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

Authors: Can Lv, Heng Chang, Yuchen Guo, Shengyu Tao, Shiji Zhou Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19595v1

Score: 35.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	5.0/10	5.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	10.0/10	10.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

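This paper studies how memory systems in long-term interactive agents degrade as history grows and proposes the All-Mem framework, which manages the memory bank through dynamic topology evolution. Experiments show that it outperforms existing baselines on retrieval and question-answering tasks.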

Abstract (Chinese translation)

终身交互智能体需在数月或数年内持续协助用户，这要求其在固定的上下文与延迟预算下，持续写入长期记忆并为每个新查询检索正确的证据。现有记忆系统常随历史增长而性能下降，检索出的上下文易出现冗余、过时或噪声问题。我们提出All-Mem——一种在线/离线终身记忆框架，通过显式的非破坏性整合来维护拓扑结构化的记忆库，避免了基于摘要的压缩方法中典型的信息不可逆损失。在线运行时，该系统将检索锚定在有限的可见表层，以保持粗粒度搜索成本可控。在定期离线阶段，大型语言模型诊断器会提出带有置信度评分的拓扑编辑建议，通过SPLIT、MERGE和UPDATE三种算子进行门控执行，同时保留不可变的证据以确保可追溯性。在查询时，类型化链接支持在必要时从活跃锚点向归档证据进行跳数受限、预算可控的扩展。在LOCOMO和LONGMEMEVAL数据集上的实验表明，本方法在检索与问答任务上优于代表性基线模型。

Abstract

Lifelong interactive agents are expected to assist users over months or years, which requires continually writing long-term memories while retrieving the right evidence for each new query under fixed context and latency budgets. Existing memory systems often degrade as histories grow, yielding redundant, outdated, or noisy retrieved contexts. We present All-Mem, an online/offline lifelong memory framework that maintains a topology-structured memory bank via explicit, non-destructive consolidation, avoiding the irreversible information loss typical of summarization-based compression. In online operation, it anchors retrieval on a bounded visible surface to keep coarse search cost bounded. Periodically offline, an LLM diagnoser proposes confidence-scored topology edits executed with gating using three operators: SPLIT, MERGE, and UPDATE, while preserving immutable evidence for traceability. At query time, typed links enable hop-bounded, budgeted expansion from active anchors to archived evidence when needed. Experiments on LOCOMO and LONGMEMEVAL show improved retrieval and QA over representative baselines.
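
The hop-bounded, budgeted expansion mentioned in the abstract can be pictured as a breadth-first walk over typed links that stops at a maximum hop count or node budget. This is a minimal sketch under that assumption, not All-Mem's actual algorithm; the node IDs, link type names, and the `expand` function are all hypothetical.

```python
from collections import deque


def expand(anchors, links, max_hops, budget):
    """Breadth-first expansion from anchors, capped by hops and a node budget.

    links maps a node ID to a list of (link_type, neighbor_id) pairs.
    Returns visited node IDs in discovery order, anchors first.
    """
    seen = list(dict.fromkeys(anchors))[:budget]  # de-duplicate, keep order
    visited = set(seen)
    queue = deque((node, 0) for node in seen)
    while queue and len(seen) < budget:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # hop limit reached; do not follow links from here
        for _link_type, neighbor in links.get(node, []):
            if neighbor not in visited and len(seen) < budget:
                visited.add(neighbor)
                seen.append(neighbor)
                queue.append((neighbor, hops + 1))
    return seen


# Hypothetical memory graph: one active anchor linked to archived evidence.
links = {
    "a1": [("supports", "e1"), ("supersedes", "e2")],
    "e1": [("derived_from", "e3")],
}
print(expand(["a1"], links, max_hops=1, budget=10))  # ['a1', 'e1', 'e2']
```

Raising `max_hops` reaches deeper archived evidence, while `budget` caps how many nodes can enter the retrieved context.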

Keywords: lifelong interactive agents, memory framework, retrieval, topology evolution, LLM diagnoser, long-term memory, agentic memory, dynamic consolidation

7. ✅ TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

Authors: Xinyu Guo, Yazhou Zhang, Jing Qin Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19558v1

Score: 33.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	5.0/10	5.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	10.0/10	10.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	8.0/10	8.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

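Using the TextReasoningBench benchmark, this paper evaluates how effective reasoning strategies are for text classification with LLMs. It finds that reasoning does not always improve performance and is often inefficient, and that complex methods can underperform simple baselines.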

Abstract (Chinese translation)

从大型语言模型中引出显式、逐步的推理轨迹已成为增强模型能力的主导范式。尽管此类推理策略最初是为需要显式多步推理的问题设计的，但它们已越来越多地应用于广泛的自然语言处理任务中。这种扩展隐含地假设审慎推理能一致地有益于异构任务。然而，此类推理机制是否真正有益于分类任务在很大程度上仍未得到充分探索，特别是考虑到其巨大的令牌和时间成本。为填补这一空白，我们引入了TextReasoningBench，这是一个旨在系统评估大型语言模型在文本分类任务中推理策略有效性与效率的基准。我们在五个文本分类数据集上，针对十种大型语言模型，比较了七种推理策略，即IO、思维链、自洽思维链、思维树、思维图、思维链束以及长思维链。除了准确率和宏观F1分数等传统指标外，我们引入了两个成本感知评估指标，用于量化每个推理令牌带来的性能增益，以及相对于令牌成本增长的性能提升效率。实验结果揭示了三个值得注意的发现：（1）推理并非普遍提升分类性能：虽然中等复杂度的策略如思维链和自洽思维链能带来一致但有限的增益（通常在大模型上为+1%至+3%），但更复杂的方法（例如思维树和思维图）往往无法超越更简单的基线，甚至可能降低性能，尤其是在小模型上；（2）推理通常是低效的：许多推理策略将令牌消耗增加了10倍至100倍（例如自洽思维链和思维树），却仅带来微小的性能提升。

Abstract

Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification tasks remains largely underexplored, especially considering their substantial token and time costs. To fill this gap, we introduce TextReasoningBench, a systematic benchmark designed to evaluate the effectiveness and efficiency of reasoning strategies for text classification with LLMs. We compare seven reasoning strategies, namely IO, CoT, SC-CoT, ToT, GoT, BoC, and long-CoT across ten LLMs on five text classification datasets. Beyond traditional metrics such as accuracy and macro-F1, we introduce two cost-aware evaluation metrics that quantify the performance gain per reasoning token and the efficiency of performance improvement relative to token cost growth. Experimental results reveal three notable findings: (1) Reasoning does not universally improve classification performance: while moderate strategies such as CoT and SC-CoT yield consistent but limited gains (typically +1% to +3% on big models), more complex methods (e.g., ToT and GoT) often fail to outperform simpler baselines and can even degrade performance, especially on small models; (2) Reasoning is often inefficient: many reasoning strategies increase token consumption by 10× to 100× (e.g., SC-CoT and ToT) while providing only marginal performance improvements.
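
The abstract describes two cost-aware metrics but does not give their formulas. The sketch below shows one plausible reading of each: accuracy gain per 1,000 extra tokens, and relative accuracy gain divided by relative token growth. Both definitions, the function names, and the numbers are assumptions made for illustration, not the benchmark's actual metrics or results.

```python
def gain_per_1k_tokens(base_acc, acc, base_tokens, tokens):
    """Accuracy gain per 1,000 extra tokens relative to the baseline strategy."""
    extra = tokens - base_tokens
    if extra <= 0:
        raise ValueError("strategy must use more tokens than the baseline")
    return (acc - base_acc) / extra * 1000


def relative_efficiency(base_acc, acc, base_tokens, tokens):
    """Relative accuracy gain divided by relative token growth."""
    rel_gain = (acc - base_acc) / base_acc
    rel_cost = (tokens - base_tokens) / base_tokens
    return rel_gain / rel_cost


# Hypothetical numbers: an IO baseline at 50 tokens per example and 80%
# accuracy versus a reasoning strategy at 550 tokens and 82% accuracy.
print(gain_per_1k_tokens(0.80, 0.82, 50, 550))
print(relative_efficiency(0.80, 0.82, 50, 550))
```

Under these definitions, a strategy that multiplies token use for a gain of a few points yields a relative efficiency far below 1, which is the kind of inefficiency the second finding describes.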

Keywords: Large Language Models, Reasoning Strategies, Text Classification, Chain of Thought, Efficiency Evaluation, Benchmark, Performance Gain, Token Cost

8. ✅ DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

Authors: Xuan Qi, Luxi He, Dan Roth, Xingyu Fu Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19688v1

Score: 30.0 / 26.6 ✅

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	5.0/10	5.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	5.0/10	5.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

!!! tip deepseek-chat TL;DR

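This paper studies how to predict the effect of a training dataset on target-benchmark performance for multimodal large language models, and proposes DATAPROPHET, a training-free metric that effectively selects supervision data and improves performance.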

Abstract (Chinese translation)

为多模态大语言模型（MLLMs）选择监督数据的传统思路是优先选择与目标基准看似相似的数据集，例如文本密集型或视觉中心型任务。然而，这种直观的相似性能否可靠地预测下游性能提升尚不明确。在本研究中，我们首次尝试回答一个实际问题：能否在训练开始前就预估某个训练数据集对目标基准的影响？为探究此问题，我们对涵盖7种不同任务的14个视觉-语言数据集间的迁移效应进行了深入分析。结果表明，直观的任务相似性并非迁移能力的可靠预测指标，泛化性能更多地取决于具体数据集而非其宽泛的任务类别。基于此发现，我们提出了DATAPROPHET——一种简单有效的免训练度量方法，它融合了多模态困惑度、相似性和数据多样性。实验表明，DATAPROPHET生成的监督数据排序与实际训练后性能增益的排序高度相关，肯德尔tau系数达到86.0%。此外，DATAPROPHET能实现更优的监督数据选择：相比均匀选择提升最高达6.9%，优于当前最先进的基于训练的基线方法1.4%，甚至比基于实验性能的预言机选择高出0.2%。我们的代码与数据将公开释放。

Abstract

Conventional wisdom for selecting supervision data for multimodal large language models (MLLMs) is to prioritize datasets that appear similar to the target benchmark, such as text-intensive or vision-centric tasks. However, it remains unclear whether such intuitive similarity reliably predicts downstream performance gains. In this work, we take a first step toward answering a practical question: can we estimate the influence of a training dataset on a target benchmark before any training is performed? To investigate this question, we conduct an in-depth analysis of transfer across 14 vision-language datasets spanning 7 diverse tasks. Our results show that intuitive task similarity is an unreliable predictor of transferability, and that generalization depends more on the specific dataset than on its broad task category. Motivated by this finding, we propose DATAPROPHET, a simple and effective training-free metric that combines multimodal perplexity, similarity, and data diversity. Experiments show that DATAPROPHET produces supervision-data rankings that strongly correlate with rankings based on actual post-training performance gains, achieving a Kendall’s tau of 86.0%. Moreover, DATAPROPHET enables better supervision-data selection, yielding up to 6.9% improvement over uniform selection, 1.4% over a state-of-the-art training-based baseline, and 0.2% above oracle selection based on experimental performance. Our code and data will be released.
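
The reported agreement between predicted and actual rankings is measured with Kendall's tau. For readers unfamiliar with the statistic, here is a minimal sketch of tau-a for data without ties: the number of concordant pairs minus discordant pairs, divided by the total number of pairs. The example values are made up and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length sequences without ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical predicted ranks versus ranks by measured gain for four datasets.
predicted = [1, 2, 3, 4]
measured = [1, 3, 2, 4]
print(kendall_tau(predicted, measured))
```

A value of 1.0 means the two rankings agree on every pair, and -1.0 means they are exactly reversed.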

Keywords: Multimodal Large Language Models, Supervision Data Selection, Transferability, Training-free Metric, DataProphet, Vision-Language Datasets, Post-training Performance

9. ❌ PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction

Authors: Runsong Zhao, Shilei Liu, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Yujin Yuan, Tong Xiao, Jingbo Zhu, Wenbo Su, Bo Zheng Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19733v1

Score: 26.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	8.0/10	8.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	8.0/10	8.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

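Scoring rationale: The paper's core topic is context compression for LLMs to reduce inference cost, so it is highly relevant to “Large Language Models” (10 points). The work shortens context length, which relates to “Context Window Extension” (8 points), since context compression can be seen as a complement to context window extension. Its goal of lowering inference cost relates to “Speculative Decoding” (8 points), since both target inference efficiency. Other keywords such as MoE, SLMs, training methods, alignment, and agents are not covered in the paper and score 0.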

!!! tip deepseek-chat TL;DR

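Existing LLM context compression methods specify a compression ratio, which leads to unpredictable performance degradation. This paper proposes Performance-oriented Context Compression (PoC), a paradigm in which a lightweight performance predictor automatically finds the most aggressive compression ratio that satisfies a performance constraint, enabling more reliable and efficient deployment of context compression on question-answering and summarization tasks.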

Abstract (Chinese translation)

尽管上下文压缩能够通过缩短上下文来缓解大型语言模型日益增长的计算成本，但现有指定目标压缩率或长度的方法存在性能下降不可预测的问题，阻碍了其可靠部署。我们引入了一种面向性能的上下文压缩范式转变，即开发者指定可接受的性能下限而非压缩率。该范式采用轻量级性能预测器，在引导现成压缩器之前自动寻找满足此约束的最激进压缩率。我们设计并比较了两种预测器变体：一种是简单的上下文无关预测器，另一种是更复杂的上下文感知预测器，后者会考虑输入内容固有的可压缩性。在问答和摘要生成基准测试中，上下文感知预测器始终比上下文无关预测器实现更低的性能预测误差，而由此产生的上下文感知面向性能压缩方案获得了更优的整体性能。我们的工作为大型语言模型上下文压缩实现更可靠、高效且性能感知的部署铺平了道路。

Abstract

While context compression can mitigate the growing inference costs of Large Language Models (LLMs) by shortening contexts, existing methods that specify a target compression ratio or length suffer from unpredictable performance degradation, hindering their reliable deployment. We introduce a paradigm shift to Performance-oriented Context Compression (PoC), where developers specify an acceptable performance floor instead of a compression ratio. PoC employs a lightweight performance predictor to automatically find the most aggressive compression ratio that satisfies this constraint before steering an off-the-shelf compressor. We design and compare two predictor variants: a simple context-agnostic predictor and a more sophisticated context-aware one that considers the input’s inherent compressibility. On both question-answering and summarization benchmarks, the context-aware predictor consistently achieves lower performance prediction error than the context-agnostic predictor, while the resulting context-aware PoC attains a superior overall performance. Our work paves the way for a more reliable, efficient, and performance-aware deployment of context compression for LLMs.
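
The selection step described above, picking the most aggressive ratio whose predicted performance still meets the floor, can be sketched as a search over candidate ratios. The predictor here is a hard-coded lookup table standing in for the paper's learned predictor, and the sketch assumes a ratio defined as original length divided by compressed length, so larger means more aggressive. All names and values are hypothetical.

```python
def most_aggressive_ratio(predict, floor, candidates):
    """Return the largest candidate ratio with predicted performance >= floor.

    Returns None when no candidate satisfies the performance floor.
    """
    feasible = [ratio for ratio in candidates if predict(ratio) >= floor]
    return max(feasible) if feasible else None


# Stand-in predictor: made-up predicted scores for each compression ratio.
predicted_score = {2: 0.91, 4: 0.88, 8: 0.83, 16: 0.71}

ratio = most_aggressive_ratio(predicted_score.get, floor=0.85, candidates=[2, 4, 8, 16])
print(ratio)  # 4
```

The developer supplies only `floor`. The chosen ratio would then be handed to an off-the-shelf compressor, and a context-aware predictor would additionally take the input text as an argument.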

Keywords: Large Language Models, context compression, inference cost, performance prediction, compression ratio, question-answering, summarization, efficient deployment

10. ❌ LLM-Enhanced Semantic Data Integration of Electronic Component Qualifications in the Aerospace Domain

Authors: Antonio De Santis, Marco Balduini, Matteo Belcao, Andrea Proia, Marco Brambilla, Emanuele Della Valle Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20094v1

Score: 25.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	10.0/10	10.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	10.0/10	10.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

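Scoring rationale: The paper centers on applying LLMs to the integration and retrieval of electronic component qualification data in the aerospace domain, a domain-specific application of large models. The abstract explicitly mentions using LLMs to enhance retrieval and reduce manual data-cleansing effort, and compares against RAG approaches, so “Large Language Models” and “Retrieval-Augmented Generation” are both highly relevant (10 points). The paper involves a scientific application in aerospace, giving it some relevance to “AI for Science” (5 points). Other keywords such as MoE, SFT, RLHF, and quantization are neither mentioned nor implied in the abstract and are entirely unrelated (0 points).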

!!! tip deepseek-chat TL;DR

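To address the difficulty of retrieving electronic component qualification data caused by data silos in aerospace manufacturing, this paper proposes an integrated retrieval pipeline combining Virtual Knowledge Graphs and LLMs, and shows through a cost-benefit analysis that it outperforms a pure RAG approach in long-term efficiency.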

Abstract (Chinese translation)

大型制造企业因各部门维护的数据孤岛而面临信息检索挑战，导致数据库间存在不一致与错位问题。本文介绍了在卫星板卡设计中整合与检索电子元器件资质数据的实践经验。由于数据孤岛的存在，设计人员无法即时确定单个元器件的资质状态。然而，该流程在规划阶段至关重要——即生产前发布装配图纸时——以优化新资质认证并避免重复工作。为此，我们提出一种技术流程：利用虚拟知识图谱（Virtual Knowledge Graphs）实现异构数据源的统一视图，并采用大语言模型（LLMs）增强检索能力、减少数据清洗的人工投入。资质检索通过两种机制实现：基于本体论的数据访问（Ontology-based Data Access）方法用于结构化查询，以及向量搜索机制用于根据相似文本属性检索资质记录。我们进行了成本效益对比分析，证明所提出的流程在长期效率上也优于单纯依赖大语言模型的方法（如检索增强生成技术RAG）。

Abstract

Large manufacturing companies face challenges in information retrieval due to data silos maintained by different departments, leading to inconsistencies and misalignment across databases. This paper presents an experience in integrating and retrieving qualification data for electronic components used in satellite board design. Due to data silos, designers cannot immediately determine the qualification status of individual components. However, this process is critical during the planning phase, when assembly drawings are issued before production, to optimize new qualifications and avoid redundant efforts. To address this, we propose a pipeline that uses Virtual Knowledge Graphs for a unified view over heterogeneous data sources and LLMs to enhance retrieval and reduce manual effort in data cleansing. The retrieval of qualifications is then performed through an Ontology-based Data Access approach for structured queries and a vector search mechanism for retrieving qualifications based on similar textual properties. We perform a comparative cost-benefit analysis, demonstrating that the proposed pipeline also outperforms approaches relying solely on LLMs, such as Retrieval-Augmented Generation (RAG), in terms of long-term efficiency.

Keywords: Large Language Models, Retrieval-Augmented Generation, Data Integration, Electronic Components, Aerospace Domain, Virtual Knowledge Graphs, Ontology-based Data Access, Vector Search

11. ❌ Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision

Authors: Jiyeong Kim, Yerim So, Hyesong Choi, Uiwon Hwang, Dongbo Min Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19807v1

Score: 25.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	5.0/10	5.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	10.0/10	10.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	10.0/10	10.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10	0.0

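Scoring rationale: The paper proposes SeGroS, a fine-tuning framework designed to resolve granularity mismatch and supervisory redundancy in Unified Multimodal Models (UMMs); its core contribution is stronger cross-modal alignment. It is therefore highly relevant to “Post-training” OR “Supervised Fine-tuning” OR “SFT” (10 points), because SeGroS is a fine-tuning framework, and to “Instruction Tuning” OR “Alignment” OR “Value Alignment” (10 points), because the paper's central goal is to improve alignment. It has some relevance to “Large Language Models” OR “LLMs” OR “Foundation Models” (5 points), because UMMs are usually built on or extend large foundation models, although the paper does not focus explicitly on LLMs. Other keywords (such as MoE, Scaling Laws, and RLHF) are not mentioned in the abstract and score 0.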

!!! tip deepseek-chat TL;DR

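To address granularity mismatch and supervisory redundancy in Unified Multimodal Models (UMMs), this paper proposes SeGroS, a fine-tuning framework that strengthens supervision with semantic Visual Hints and a semantically grounded Corrupted Input, significantly improving generation fidelity and cross-modal alignment.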

Abstract (Chinese translation)

统一多模态模型（Unified Multimodal Models, UMMs）作为一种将多模态理解与生成整合于统一建模框架内的新兴范式，已展现出巨大潜力。然而，当前生成式训练范式存在固有的局限性。本文提出语义基础监督（Semantically-Grounded Supervision, SeGroS），这是一个旨在解决UMMs中粒度不匹配与监督冗余问题的微调框架。其核心在于，我们提出了一种新颖的视觉基础定位图，用以构建两种互补的监督信号。首先，我们设计了语义视觉提示，以弥补文本提示的稀疏性。其次，我们生成语义基础化的损坏输入，通过将重建损失限制在与文本对齐的核心区域，从而显式地增强基于掩码的UMMs的监督。在GenEval、DPGBench和CompBench上的广泛评估表明，SeGroS显著提升了多种UMM架构的生成保真度与跨模态对齐能力。

Abstract

Unified Multimodal Models (UMMs) have emerged as a promising paradigm that integrates multimodal understanding and generation within a unified modeling framework. However, current generative training paradigms suffer from inherent limitations. We present Semantically-Grounded Supervision (SeGroS), a fine-tuning framework designed to resolve the granularity mismatch and supervisory redundancy in UMMs. At its core, we propose a novel visual grounding map to construct two complementary supervision signals. First, we formulate semantic Visual Hints to compensate for the sparsity of text prompts. Second, we generate a semantically-grounded Corrupted Input to explicitly enhance the supervision of masking-based UMMs by restricting the reconstruction loss to core text-aligned regions. Extensive evaluations on GenEval, DPGBench, and CompBench demonstrate that SeGroS significantly improves generation fidelity and cross-modal alignment across various UMM architectures.
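
Restricting a reconstruction loss to selected regions amounts to averaging the per-position error only where a mask is set. The sketch below uses mean squared error over flat lists as a stand-in. The loss SeGroS actually uses and the way it derives the mask from its visual grounding map are not given in the abstract, so the function name and every value here are hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over positions where mask is truthy.

    Positions with a falsy mask value contribute nothing to the loss.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target, and mask must have the same length")
    selected = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0


# Hypothetical patch values; the mask keeps the first two "text-aligned" patches.
pred = [1.0, 2.0, 3.0]
target = [1.0, 0.0, 0.0]
mask = [1, 1, 0]
print(masked_mse(pred, target, mask))  # 2.0
```

The third patch has a large error, yet it is excluded, so only the regions flagged as text-aligned drive the training signal.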

Keywords: Unified Multimodal Models, Semantically-Grounded Supervision, fine-tuning framework, cross-modal alignment, visual grounding map, generation fidelity, masking-based UMMs, granularity mismatch

12. ❌ Structured Latent Dynamics in Wireless CSI via Homomorphic World Models

Authors: Salmane Naoumi, Mehdi Bennis, Marwa Chafii Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20048v1

Score: 15.0 / 26.6 ❌

Score breakdown

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	10.0/10	10.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0
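
The arithmetic behind this verdict can be sketched in a few lines. This is a minimal sketch, assuming each row's score is weight × relevance (as the rows above suggest) and using the pass formula from the report's configuration, 5.0 + 0.8 × total weight. The function names are illustrative and not taken from the report generator, and the per-keyword maximum from the configuration is not modeled.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass threshold from the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows):
    """Sum weight * relevance over (weight, relevance) rows (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows)


# 27 keywords at weight 1.0; only two have non-zero relevance for this paper.
rows = [(1.0, 0.0)] * 25 + [(1.0, 10.0), (1.0, 5.0)]

threshold = passing_score(total_weight=27.0)  # about 26.6
score = paper_score(rows)                     # 15.0
passed = score >= threshold                   # False: the paper is marked as failing
```

Under these assumptions a paper needs strong relevance on at least three keywords to pass, which is why a single 10-point match plus a 5-point match falls short.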

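Scoring rationale: The paper focuses on wireless communications and proposes a self-supervised learning framework based on world models for learning predictive, structured representations of wireless channel state information (CSI). It is therefore highly relevant to the keyword “World Models” AND “General World Models” (10 points), because the paper explicitly casts the problem as a world modeling task. The paper is also an application of AI to a scientific domain (wireless communications), so it is somewhat related to “AI for Science” OR “Bioinformatics” OR “Cheminformatics” (5 points), although it is neither bioinformatics nor cheminformatics. All other keywords are unrelated to the paper (0 points): they mainly concern large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, agents, and so on), whereas this paper studies wireless channel modeling and involves no language model or related technique.

!!! tip deepseek-chat TL;DR

The paper proposes a self-supervised learning framework based on world models and homomorphic updates for learning structured latent dynamics of wireless channel state information, and validates on the DICHASUS dataset that it outperforms baselines in preserving topology and predicting future embeddings.

Abstract (Chinese translation)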

我们提出一种自监督框架，通过将信道状态信息（CSI）的时间演化建模于紧凑的潜空间中，以学习无线信道的预测性与结构化表征。该方法将问题构建为世界建模任务，并利用联合嵌入预测架构（Joint Embedding Predictive Architecture, JEPA）从CSI轨迹中学习动作条件化的潜态动态。为提升几何一致性与组合性，我们采用源自李代数的同态更新来参数化状态转移，从而构建出能反映空间布局与用户移动规律的结构化潜空间。在DICHASUS数据集上的评估表明，本方法在保持拓扑结构与预测未见环境中未来嵌入向量方面均优于现有基线。所得潜空间能够生成度量精确的信道图谱，为移动感知调度、定位及无线场景理解等下游应用提供了可扩展的基础。

Abstract

We introduce a self-supervised framework for learning predictive and structured representations of wireless channels by modeling the temporal evolution of channel state information (CSI) in a compact latent space. Our method casts the problem as a world modeling task and leverages the Joint Embedding Predictive Architecture (JEPA) to learn action-conditioned latent dynamics from CSI trajectories. To promote geometric consistency and compositionality, we parameterize transitions using homomorphic updates derived from Lie algebra, yielding a structured latent space that reflects spatial layout and user motion. Evaluations on the DICHASUS dataset show that our approach outperforms strong baselines in preserving topology and forecasting future embeddings across unseen environments. The resulting latent space enables metrically faithful channel charts, offering a scalable foundation for downstream applications such as mobility-aware scheduling, localization, and wireless scene understanding.
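
To illustrate what a homomorphic update provides, the sketch below uses the simplest Lie-algebra case, so(2), where exponentiating the generator gives a 2D rotation. It assumes an action is a single scalar angle and the latent state is a 2D vector. Both are illustrative choices, since the abstract does not specify the paper's parameterization. The point is the composition property: applying two transitions in sequence equals applying one transition for the summed action.

```python
import math


def transition(theta: float):
    """exp(theta * J) for the so(2) generator J = [[0, -1], [1, 0]]: a 2x2 rotation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def apply(m, z):
    """Apply a 2x2 transition matrix to a 2D latent state."""
    return [m[0][0] * z[0] + m[0][1] * z[1], m[1][0] * z[0] + m[1][1] * z[1]]


z0 = [1.0, 0.0]  # toy latent state

# Two consecutive actions (0.3 then 0.4) ...
two_steps = apply(transition(0.4), apply(transition(0.3), z0))
# ... land on the same latent state as one combined action (0.7).
one_step = apply(transition(0.7), z0)

# The same holds at the level of the transition matrices themselves.
composed = matmul(transition(0.4), transition(0.3))
```

In this toy case the latent geometry mirrors the structure of the actions, which is the kind of consistency the abstract attributes to Lie-algebra-derived updates.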

Keywords: world models, wireless channel state information, latent dynamics, homomorphic updates, self-supervised learning, channel charts, JEPA, CSI trajectories

13. ❌ Evolving Embodied Intelligence: Graph Neural Network–Driven Co-Design of Morphology and Control in Soft Robotics

Authors: Jianqiang Wang, Shuaiqun Pan, Alvaro Serra-Gomez, Xiaohan Wei, Yue Xie Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19582v1

Score: 5.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10	0.0
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10	0.0
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10	0.0
“Scaling Laws” AND “Data Quality”	1.0	0.0/10	0.0
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10	0.0
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10	0.0
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10	0.0
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10	0.0
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10	0.0
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10	0.0
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10	0.0
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10	0.0
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10	0.0
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10	0.0
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10	0.0
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10	0.0
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10	0.0
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10	0.0
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10	0.0
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10	0.0
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10	0.0
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10	0.0
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10	0.0
“World Models” AND “General World Models”	1.0	0.0/10	0.0
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10	0.0
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10	0.0
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	5.0/10	5.0

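Scoring rationale: The paper focuses on the co-design of morphology and control in soft robotics, with graph neural networks (GNN/GAT) as its core method, and sits at the intersection of robotics and AI. The keywords all concern large language models (LLMs) and related techniques (training, alignment, reasoning, deployment optimization, and so on) or specific AI-for-science applications (such as bioinformatics). The paper involves no LLM, language model, or natural language processing technique, and does not clearly belong to a specific AI-for-science subfield such as bioinformatics or cheminformatics. Therefore only the last keyword (“AI for Science”) receives 5 points (somewhat related), because the paper is broadly an application of AI to science and engineering (robotics); every other keyword is scored 0 (completely unrelated).

!!! tip deepseek-chat TL;DR

The study addresses the problem that, in the co-design of morphology and control for soft robots, morphological evolution can disrupt learned control strategies. It proposes a morphology-aware policy based on graph neural networks that, on the benchmark, achieves higher final fitness and stronger adaptability to morphological changes than traditional methods.

Abstract (Chinese translation)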

机器人的智能行为并非仅源于控制系统，而是产生于身体与大脑的紧密耦合，这一原理被称为具身智能（embodied intelligence）。设计能够利用这种相互作用的软体机器人仍然是一项重大挑战，尤其是在形态与控制需要同步优化的情况下。这一协同设计过程中的主要障碍在于：形态的演变可能破坏已习得的控制策略，使得重用或调整现有知识变得困难。为此，我们开发了一种基于图神经网络（Graph Neural Network）的方法，用于形态与控制器的协同设计。每个机器人被表示为一个图，其中图注意力网络（Graph Attention Network, GAT）编码节点特征，并通过池化后的表征输入到一个多层感知机（Multilayer Perceptron, MLP）头部，以生成执行器指令或价值估计。在演化过程中，继承遵循拓扑一致性映射：共享的GAT层被重用，MLP隐藏层完整迁移，匹配的执行器输出被复制，而不匹配的输出则随机初始化并进行微调。这种形态感知的策略类别使得控制器能够在身体发生突变时进行适应。在基准测试中，与传统的仅使用MLP的协同设计方法相比，我们基于GAT的方法获得了更高的最终适应度，并对形态变化展现出更强的适应能力。这些结果表明，图结构策略为具身智能中不断演化的形态与控制之间提供了更有效的接口。

Abstract

The intelligent behavior of robots does not emerge solely from control systems, but from the tight coupling between body and brain, a principle known as embodied intelligence. Designing soft robots that leverage this interaction remains a significant challenge, particularly when morphology and control require simultaneous optimization. A significant obstacle in this co-design process is that morphological evolution can disrupt learned control strategies, making it difficult to reuse or adapt existing knowledge. We address this by developing a Graph Neural Network-based approach for the co-design of morphology and controller. Each robot is represented as a graph, with a graph attention network (GAT) encoding node features and a pooled representation passed through a multilayer perceptron (MLP) head to produce actuator commands or value estimates. During evolution, inheritance follows a topology-consistent mapping: shared GAT layers are reused, MLP hidden layers are transferred intact, matched actuator outputs are copied, and unmatched ones are randomly initialized and fine-tuned. This morphology-aware policy class lets the controller adapt when the body mutates. On the benchmark, our GAT-based approach achieves higher final fitness and stronger adaptability to morphological variations compared to traditional MLP-only co-design methods. These results indicate that graph-structured policies provide a more effective interface between evolving morphologies and control for embodied intelligence.
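
The inheritance rule for the actuator output layer can be sketched as follows. This is a minimal illustration, assuming actuators are matched by a shared identifier and that each output is a row of weights over the MLP's last hidden layer. The function name, the identifier scheme, and the Gaussian initialization are hypothetical, and the subsequent fine-tuning step is not shown.

```python
import random


def inherit_output_head(parent_rows, child_actuators, hidden_dim, rng=None):
    """Build a child's actuator output rows from its parent's.

    parent_rows: dict mapping actuator id -> list of `hidden_dim` weights.
    child_actuators: actuator ids present in the mutated morphology.
    """
    rng = rng or random.Random(0)
    child_rows = {}
    for actuator in child_actuators:
        if actuator in parent_rows:
            # Matched actuator: copy the parent's output weights.
            child_rows[actuator] = list(parent_rows[actuator])
        else:
            # Unmatched actuator: random initialization, to be fine-tuned later.
            child_rows[actuator] = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
    return child_rows


parent = {"leg_front": [0.5, -0.2], "leg_back": [0.1, 0.3]}
# The mutation removes "leg_back" and adds "tail".
child = inherit_output_head(parent, ["leg_front", "tail"], hidden_dim=2)
```

Only the output rows depend on the body plan here; the shared GAT layers and MLP hidden layers described in the abstract would be carried over unchanged.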

Keywords: Embodied Intelligence, Soft Robotics, Co-design, Morphology and Control, Graph Neural Network, Graph Attention Network, Evolutionary Algorithm, Policy Adaptation

14. ❌ VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Authors: Jingyang Lin, Jialian Wu, Jiang Liu, Ximeng Sun, Ze Wang, Xiaodong Yu, Jiebo Luo, Zicheng Liu, Emad Barsoum Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20185v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

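Scoring rationale: VideoSeek proposes a video agent built on a large language model (GPT-5) that reduces the computational cost of parsing video frames through tool-guided active seeking. Its core touches the keywords LLM Agents, Tool Use, Chain of Thought, and System 2 Thinking, to which it is highly relevant (10 points). Other keywords, such as MoE, quantization, and RAG, are not covered by the paper and receive 0 points.

!!! tip deepseek-chat TL;DR

The paper proposes VideoSeek, a long-horizon video agent that uses tool-guided active seeking. By parsing far fewer frames it markedly improves the efficiency of video understanding, achieving high accuracy on multiple benchmarks at a much lower computational cost.

Abstract (Chinese translation)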

视频智能体模型在具有挑战性的视频-语言任务中取得了进展。然而，大多数智能体方法仍然严重依赖对密集采样视频帧的贪婪解析，导致计算成本高昂。我们提出了VideoSeek，一种长视野视频智能体，它利用视频逻辑流主动寻找答案关键证据，而非穷举解析整个视频。这一洞见使得模型能够使用少得多的视频帧，同时保持甚至提升其视频理解能力。VideoSeek在一个“思考-行动-观察”循环中运行，并配备了一个精心设计的工具包，用于收集多粒度视频观察。这种设计支持基于累积观察进行查询感知的探索，并实现实用的视频理解与推理。在四个具有挑战性的视频理解与推理基准测试上的实验表明，VideoSeek在仅使用远少于先前视频智能体和独立大型多模态模型（LMMs）帧数的同时，实现了强大的准确性。值得注意的是，VideoSeek在LVBench基准上相比其基础模型GPT-5取得了10.2个百分点的绝对提升，同时使用的帧数减少了93%。进一步的分析凸显了利用视频逻辑流、强大推理能力以及工具包设计互补作用的重要性。

Abstract

Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still heavily rely on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability. VideoSeek operates in a think-act-observe loop with a well-designed toolkit for collecting multi-granular video observations. This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning. Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs. Notably, VideoSeek achieves a 10.2 absolute points improvement on LVBench over its base model, GPT-5, while using 93% fewer frames. Further analysis highlights the significance of leveraging video logic flow, strong reasoning capability, and the complementary roles of toolkit design.
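
The think-act-observe idea can be illustrated with a toy loop. This is only a sketch under strong assumptions: the video is a list of frame labels, the two tools (`overview` and `inspect`) are hypothetical stand-ins for a multi-granular toolkit, and a hand-written cue heuristic replaces the language model's reasoning. It is not the paper's implementation.

```python
def overview(frames, stride):
    """Hypothetical coarse tool: sample every `stride`-th frame."""
    return [(i, frames[i]) for i in range(0, len(frames), stride)]


def inspect(frames, start, end):
    """Hypothetical fine tool: read every frame in [start, end)."""
    return [(i, frames[i]) for i in range(start, min(end, len(frames)))]


def seek(frames, target, cue, stride=10):
    """Return (index of target or None, number of frames read)."""
    observations = overview(frames, stride)       # act + observe at coarse granularity
    frames_read = len(observations)
    for index, label in observations:
        if label == target:                       # think: is the answer already visible?
            return index, frames_read
    for index, label in observations:
        if label != cue:                          # think: skip segments without the cue
            continue
        segment = inspect(frames, index + 1, index + stride)  # act: zoom into the segment
        frames_read += len(segment)               # observe
        for i, segment_label in segment:
            if segment_label == target:
                return i, frames_read
    return None, frames_read


video = ["street"] * 40 + ["kitchen"] * 25 + ["knife falls"] + ["kitchen"] * 34
result = seek(video, target="knife falls", cue="kitchen")
```

On this 100-frame toy video the loop reads 37 frames before locating the event at index 65, because the "street" segments are never inspected in detail.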

Keywords: Video Agent, Long-Horizon Video, Tool-Guided Seeking, Video Logic Flow, Think-Act-Observe Loop, Video Understanding, Reasoning Capability, Computational Efficiency

15. ❌ LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

Authors: Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, Yong Liu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20192v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

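Scoring rationale: The paper studies personalized video generation. Its core innovation is using multimodal large language models (MLLMs) to infer and assign subject-specific dependencies and to build relational priors that strengthen control over generation. It is therefore highly relevant only to the keyword 'Large Language Models OR LLMs OR Foundation Models' (8 points), since MLLMs are an extension of LLMs. The large-model techniques behind the other keywords (MoE, scaling laws, training methods, inference optimization, agent systems, and so on) and the domain-specific scientific applications (such as bioinformatics) are neither mentioned in the paper nor central to it, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper addresses inconsistent subject-attribute alignment in existing personalized video generation methods. It proposes the LumosX framework, which uses multimodal large language models to build relational priors together with new attention mechanisms, achieving fine-grained, identity-consistent, and semantically aligned multi-subject video generation with state-of-the-art benchmark performance.

Abstract (Chinese translation)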

扩散模型的最新进展显著提升了文本到视频生成的质量，使得能够以前景和背景元素的细粒度控制进行个性化内容创作。然而，跨主体的精确人脸属性对齐仍然具有挑战性，因为现有方法缺乏确保组内一致性的显式机制。解决这一差距需要显式的建模策略和人脸属性感知的数据资源。为此，我们提出了LumosX框架，该框架在数据和模型设计两方面均有所推进。在数据方面，一个定制的收集流程从独立视频中编排字幕和视觉线索，同时多模态大语言模型推断并分配主体特定的依赖关系。这些提取的关系先验施加了更细粒度的结构，增强了个性化视频生成的表达控制，并支持构建一个全面的基准。在建模方面，关系自注意力与关系交叉注意力将位置感知嵌入与精细化的注意力动态交织在一起，以刻写显式的主体-属性依赖关系，从而强制实现有纪律的组内凝聚力，并放大不同主体集群之间的分离。在我们构建的基准上进行全面评估表明，LumosX在细粒度、身份一致且语义对齐的个性化多主体视频生成方面实现了最先进的性能。代码和模型可在 https://jiazheng-xing.github.io/lumosx-home/ 获取。

Abstract

Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face-attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject-attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation. Code and models are available at https://jiazheng-xing.github.io/lumosx-home/.

Keywords: personalized video generation, multimodal large language models, subject-attribute alignment, relational priors, diffusion models, fine-grained control, identity consistency, multi-subject generation

16. ❌ From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

Authors: Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Jing-Hao Xue, Hao Li, Salman Khan, Zhiqiang Shen Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20193v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

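Scoring rationale: The paper focuses on image tampering detection for vision-language models (VLMs) and proposes a new taxonomy, benchmark, and evaluation metrics. Although VLMs are involved, the scoring keywords all target specific aspects of large language models (LLMs): technical principles, training methods, inference optimization, alignment, compression, and applications. The paper's core is a computer-vision tampering-detection task (pixel-level localization, semantic classification, and language description generation), and it covers no technical detail of LLM architecture, training, inference optimization, alignment, or compression. All keywords are therefore completely unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

Existing mask-based benchmarks for VLM image tampering detection do not match the true edit signal. The paper proposes a new taxonomy, benchmark, and evaluation metrics that move from masks to pixels and semantics, giving a unified evaluation framework for pixel-level tamper localization, semantic classification, and language description.

Abstract (Chinese translation)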

现有篡改检测基准主要依赖物体掩码，这与真实编辑信号存在严重偏差：掩码内的许多像素未被修改或仅发生轻微改动，而掩码外细微却具有实质影响的编辑却被视为自然图像。我们将视觉语言模型（VLM）图像篡改检测任务重新定义为从粗粒度区域标注转向像素级锚定、语义与语言感知的任务。首先，我们提出了一套涵盖编辑基元（替换/移除/拼接/修复/属性修改/色彩调整等）及其篡改对象语义类别的分类体系，将底层视觉变化与高层语义理解相连接。其次，我们发布了包含逐像素篡改标注图及配对类别监督的新基准数据集，用于在统一协议下评估检测与分类性能。第三，我们提出了一个训练框架和评估指标，通过定位置信度或真实编辑强度的预测来量化像素级准确性，并进一步通过语义感知分类和针对预测区域的自然语言描述来衡量篡改语义理解能力。我们还在当前先进的篡改检测器上重新评估了现有强分割/定位基线方法，发现仅使用掩码指标会导致显著的高估或低估评分，同时揭示了其在微观编辑和掩码外修改上的失效模式。我们的框架推动该领域从掩码评估转向像素、语义及语言描述层面，为篡改定位、语义分类与描述建立了严谨标准。代码与基准数据详见 https://github.com/VILA-Lab/PIXAR。

Abstract

Existing tampering detection benchmarks largely rely on object masks, which severely misalign with the true edit signal: many pixels inside a mask are untouched or only trivially modified, while subtle yet consequential edits outside the mask are treated as natural. We reformulate VLM image tampering from coarse region labels to a pixel-grounded, meaning and language-aware task. First, we introduce a taxonomy spanning edit primitives (replace/remove/splice/inpaint/attribute/colorization, etc.) and their semantic class of tampered object, linking low-level changes to high-level understanding. Second, we release a new benchmark with per-pixel tamper maps and paired category supervision to evaluate detection and classification within a unified protocol. Third, we propose a training framework and evaluation metrics that quantify pixel-level correctness with localization to assess confidence or prediction on true edit intensity, and further measure tamper meaning understanding via semantics-aware classification and natural language descriptions for the predicted regions. We also re-evaluate the existing strong segmentation/localization baselines on recent strong tamper detectors and reveal substantial over- and under-scoring using mask-only metrics, and expose failure modes on micro-edits and off-mask changes. Our framework advances the field from masks to pixels, meanings and language descriptions, establishing a rigorous standard for tamper localization, semantic classification and description. Code and benchmark data are available at https://github.com/VILA-Lab/PIXAR.
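
The gap between mask-based and pixel-based scoring can be shown on a tiny one-dimensional "image". This is a minimal sketch, assuming the per-pixel tamper map is obtained by differencing the original and edited pixels and using plain IoU as a stand-in metric. The abstract does not specify the benchmark's actual metrics, so the names and the thresholding rule here are illustrative.

```python
def iou(prediction, truth):
    """Intersection over union of two binary maps given as flat lists."""
    intersection = sum(1 for p, t in zip(prediction, truth) if p and t)
    union = sum(1 for p, t in zip(prediction, truth) if p or t)
    return intersection / union if union else 1.0


def tamper_map(original, edited, threshold=0):
    """Per-pixel tamper map: 1 where the absolute change exceeds `threshold`."""
    return [1 if abs(a - b) > threshold else 0 for a, b in zip(original, edited)]


original = [10, 10, 10, 10, 10, 10, 10, 10]
edited = [10, 10, 90, 90, 10, 10, 10, 60]   # two in-mask edits and one off-mask edit
object_mask = [0, 1, 1, 1, 1, 0, 0, 0]      # coarse object-mask label

pixel_truth = tamper_map(original, edited)
prediction = [0, 0, 1, 1, 0, 0, 0, 1]       # a detector that finds exactly the edited pixels

mask_iou = iou(prediction, object_mask)     # penalized by the mask's untouched pixels
pixel_iou = iou(prediction, pixel_truth)    # credited for the off-mask edit as well
```

Here a detector that is exactly right at the pixel level scores only 0.4 against the object mask, which is the kind of under-scoring the abstract describes.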

Keywords: VLM image tampering, pixel-grounded detection, tamper localization, semantic classification, benchmark, evaluation metrics, edit primitives, language-aware task

Authors: Jianan Huang, Rodolfo V. Valentim, Luca Vassio, Matteo Boffa, Marco Mellia, Idilio Drago, Dario Rossi Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20181v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

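Scoring rationale: The paper studies multi-modal contrastive learning for cybersecurity, aiming to improve the generalization of network payload classification by transferring knowledge from the text modality. Although the paper mentions using an LLM to generate synthetic payloads, this serves only as a data-generation tool and is not the core of the research. The core technique is a multi-modal contrastive learning framework, which has no direct connection to the topics in the keyword list (large-model principles, training methods, inference optimization, alignment techniques, agent systems, and so on). All keywords are unrelated to the paper's research, so all are scored 0.

!!! tip deepseek-chat TL;DR

The paper targets the poor generalization of machine learning models on cybersecurity tasks. It proposes a two-stage multi-modal contrastive learning framework that transfers knowledge from textual vulnerability descriptions to guide network payload classification; experiments show that the method reduces shortcut learning and improves model performance.

Abstract (Chinese translation)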

机器学习在网络安全领域的应用长期受限于泛化问题：在受控场景下表现良好的模型，在实际部署中往往无法保持性能。其根本原因通常在于机器学习算法习得的是表层模式（捷径），而非底层的网络安全概念。本文研究对比多模态学习，以此作为提升机器学习在网络安全任务中性能的初步尝试。我们的目标是将知识从数据丰富的模态（如文本）迁移到数据稀缺的模态（如网络载荷）。我们以威胁分类为案例，提出一个两阶段多模态对比学习框架，该框架利用文本形式的漏洞描述来指导载荷分类。首先，我们通过对描述进行对比学习，构建一个语义上有意义的嵌入空间。随后，我们将载荷对齐至此空间，从而实现从文本到载荷的知识迁移。我们在一个大规模私有数据集和一个基于公开CVE描述与LLM生成载荷构建的合成基准上评估了该方法。实验表明，该方法在两个基准测试上均减少了基线模型中的捷径学习现象。我们将此合成基准与源代码作为开源项目发布。

Abstract

The use of ML in cybersecurity has long been impaired by generalization issues: Models that work well in controlled scenarios fail to maintain performance in production. The root cause often lies in ML algorithms learning superficial patterns (shortcuts) rather than underlying cybersecurity concepts. We investigate contrastive multi-modal learning as a first step towards improving ML performance in cybersecurity tasks. We aim at transferring knowledge from data-rich modalities, such as text, to data-scarce modalities, such as payloads. We set up a case study on threat classification and propose a two-stage multi-modal contrastive learning framework that uses textual vulnerability descriptions to guide payload classification. First, we construct a semantically meaningful embedding space using contrastive learning on descriptions. Then, we align payloads to this space, transferring knowledge from text to payloads. We evaluate the approach on a large-scale private dataset and a synthetic benchmark built from public CVE descriptions and LLM-generated payloads. The methodology appears to reduce shortcut learning over baselines on both benchmarks. We release our synthetic benchmark and source code as open source.
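
The second stage, aligning payloads to the description embedding space, can be sketched with a contrastive loss. The abstract does not name the loss, so this assumes InfoNCE, a common choice for this kind of alignment. The embeddings are hand-written toy vectors, the temperature is arbitrary, and `info_nce` is an illustrative name.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(payloads, descriptions, temperature=0.1):
    """Mean InfoNCE loss; payloads[i] and descriptions[i] form the positive pair."""
    p = [normalize(v) for v in payloads]
    d = [normalize(v) for v in descriptions]
    total = 0.0
    for i, payload in enumerate(p):
        logits = [dot(payload, description) / temperature for description in d]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(p)


descriptions = [[1.0, 0.0], [0.0, 1.0]]   # frozen stage-one text embeddings (toy values)
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], descriptions)
misaligned = info_nce([[0.0, 1.0], [1.0, 0.0]], descriptions)
```

The loss is near zero when each payload embedding sits next to its own description and large when the pairs are swapped, so minimizing it pulls payloads toward the text-defined space.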

Keywords: cybersecurity, multi-modal learning, contrastive learning, generalization, threat classification, payload classification, knowledge transfer, shortcut learning

18. ❌ Adaptive Greedy Frame Selection for Long Video Understanding

Authors: Yuning Huang, Fengqing Zhu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20180v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

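Scoring rationale: The paper studies frame selection for long video understanding. Although large vision-language models (VLMs) are involved, the scoring keywords all specifically target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, and so on), whereas the paper's core is a video frame selection algorithm (greedy optimization over SigLIP and DINOv2 embeddings), not an LLM technique or application. The paper covers none of the specific techniques in the keywords, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

In long-video question answering, too many input frames create an inference bottleneck. The paper proposes a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget; experiments on the MLVU dataset show higher accuracy than uniform sampling and existing baselines.

Abstract (Chinese translation)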

大型视觉-语言模型（VLMs）正日益应用于长视频问答任务，但其推理过程常受限于输入帧数及由此产生的视觉标记数量。简单的稀疏采样可能遗漏关键瞬间，而纯粹基于相关性的选择则常陷入近重复帧的聚集，并牺牲对时间上分散证据的覆盖。本文提出一种问题自适应的贪婪帧选择方法，在固定帧数预算下联合优化查询相关性与语义代表性。我们的方法构建了一个精确时间戳对齐的1 FPS候选帧池（上限为1000帧），在两个互补空间中对候选帧进行嵌入（使用SigLIP编码问题相关性，DINOv2编码语义相似性），并通过贪婪最大化模块化相关性项与设施选址覆盖项的加权和来选择帧。该目标函数具有归一化、单调且次模的特性，从而获得标准的（1-1/e）贪婪近似保证。为平衡不同问题对相关性与覆盖度的差异化需求，我们引入了四种预设策略，并设计了一个轻量级的纯文本问题类型分类器，将每个查询路由至表现最佳的预设策略。在MLVU数据集上的实验表明，相较于均匀采样及近期强基线方法，本方法在不同帧数预算下均取得一致的准确率提升，且在严格预算条件下改进最为显著。

Abstract

Large vision–language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is normalized, monotone, and submodular, yielding a standard (1-1/e) greedy approximation guarantee. To account for question-dependent trade-offs between relevance and coverage, we introduce four preset strategies and a lightweight text-only question-type classifier that routes each query to its best-performing preset. Experiments on MLVU show consistent accuracy gains over uniform sampling and a strong recent baseline across frame budgets, with the largest improvements under tight budgets.
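
The selection rule described above can be sketched as a plain greedy loop. This is a minimal sketch, assuming the weighted sum takes the form λ · (sum of relevance) + (1 − λ) · (facility-location coverage) and that all similarities are non-negative, which keeps the objective monotone. The relevance and similarity values are hand-written toy numbers rather than SigLIP or DINOv2 outputs, and the function names are illustrative.

```python
def objective(selected, relevance, similarity, lam):
    """lam * modular relevance + (1 - lam) * facility-location coverage; 0 for the empty set."""
    if not selected:
        return 0.0
    relevance_term = sum(relevance[i] for i in selected)
    coverage_term = sum(
        max(similarity[j][i] for i in selected) for j in range(len(relevance))
    )
    return lam * relevance_term + (1 - lam) * coverage_term


def greedy_select(relevance, similarity, budget, lam=0.5):
    """Pick up to `budget` frames, each time adding the one with the largest marginal gain."""
    selected = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < budget:
        base = objective(selected, relevance, similarity, lam)
        best = max(
            sorted(remaining),
            key=lambda i: objective(selected + [i], relevance, similarity, lam) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Frames 0 and 1 are near-duplicates and both highly relevant; frame 2 is distinct.
relevance = [0.9, 0.85, 0.5, 0.1]
similarity = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]
chosen = greedy_select(relevance, similarity, budget=2)
```

With a budget of two, ranking by relevance alone would pick the near-duplicate frames 0 and 1, while the coverage term steers the second pick to frame 2.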

Keywords: long video understanding, frame selection, vision-language models, greedy optimization, query relevance, semantic representativeness, MLVU benchmark

19. ❌ AI Agents Can Already Autonomously Perform Experimental High Energy Physics

Authors: Eric A. Moreno, Samuel Bright-Thonney, Andrzej Novak, Dolores Garcia, Philip Harris Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20179v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

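Scoring rationale: The paper's core topic is the autonomous use of AI agents driven by large language models (LLMs) in experimental high energy physics (HEP), which falls under AI for Science, so the related keywords (LLMs, LLM Agents, AI for Science) receive the top relevance of 10. The paper also integrates literature-based knowledge retrieval (related to RAG) and multi-agent review (related to Multi-agent Systems), so these keywords receive a moderate 5. Other keywords such as MoE, SLMs, Scaling Laws, training methods, inference optimization, and model compression are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper studies how LLM-based AI agents can autonomously carry out high energy physics analysis pipelines. Its proposed JFC framework automates the full workflow from event selection to paper drafting and is validated on ALEPH, DELPHI, and CMS datasets.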

Abstract

Large language model-based AI agents are now able to autonomously execute substantial portions of a high energy physics (HEP) analysis pipeline with minimal expert-curated input. Given access to a HEP dataset, an execution framework, and a corpus of prior experimental literature, we find that Claude Code succeeds in automating all stages of a typical analysis: event selection, background estimation, uncertainty quantification, statistical inference, and paper drafting. We argue that the experimental HEP community is underestimating the current capabilities of these systems, and that most proposed agentic workflows are too narrowly scoped or scaffolded to specific analysis structures. We present a proof-of-concept framework, Just Furnish Context (JFC), that integrates autonomous analysis agents with literature-based knowledge retrieval and multi-agent review, and show that this is sufficient to plan, execute, and document a credible high energy physics analysis. We demonstrate this by conducting analyses on open data from ALEPH, DELPHI, and CMS to perform electroweak, QCD, and Higgs boson measurements. Rather than replacing physicists, these tools promise to offload the repetitive technical burden of analysis code development, freeing researchers to focus on physics insight, truly novel method development, and rigorous validation. Given these developments, we advocate for new strategies for how the community trains students, organizes analysis efforts, and allocates human expertise.

Keywords: AI agents, large language models, autonomous analysis, high energy physics, experimental physics, multi-agent systems, knowledge retrieval, JFC framework

20. ❌ Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

Authors: Ruxiao Chen, Xilei Zhao, Thomas J. Cova, Frank A. Drews, Susu Xu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20170v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

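Scoring rationale: The paper's core topic is applying LLMs to Theory-of-Mind (ToM) reasoning; it proposes a dynamic belief graph model to strengthen LLM reasoning. Highly relevant keywords: LLMs (the paper explicitly uses LLMs for ToM reasoning), Chain of Thought / System 2 Thinking (it involves multi-step, in-depth reasoning), LLM Agents (it builds an LLM-based cognitive agent), and Explainable AI (the model yields interpretable belief trajectories). Other keywords such as MoE, SFT, and RAG are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the inconsistent beliefs and weak reasoning that large language models show in Theory-of-Mind reasoning under dynamic uncertainty. It proposes a structured cognitive trajectory model based on dynamic belief graphs that significantly improves action prediction accuracy and recovers interpretable belief trajectories consistent with human reasoning.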

Abstract

Theory of Mind (ToM) reasoning with Large Language Models (LLMs) requires inferring how people’s implicit, evolving beliefs shape what they seek and how they act under uncertainty – especially in high-stakes settings such as disaster response, emergency medicine, and human-in-the-loop autonomy. Prior approaches either prompt LLMs directly or use latent-state models that treat beliefs as static and independent, often producing incoherent mental models over time and weak reasoning in dynamic contexts. We introduce a structured cognitive trajectory model for LLM-based ToM that represents mental state as a dynamic belief graph, jointly inferring latent beliefs, learning their time-varying dependencies, and linking belief evolution to information seeking and decisions. Our model contributes (i) a novel projection from textualized probabilistic statements to consistent probabilistic graphical model updates, (ii) an energy-based factor graph representation of belief interdependencies, and (iii) an ELBO-based objective that captures belief accumulation and delayed decisions. Across multiple real-world disaster evacuation datasets, our model significantly improves action prediction and recovers interpretable belief trajectories consistent with human reasoning, providing a principled module for augmenting LLMs with ToM in high-uncertainty environment. https://anonymous.4open.science/r/ICML_submission-6373/

Keywords: Theory of Mind, Large Language Models, Dynamic Belief Graphs, Cognitive Trajectory Model, Probabilistic Graphical Models, Action Prediction, Interpretable Reasoning, High-uncertainty Environments
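
To make the idea of an energy-based factor graph over interdependent beliefs more concrete, here is a minimal Python sketch. It is not the authors' model: the two binary beliefs (`fire_near`, `should_evacuate`), the hand-picked energies, and the simple disagreement penalty are all assumptions made for illustration, and the paper's learned time-varying dependencies and ELBO-based objective are not reproduced.

```python
import itertools
import math

def belief_marginals(names, unary, coupling):
    """Exact P(belief = 1) under p(b) proportional to exp(-E(b)), by enumeration.

    E(b) = sum_i unary[i] * b_i + sum_(i,j) coupling[i, j] * [b_i != b_j]
    """
    index = {name: k for k, name in enumerate(names)}

    def energy(state):
        total = sum(unary[name] * state[index[name]] for name in names)
        for (a, b), weight in coupling.items():
            if state[index[a]] != state[index[b]]:
                total += weight  # penalty when two linked beliefs disagree
        return total

    states = list(itertools.product([0, 1], repeat=len(names)))
    weights = [math.exp(-energy(s)) for s in states]
    z = sum(weights)
    return {
        name: sum(w for s, w in zip(states, weights) if s[index[name]] == 1) / z
        for name in names
    }

# Hypothetical example: strong evidence for "fire_near", no direct evidence on evacuation.
names = ["fire_near", "should_evacuate"]
unary = {"fire_near": -2.0, "should_evacuate": 0.0}  # negative energy favours belief = 1
linked = belief_marginals(names, unary, {("fire_near", "should_evacuate"): 1.5})
unlinked = belief_marginals(names, unary, {})

# With the coupling, evidence about the fire also raises the evacuation belief.
print(linked["should_evacuate"] > unlinked["should_evacuate"])
```

Brute-force enumeration only works for a handful of beliefs; it is used here because it keeps the dependence between beliefs easy to inspect.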

Authors: Jiyu Lim, Youngwoo Yoon, Kwanghyun Park Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20164v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	8.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

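Scoring rationale: The paper studies an autonomous framework for optimizing robot social behavior, using a vision-language model (VLM) as a social critic for self-evaluation and replanning. It is unrelated to most keywords because it does not address core large-model topics such as technical principles, training methods, inference optimization, or model compression. Only two keywords are relevant: (1) “Self-Correction OR Self-Improvement OR Self-Reflection” (10): the core of the paper is a robot that critiques itself through VLM evaluation and iteratively refines its behavior, which falls under self-improvement; (2) “LLM Agents OR Autonomous Agents OR Agentic Workflow” (8): the paper studies an autonomous robot agent framework, but it uses a VLM rather than an LLM and focuses on behavior planning rather than general agentic workflows.

!!! tip deepseek-chat TL;DR

The study proposes CRISP, a framework that lets robots autonomously evaluate and refine their social behavior through a vision-language model, significantly improving behavioral naturalness and situational appropriateness across multiple robot platforms and scenarios.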

Abstract

Conventional robot social behavior generation has been limited in flexibility and autonomy, relying on predefined motions or human feedback. This study proposes CRISP (Critique-and-Replan for Interactive Social Presence), an autonomous framework where a robot critiques and replans its own actions by leveraging a Vision-Language Model (VLM) as a ‘human-like social critic.’ CRISP integrates (1) extraction of movable joints and constraints by analyzing the robot’s description file (e.g., MJCF), (2) generation of step-by-step behavior plans based on situational context, (3) generation of low-level joint control code by referencing visual information (joint range-of-motion visualizations), (4) VLM-based evaluation of social appropriateness and naturalness, including pinpointing erroneous steps, and (5) iterative refinement of behaviors through reward-based search. This approach is not tied to a specific robot API; it can generate subtly different, human-like motions on various platforms using only the robot’s structure file. In a user study involving five different robot types and 20 scenarios, including mobile manipulators and humanoids, our proposed method achieved significantly higher preference and situational appropriateness ratings compared to previous methods. This research presents a general framework that minimizes human intervention while expanding the robot’s autonomous interaction capabilities and cross-platform applicability. Detailed result videos and supplementary information regarding this work are available at: https://limjiyu99.github.io/inner-critic/

Keywords: robot social behavior, vision-language model, autonomous framework, self-critique, behavior replanning, cross-platform applicability, iterative refinement, human-like motions
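
The critique-and-replan cycle described in steps (4) and (5) of the abstract can be sketched roughly as the loop below. This is an illustration under stated assumptions, not CRISP itself: `critic` and `replan` are hypothetical callables standing in for the VLM critic and the planner, and a greedy keep-the-best rule stands in for the paper's reward-based search, whose details the abstract does not give.

```python
def critique_and_replan(initial_plan, critic, replan, max_rounds=5, good_enough=0.9):
    """Greedy refinement: keep a candidate plan only if the critic scores it higher.

    critic(plan) -> (score in [0, 1], index of first bad step or None)
    replan(plan, bad_step) -> new plan with that step revised
    """
    best_plan = initial_plan
    best_score, bad_step = critic(best_plan)
    for _ in range(max_rounds):
        if best_score >= good_enough or bad_step is None:
            break
        candidate = replan(best_plan, bad_step)
        score, candidate_bad_step = critic(candidate)
        if score > best_score:
            best_plan, best_score, bad_step = candidate, score, candidate_bad_step
    return best_plan, best_score

# Toy stand-ins (hypothetical): the "critic" dislikes any step marked abrupt.
def toy_critic(plan):
    bad = [i for i, step in enumerate(plan) if "abrupt" in step]
    return 1.0 - len(bad) / len(plan), (bad[0] if bad else None)

def toy_replan(plan, bad_step):
    fixed = list(plan)
    fixed[bad_step] = fixed[bad_step].replace("abrupt", "smooth")
    return fixed

plan = ["abrupt turn toward user", "raise arm", "abrupt wave"]
final_plan, final_score = critique_and_replan(plan, toy_critic, toy_replan)
print(final_plan, final_score)
```

Having the critic return the index of the first bad step mirrors the abstract's point about pinpointing erroneous steps, so the replanner can revise one step instead of regenerating the whole plan.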

22. ❌ Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Authors: Richard J. Young Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20172v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

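Scoring rationale: The paper's core topic is the evaluation of Chain-of-Thought (CoT) faithfulness in LLMs, which directly involves “Chain of Thought OR CoT Reasoning OR Multi-step Reasoning” (10) and “Large Language Models OR LLMs OR Foundation Models” (10). It examines how differences between evaluation methods affect faithfulness measurements, which relates to “Mechanistic Interpretability OR Explainable AI” (8) and “Hallucination Mitigation OR Factuality OR Truthfulness” (8), because faithfulness evaluation concerns the truthfulness and interpretability of model outputs. The paper mentions self-reflection-style evaluation, which is indirectly related to “Self-Correction OR Self-Improvement OR Self-Reflection” (5), and it involves the evaluation of in-depth reasoning, which relates to “System 2 Thinking OR Slow Thinking OR In-depth Reasoning” (5). Other keywords such as MoE, SLMs, training techniques, inference acceleration, and AI for Science are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper finds that measured chain-of-thought (CoT) faithfulness in large language models (LLMs) depends heavily on the classifier used: different methods yield significantly different measurements and change model rankings, so no single evaluation metric can objectively measure faithfulness.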

Abstract

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar’s test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen’s kappa ranges from 0.06 (“slight”) for sycophancy hints to 0.42 (“moderate”) for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.

Keywords: Chain-of-Thought, Faithfulness Evaluation, Large Language Models, Classifier Sensitivity, Evaluation Methodology, Model Ranking, Sycophancy, Epistemic Dependence
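
The abstract above leans on two agreement statistics, Cohen's kappa and McNemar's test. The short Python sketch below shows how each is computed from a 2x2 table of paired binary labels. It is a generic textbook implementation, not the paper's evaluation code. The only figures taken from the abstract are the discordant counts for sycophancy hints (883 and 2), and applying the McNemar statistic to that particular pair is an illustration, since the abstract reports the test for per-model gaps. The kappa examples use toy tables because the abstract does not give the concordant counts.

```python
def cohens_kappa(both_yes, only_a, only_b, both_no):
    """Cohen's kappa for two binary raters, from the four cells of a 2x2 table."""
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + only_a) / n
    b_yes = (both_yes + only_b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # chance agreement
    return (observed - expected) / (1 - expected)

def mcnemar_statistic(only_a, only_b):
    """McNemar chi-square statistic (no continuity correction) on discordant pairs."""
    return (only_a - only_b) ** 2 / (only_a + only_b)

# Discordant counts quoted in the abstract for sycophancy hints:
# 883 traces faithful under the pipeline only, 2 faithful under the Sonnet judge only.
print(round(mcnemar_statistic(883, 2), 2))  # 877.02

# Toy tables (not from the paper) to show the scale of kappa.
print(cohens_kappa(5, 0, 0, 5))      # perfect agreement -> 1.0
print(cohens_kappa(25, 25, 25, 25))  # agreement at chance level -> 0.0
```

A statistic this lopsided reflects the asymmetry the abstract describes: almost all disagreements run in one direction, which is the sign of a systematic difference in stringency and not of random noise.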

23. ❌ Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

Authors: Qi Cao, Andrew Gambardella, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20161v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

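Scoring rationale: The paper's core topic is uncertainty quantification for LLMs, which directly involves the “Large Language Models” and “Hallucination Mitigation” keywords (both 10), because the paper explicitly studies the reliability of LLM outputs and proposes a solution. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper addresses unreliable and overconfident LLM outputs by proposing Semantic Token Clustering, an efficient uncertainty quantification method that substantially reduces computational overhead while keeping performance comparable to state-of-the-art baselines.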

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks. However, the truthfulness of their outputs is not guaranteed, and their tendency toward overconfidence further limits reliability. Uncertainty quantification offers a promising way to identify potentially unreliable outputs, but most existing methods rely on repeated sampling or auxiliary models, introducing substantial computational overhead. To address these limitations, we propose Semantic Token Clustering (STC), an efficient uncertainty quantification method that leverages the semantic information inherently encoded in LLMs. Specifically, we group tokens into semantically consistent clusters using embedding clustering and prefix matching, and quantify uncertainty based on the probability mass aggregated over the corresponding semantic cluster. Our approach requires only a single generation and does not depend on auxiliary models. Experimental results show that STC achieves performance comparable to state-of-the-art baselines while substantially reducing computational overhead.

Keywords: Large Language Models, Uncertainty Quantification, Semantic Token Clustering, Truthfulness, Computational Efficiency, Embedding Clustering, Probability Mass, Reliability
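
The aggregation step described in the abstract, summing probability mass over the semantic cluster of the generated token, can be illustrated with a few lines of Python. This is a sketch under assumptions, not the STC implementation: the next-token distribution and the cluster ids are hand-written (the paper derives clusters from embedding clustering and prefix matching), and defining uncertainty as one minus the cluster mass is an assumption made here for simplicity.

```python
def cluster_confidence(token_probs, cluster_of, chosen):
    """Sum the probability of every candidate token in the chosen token's cluster."""
    target = cluster_of[chosen]
    return sum(p for tok, p in token_probs.items() if cluster_of.get(tok) == target)

# Hypothetical next-token distribution from a single generation step.
token_probs = {"Paris": 0.40, "paris": 0.25, " Paris": 0.15, "Lyon": 0.15, "London": 0.05}
# Hypothetical cluster ids; surface variants of the same answer share a cluster.
cluster_of = {"Paris": 0, "paris": 0, " Paris": 0, "Lyon": 1, "London": 2}

confidence = cluster_confidence(token_probs, cluster_of, "Paris")
uncertainty = 1.0 - confidence
print(round(confidence, 2), round(uncertainty, 2))  # 0.8 0.2
```

Judged by the top token alone, the model looks only 40% confident. Pooling the surface variants shows that 80% of the mass agrees on the same answer, and this needs only one generation and no auxiliary model.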

24. ❌ Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case

Authors: H. Sinan Bank, Daniel R. Herber, Thomas H. Bradley Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20151v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

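Scoring rationale: The paper proposes the Design-OS framework for engineering system design. It involves AI-assisted design workflows and collaboration with AI agents but does not specifically mention large models or deep learning techniques. It is only weakly related to “LLM Agents OR Autonomous Agents OR Agentic Workflow” (it mentions AI agents) and “AI for Science OR Bioinformatics OR Cheminformatics” (it involves engineering-science applications); all other keywords are unrelated.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of rigor and traceability in engineering system design by proposing Design-OS, a lightweight, specification-driven workflow that enables human-AI collaboration through structured specifications, and demonstrates it on a control-systems design case.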

Abstract

Engineering system design – whether mechatronic, control, or embedded – often proceeds in an ad hoc manner, with requirements left implicit and traceability from intent to parameters largely absent. Existing specification-driven and systematic design methods mostly target software, and AI-assisted tools tend to enter the workflow at solution generation rather than at problem framing. Human–AI collaboration in the design of physical systems remains underexplored. This paper presents Design-OS, a lightweight, specification-driven workflow for engineering system design organized in five stages: concept definition, literature survey, conceptual design, requirements definition, and design definition. Specifications serve as the shared contract between human designers and AI agents; each stage produces structured artifacts that maintain traceability and support agent-augmented execution. We position Design-OS relative to requirements-driven design, systematic design frameworks, and AI-assisted design pipelines, and demonstrate it on a control systems design case using two rotary inverted pendulum platforms – an open-source SimpleFOC reaction wheel and a commercial Quanser Furuta pendulum – showing how the same specification-driven workflow accommodates fundamentally different implementations. A blank template and the full design-case artifacts are shared in a public repository to support reproducibility and reuse. The workflow makes the design process visible and auditable, and extends specification-driven orchestration of AI from software to physical engineering system design.

Keywords: specification-driven design, engineering system design, AI-assisted design, human-AI collaboration, control systems design, design workflow, traceability, rotary inverted pendulum

25. ❌ Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification

Authors: Ali Sakour, Zoalfekar Sakour Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20149v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

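Scoring rationale: The paper improves the classical HAL (Hyperspace Analogue to Language) model by adding an attention mechanism and SVD dimensionality reduction to boost text classification performance. The work belongs to classical word-vector and text-representation learning rather than modern large language model (LLM) technology. All keywords center on large models, recent deep learning techniques, and their applications, whereas the paper's method (HAL built on co-occurrence matrices) has no direct connection to them. The only marginally relevant keyword is “Mechanistic Interpretability OR Explainable AI”, because the paper mentions improving model interpretability through analysis of attention weights, but this is not its core contribution, hence a 5. The remaining keywords are not covered and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the information loss that mean pooling causes in sentence representations built from the classical HAL model. It proposes pooling with a temperature-scaled additive attention mechanism combined with SVD dimensionality reduction, achieving 82.38% test accuracy on IMDB sentiment analysis, 6.74 percentage points above the mean-pooling baseline, while also improving model interpretability.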

Abstract

The Hyperspace Analogue to Language (HAL) model relies on global word co-occurrence matrices to construct distributional semantic representations. While these representations capture lexical relationships effectively, aggregating them into sentence-level embeddings via standard mean pooling often results in information loss. Mean pooling assigns equal weight to all tokens, thereby diluting the impact of contextually salient words with uninformative structural tokens. In this paper, we address this limitation by integrating a learnable, temperature-scaled additive attention mechanism into the HAL representation pipeline. To mitigate the sparsity and high dimensionality of the raw co-occurrence matrices, we apply Truncated Singular Value Decomposition (SVD) to project the vectors into a dense latent space prior to the attention layer. We evaluate the proposed architecture on the IMDB sentiment analysis dataset. Empirical results demonstrate that the attention-based pooling approach achieves a test accuracy of 82.38%, yielding an absolute improvement of 6.74 percentage points over the traditional mean pooling baseline (75.64%). Furthermore, qualitative analysis of the attention weights indicates that the mechanism successfully suppresses stop-words and selectively attends to sentiment-bearing tokens, improving both classification performance and model interpretability.

Keywords: Hyperspace Analogue to Language, attention mechanism, text classification, sentiment analysis, SVD, pooling, co-occurrence matrix, interpretability
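
The contrast the abstract draws between mean pooling and temperature-scaled additive attention pooling can be seen in a small pure-Python sketch. It follows the standard additive-attention form (score = v · tanh(W h + b), then a softmax over temperature-scaled scores) and is not taken from the paper's code. The two-dimensional token vectors and the parameters `W`, `b`, and `v` are hand-picked assumptions, whereas the paper learns its parameters and works on SVD-reduced HAL vectors.

```python
import math

def mean_pool(vectors):
    """Average token vectors with equal weight per token."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def attention_pool(vectors, W, b, v, temperature=1.0):
    """Additive attention pooling.

    score_i = v . tanh(W h_i + b); weights = softmax(scores / temperature).
    """
    scores = []
    for h in vectors:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)) / temperature)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = range(len(vectors[0]))
    pooled = [sum(w * h[d] for w, h in zip(weights, vectors)) for d in dims]
    return pooled, weights

# Toy 2-d token vectors (hypothetical stand-ins for SVD-reduced HAL vectors).
tokens = [[1.0, 0.0],   # stands in for a sentiment-bearing word
          [0.0, 1.0]]   # stands in for a stop word
W = [[1.0, 0.0], [0.0, 1.0]]  # hand-picked, not learned
b = [0.0, 0.0]
v = [2.0, -2.0]

pooled, weights = attention_pool(tokens, W, b, v, temperature=1.0)
_, sharper = attention_pool(tokens, W, b, v, temperature=0.5)
print(mean_pool(tokens))  # [0.5, 0.5]
print(weights, sharper)
```

Mean pooling gives both tokens the same influence, while the attention weights concentrate on the first token. Lowering the temperature sharpens that concentration further.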

26. ❌ An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

Authors: Ravish Gupta, Saket Kumar, Shreeya Sharma, Maulik Dang, Abhishek Aggarwal Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20131v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

评分理由: 论文提出了一种用于网络安全风险管理的多智能体AI系统，核心涉及LLM智能体架构（Mistral-7B等模型）和多智能体协调，因此与’LLM Agents’、‘Multi-agent Systems’高度相关（10分）。系统使用Mistral-7B（一种小型语言模型）并进行领域微调，与’Small Language Models’、‘Pre-training’、‘Post-training’相关（8分）。论文发现上下文窗口容量是主要限制因素，与’Context Window Extension’高度相关（10分）。其他关键词如MoE、Scaling Laws、RLHF等未在论文中涉及，给0分。

!!! tip deepseek-chat TL;DR

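The paper proposes a six-agent AI system that automates cybersecurity risk assessment. In testing it agreed closely with expert assessments (85% agreement on severity classification), but context window capacity turned out to be the main performance bottleneck for the multi-agent system.

Abstract translation

Carrying out a genuine cybersecurity risk assessment for a small organization is costly: an assessment aligned with the NIST Cybersecurity Framework (NIST CSF) costs at least $15,000, takes several weeks, and depends on genuinely scarce specialists. Most small businesses skip this step entirely. We built a six-agent AI system in which each agent is responsible for one analysis stage: organizational profiling, asset mapping, threat analysis, control evaluation, risk scoring, and recommendation generation. The agents share a persistent context that keeps expanding as the assessment proceeds, so later agents can build on the conclusions of earlier ones; this mechanism is what distinguishes the system from standard sequential agent pipelines. We tested it on a 15-person healthcare company regulated under HIPAA and compared the system output with independent assessments by three CISSP-certified practitioners: the system agreed with expert judgment on risk severity classification 85% of the time, covered 92% of the identified risks, and completed the assessment within 15 minutes. We then ran 30 repeated single-agent assessments across five sectors (healthcare, fintech, manufacturing, retail, and SaaS), using synthetic but sector-realistic organizational profiles, and compared a general-purpose Mistral-7B model with a domain fine-tuned model. Both models completed every run. The fine-tuned model identified threats that the baseline missed entirely: protected health information (PHI) exposure in healthcare, operational technology / industrial IoT (OT/IIoT) vulnerabilities in manufacturing, and platform-specific risks in retail. However, the full multi-agent pipeline failed in all 30 attempts on a Tesla T4 GPU with the default 4,096-token context window, indicating that the constraint is context capacity rather than model quality.

Abstract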

Getting a real cybersecurity risk assessment for a small organization is expensive – a NIST CSF-aligned engagement runs $15,000 on the low end, takes weeks, and depends on practitioners who are genuinely scarce. Most small companies skip it entirely. We built a six-agent AI system where each agent handles one analytical stage: profiling the organization, mapping assets, analyzing threats, evaluating controls, scoring risks, and generating recommendations. Agents share a persistent context that grows as the assessment proceeds, so later agents build on what earlier ones concluded – the mechanism that distinguishes this from standard sequential agent pipelines. We tested it on a 15-person HIPAA-covered healthcare company and compared outputs to independent assessments by three CISSP practitioners – the system agreed with them 85% of the time on severity classifications, covered 92% of identified risks, and finished in under 15 minutes. We then ran 30 repeated single-agent assessments across five synthetic but sector-realistic organizational profiles in healthcare, fintech, manufacturing, retail, and SaaS, comparing a general-purpose Mistral-7B against a domain fine-tuned model. Both completed every run. The fine-tuned model flagged threats the baseline could not see at all: PHI exposure in healthcare, OT/IIoT vulnerabilities in manufacturing, platform-specific risks in retail. The full multi-agent pipeline, however, failed every one of 30 attempts on a Tesla T4 with its 4,096-token default context window – context capacity, not model quality, turned out to be the binding constraint.

Keywords: multi-agent system, cybersecurity risk assessment, LLM agents, context window limitation, domain fine-tuning, Mistral-7B, agent coordination, automated risk management
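
The shared, growing context described in the abstract above can be illustrated with a toy sketch. This is not the paper's implementation: the stage names, the whitespace token counter, and the 1,200-token placeholder agent are all assumptions made for illustration. Only the six-stage order and the 4,096-token limit come from the abstract.

```python
from dataclasses import dataclass, field

CONTEXT_LIMIT = 4096  # tokens; the default window reported in the abstract

# Hypothetical stage names mirroring the six analysis stages in the abstract.
STAGES = ["profile", "assets", "threats", "controls", "risk_scores", "recommendations"]


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a real run would use the model tokenizer."""
    return len(text.split())


@dataclass
class SharedContext:
    """Persistent context that every later agent reads in full."""
    entries: list = field(default_factory=list)

    def render(self) -> str:
        return "\n".join(f"[{stage}] {text}" for stage, text in self.entries)

    def tokens(self) -> int:
        return count_tokens(self.render())


def run_pipeline(agent, limit: int = CONTEXT_LIMIT):
    """Run stages in order; stop at the first stage whose prompt exceeds the limit."""
    ctx = SharedContext()
    for stage in STAGES:
        prompt = ctx.render()
        if count_tokens(prompt) > limit:
            return ctx, stage  # this stage could not be started
        ctx.entries.append((stage, agent(stage, prompt)))
    return ctx, None


def verbose_agent(stage: str, prompt: str) -> str:
    """Stand-in for an LLM call: emits 1,200 placeholder tokens per stage."""
    return " ".join(["finding"] * 1200)


ctx, failed_at = run_pipeline(verbose_agent)
print(failed_at, ctx.tokens())
```

With these placeholder sizes the accumulated context outgrows the window partway through the pipeline, while any single stage on its own would fit. That is the general shape of the failure the abstract reports, although the actual cause in the paper may differ in detail.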

27. ❌ Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Authors: Wenjing Hong, Zhonghua Rong, Li Wang, Feng Chang, Jian Zhu, Ke Tang, Zexuan Zhu, Yew-Soon Ong Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20122v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

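Scoring rationale: The paper's core topic is safety-alignment vulnerabilities of LLMs (jailbreak attacks), so it is highly relevant to 'Large Language Models' (10 points) and touches on safety alignment (8 points). Other keywords, such as MoE, SLMs, Scaling Laws, training methods, inference techniques, and AI for Science, are neither mentioned in the abstract nor relevant to it, so they receive 0 points.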

!!! tip deepseek-chat TL;DR

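The paper proposes EvoJail, an automated multi-objective evolutionary search framework for discovering long-tail distribution jailbreak attacks against large language models. Experiments show that the framework generates diverse and effective attack strategies.

Abstract translation

Large language models (LLMs) have been widely deployed, especially through free web-based applications, which exposes them to diverse user-generated inputs, including inputs from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure raises the risk of jailbreak attacks that can undermine a model's safety alignment. Although recent studies show that exploiting long-tail distributions can facilitate such jailbreaks, existing methods rely mainly on hand-designed rules, which limits systematic evaluation of these security and privacy vulnerabilities. This work proposes EvoJail, an automated framework that discovers long-tail distribution attacks through multi-objective evolutionary search. EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that simultaneously maximizes attack effectiveness and minimizes output perplexity, and it introduces a semantic-algorithmic solution representation to capture both the high-level semantic intent and the low-level structural transformations of encryption-decryption logic. Building on this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive, semantically aware mutation and crossover and thereby efficient exploration of a highly structured, open-ended search space. Extensive experiments show that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies and achieves performance competitive with existing methods at both the individual and ensemble levels.

Abstract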

Large Language Models (LLMs) have been widely deployed, especially through free Web-based applications that expose them to diverse user-generated inputs, including those from long-tail distributions such as low-resource languages and encrypted private data. This open-ended exposure increases the risk of jailbreak attacks that undermine model safety alignment. While recent studies have shown that leveraging long-tail distributions can facilitate such jailbreaks, existing approaches largely rely on handcrafted rules, limiting the systematic evaluation of these security and privacy vulnerabilities. In this work, we present EvoJail, an automated framework for discovering long-tail distribution attacks via multi-objective evolutionary search. EvoJail formulates long-tail attack prompt generation as a multi-objective optimization problem that jointly maximizes attack effectiveness and minimizes output perplexity, and introduces a semantic-algorithmic solution representation to capture both high-level semantic intent and low-level structural transformations of encryption-decryption logic. Building upon this representation, EvoJail integrates LLM-assisted operators into a multi-objective evolutionary framework, enabling adaptive and semantically informed mutation and crossover for efficiently exploring a highly structured and open-ended search space. Extensive experiments demonstrate that EvoJail consistently discovers diverse and effective long-tail jailbreak strategies, achieving competitive performance with existing methods in both individual and ensemble level.

Keywords: Large Language Models, jailbreak attacks, long-tail distributions, multi-objective evolutionary search, safety alignment, automated framework, attack effectiveness, encryption-decryption logic
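
The two-objective selection mentioned in the abstract above (maximize one score, minimize another) is commonly handled with Pareto dominance. The sketch below shows only that generic selection step on made-up numeric pairs. It is not EvoJail's algorithm, and the function names and scores are hypothetical.

```python
def dominates(a, b):
    """True if candidate a dominates candidate b.

    Each candidate is a pair (effectiveness, perplexity): the first value is
    maximized and the second is minimized. a dominates b when it is no worse
    on both objectives and strictly better on at least one.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up objective values for five abstract candidates.
scores = [(0.9, 40.0), (0.7, 12.0), (0.6, 30.0), (0.9, 55.0), (0.2, 8.0)]
front = pareto_front(scores)
print(front)
```

An evolutionary loop would typically keep the non-dominated set as parents for the next generation; the mutation and crossover operators are the part the paper contributes and are not sketched here.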

28. ❌ Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

Authors: Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20116v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

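Scoring rationale: The paper proposes the Chain-of-Adaptation (CoA) framework for adapting vision-language models (VLMs) to the surgical domain. Its core innovation is a structured reasoning format, trained with reinforcement learning, that strengthens domain alignment while preserving the model's multimodal capabilities. Relevance to the keywords: (1) highly relevant to 'AI for Science' (10 points), because the paper focuses on AI applications in surgery; (2) relevant to 'Pre-training / Domain Adaptation' (8 points) and 'Post-training / SFT' (8 points), because it studies domain adaptation and fine-tuning methods; (3) relevant to 'Chain of Thought' (8 points), because the CoA framework involves structured reasoning; (4) somewhat related to 'Large Language Models' (5 points), because VLMs are typically built on large-model architectures. Other keywords such as MoE, SLMs, RLHF, and RAG have no direct connection to the paper and receive 0 points.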

!!! tip deepseek-chat TL;DR

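The paper addresses the problem that fine-tuning vision-language models on surgical data can damage their general capabilities. It proposes the Chain-of-Adaptation framework, which uses reinforcement learning to achieve structured reasoning and improves domain adaptation while preserving the model's core abilities. Experiments show better accuracy, generalization, and stability than supervised fine-tuning.

Abstract translation

Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors and degrade generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while preserving the model's inherent reasoning and perception capabilities. By introducing a structured reasoning format, CoA uses reinforcement learning to strengthen domain alignment without sacrificing general multimodal ability. Experiments on standard surgical benchmarks, in both in-distribution and out-of-distribution settings, show that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. In addition, ablation studies confirm that CoA effectively preserves the model's core vision-language capabilities, providing a reliable path toward domain specialization of vision-language models (VLMs).

Abstract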

Conventional fine-tuning on domain-specific datasets can inadvertently alter a model’s pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model’s inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence by reinforcement learning. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model’s core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.

Keywords: Chain-of-Adaptation, surgical vision-language adaptation, reinforcement learning, domain adaptation, multimodal models, generalization, structured reasoning, visual-language models

29. ❌ Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture – Bridging Predictive and Generative Self-Supervised Learning

Authors: Moritz Gögl, Christopher Yau Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20111v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

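Scoring rationale: The paper studies the Joint-Embedding Predictive Architecture (JEPA) in self-supervised learning and its variational formulation. This is an innovation in the technical foundations of deep learning, but all of the given keywords specifically target large language models (LLMs) and related techniques (fine-tuning, alignment, reasoning, applications, and so on). The paper does not involve LLMs, MoE, SLMs, scaling laws, pre-training or post-training, alignment techniques, parameter-efficient fine-tuning, RAG, context extension, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI-for-science applications. The relevance of every keyword is therefore 0.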

!!! tip deepseek-chat TL;DR

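The paper proposes Var-JEPA, which formalizes the Joint-Embedding Predictive Architecture (JEPA) as a variational inference framework to bridge predictive and generative self-supervised learning. On tabular data it achieves better representation learning and downstream performance than T-JEPA.

Abstract translation

The Joint-Embedding Predictive Architecture (JEPA) is often regarded as a non-generative alternative to likelihood-based self-supervised learning, one that emphasizes prediction in representation space rather than reconstruction in observation space. We argue that the resulting separation from probabilistic generative modeling is largely rhetorical rather than structural: the canonical JEPA design (coupled encoders with a context-to-target predictor) mirrors the variational posteriors and learned conditional priors obtained when variational inference is applied to a particular class of coupled latent-variable models, and standard JEPA can be viewed as a deterministic special case in which regularization is imposed through architectural and training heuristics rather than an explicit likelihood. Building on this view, we derive Variational JEPA (Var-JEPA), which makes the implicit generative structure explicit by optimizing a single evidence lower bound (ELBO). This yields meaningful representations without ad hoc anti-collapse regularizers and allows principled uncertainty quantification in latent space. We instantiate the framework for tabular data (Var-T-JEPA) and obtain strong representation learning and downstream performance, consistently outperforming T-JEPA while remaining competitive with strong raw-feature baselines.

Abstract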

The Joint-Embedding Predictive Architecture (JEPA) is often seen as a non-generative alternative to likelihood-based self-supervised learning, emphasizing prediction in representation space rather than reconstruction in observation space. We argue that the resulting separation from probabilistic generative modeling is largely rhetorical rather than structural: the canonical JEPA design, coupled encoders with a context-to-target predictor, mirrors the variational posteriors and learned conditional priors obtained when variational inference is applied to a particular class of coupled latent-variable models, and standard JEPA can be viewed as a deterministic specialization in which regularization is imposed via architectural and training heuristics rather than an explicit likelihood. Building on this view, we derive the Variational JEPA (Var-JEPA), which makes the latent generative structure explicit by optimizing a single Evidence Lower Bound (ELBO). This yields meaningful representations without ad-hoc anti-collapse regularizers and allows principled uncertainty quantification in the latent space. We instantiate the framework for tabular data (Var-T-JEPA) and achieve strong representation learning and downstream performance, consistently improving over T-JEPA while remaining competitive with strong raw-feature baselines.

Keywords: Joint-Embedding Predictive Architecture, JEPA, Variational Inference, Self-Supervised Learning, Representation Learning, Evidence Lower Bound, ELBO, Tabular Data
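
For reference, the textbook form of the evidence lower bound named in the abstract above is sketched here. This is the generic single-latent ELBO, not the Var-JEPA objective itself, which the abstract does not spell out. The symbols $x$ (observation), $z$ (latent variable), $q_\phi$ (variational posterior), and $p_\theta$ (generative model) are notation assumed for this sketch.

```latex
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

In a coupled context/target setting such as the one the abstract describes, the fixed prior $p(z)$ would presumably be replaced by a learned conditional prior; the exact form should be taken from the paper.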

30. ❌ Demonstration of Adapt4Me: An Uncertainty-Aware Authoring Environment for Personalizing Automatic Speech Recognition to Non-normative Speech

Authors: Niclas Pokel, Yiming Zhao, Pehuén Moure, Yingqiang Gao, Roman Böhringer Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20112v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

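Scoring rationale: The paper focuses on personalizing automatic speech recognition (ASR), specifically for non-normative speech. Its core technical contribution is Adapt4Me, a web-based decentralized environment that uses Bayesian active learning and performs backend personalization with Variational Inference Low-Rank Adaptation (VI-LoRA). It is therefore highly relevant to the keyword 'PEFT OR LoRA OR Parameter-efficient Fine-tuning' (10 points), because VI-LoRA is a variant of LoRA, a parameter-efficient fine-tuning method. The paper does not involve large language models (LLMs), other fine-tuning methods, reasoning techniques, agent systems, or AI-for-science applications, so all other keywords receive 0 points. The weighted total is computed as 10.0 (relevance 10 × weight 1.0).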
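
The weighted-total arithmetic in the rationale above, together with the pass-mark formula from the report configuration (5.0 + 0.8 × total weight), can be written out as a small sketch. The function names are hypothetical and the report's actual scoring code is not shown anywhere in this document; the per-keyword cap listed in the configuration is not modeled here.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark as given in the report configuration: base + factor * total weight."""
    return base + factor * total_weight


def weighted_score(rows) -> float:
    """Sum of weight * relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in rows)


# One keyword at relevance 10 and the other 26 at relevance 0, all with weight 1.0.
rows = [(1.0, 10.0)] + [(1.0, 0.0)] * 26
threshold = passing_score(total_weight=27.0)
total = weighted_score(rows)
print(total, round(threshold, 1), total >= threshold)
```

Under these assumptions a single fully relevant keyword yields 10.0, which is below the 26.6 pass mark.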

!!! tip deepseek-chat TL;DR

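The paper addresses the challenge of personalizing automatic speech recognition (ASR) for non-normative speech. It proposes and demonstrates Adapt4Me, a web-based environment that uses Bayesian active learning and VI-LoRA fine-tuning so that users can guide the creation of a personalized ASR model in a data-efficient way.

Abstract translation

Personalizing automatic speech recognition (ASR) for non-normative speech remains challenging, mainly because data collection is time-consuming and laborious and model training is technically complex. To overcome these limitations, we propose Adapt4Me, a web-based decentralized environment that operationalizes Bayesian active learning to enable end-to-end personalization without expert supervision. Through a three-stage human-in-the-loop workflow, the application opens data selection, adaptation, and validation to lay users: (1) rapid speech profiling via greedy phoneme sampling to capture speaker-specific acoustic characteristics; (2) backend personalization using Variational Inference Low-Rank Adaptation (VI-LoRA), which supports fast incremental updates; and (3) a continuous-improvement stage in which users guide model refinement by resolving visualized model uncertainty through a low-effort top-k correction mechanism. By making epistemic uncertainty explicit, Adapt4Me reframes data efficiency as an interactive design feature rather than a purely algorithmic problem. We show that the framework lets users personalize robust ASR models, turning them from passive data sources into active creators of their own assistive technology.

Abstract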

Personalizing Automatic Speech Recognition (ASR) for non-normative speech remains challenging because data collection is labor-intensive and model training is technically complex. To address these limitations, we propose Adapt4Me, a web-based decentralized environment that operationalizes Bayesian active learning to enable end-to-end personalization without expert supervision. The app exposes data selection, adaptation, and validation to lay users through a three-stage human-in-the-loop workflow: (1) rapid profiling via greedy phoneme sampling to capture speaker-specific acoustics; (2) backend personalization using Variational Inference Low-Rank Adaptation (VI-LoRA) to enable fast, incremental updates; and (3) continuous improvement, where users guide model refinement by resolving visualized model uncertainty via low-friction top-k corrections. By making epistemic uncertainty explicit, Adapt4Me reframes data efficiency as an interactive design feature rather than a purely algorithmic concern. We show how this enables users to personalize robust ASR models, transforming them from passive data sources into active authors of their own assistive technology.

Keywords: Automatic Speech Recognition, Personalization, Non-normative Speech, Bayesian Active Learning, Variational Inference Low-Rank Adaptation, VI-LoRA, Human-in-the-loop, Uncertainty-aware
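
The abstract above mentions "greedy phoneme sampling" without giving details. One plausible reading, assumed here, is a greedy coverage heuristic: repeatedly pick the recording prompt that adds the most phonemes not yet covered. The sketch below illustrates that reading only. The function name, the prompt texts, and the phoneme sets are hypothetical and are not taken from Adapt4Me.

```python
def greedy_phoneme_sampling(prompts, budget):
    """Pick up to `budget` prompts, each time taking the one with the most new phonemes.

    prompts: dict mapping prompt text -> set of phoneme symbols it contains.
    Returns the chosen prompts in order and the set of phonemes they cover.
    """
    covered, chosen = set(), []
    remaining = dict(prompts)
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda p: len(remaining[p] - covered))
        gain = remaining[best] - covered
        if not gain:  # nothing new left to cover
            break
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Hypothetical prompts with ARPAbet-style phoneme sets.
prompts = {
    "she sells": {"SH", "IY", "S", "EH", "L", "Z"},
    "see": {"S", "IY"},
    "thin path": {"TH", "IH", "N", "P", "AE"},
    "zip": {"Z", "IH", "P"},
}
chosen, covered = greedy_phoneme_sampling(prompts, budget=2)
print(chosen, len(covered))
```

A coverage heuristic like this keeps the initial recording session short, which fits the "rapid profiling" goal stated in the abstract; the real system may weight phonemes by model uncertainty instead.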

31. ❌ The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus

Authors: Amartya Roy, Rasul Tutunov, Xiaotong Ji, Matthieu Zimmer, Haitham Bou-Ammar Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20105v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

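Scoring rationale: The paper's core topic is the bottleneck LLMs face in long-context reasoning; it proposes λ-RLM, a framework based on λ-calculus, to replace existing Recursive Language Models (RLMs). It is highly relevant to 'Large Language Models' and 'Context Window Extension' (10 points each), because it directly addresses the long-context limits of LLMs. The reasoning-related keywords ('Chain of Thought' and 'System 2 Thinking') receive 5 points each, because the paper involves multi-step recursive reasoning but not conventional CoT methods. 'Speculative Decoding' receives 5 points, because the framework improves inference efficiency through structured control flow and pre-verified combinators (latency reduced by up to 4.1x). Other keywords such as MoE, SFT, and RAG are unrelated to the paper and receive 0 points.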

!!! tip deepseek-chat TL;DR

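The paper targets the bottleneck large language models face in long-context reasoning. It proposes λ-RLM, a framework based on λ-calculus that replaces open-ended recursive code generation with structured functional programming, and it achieves higher accuracy and lower latency across multiple tasks and models.

Abstract translation

Large language models are increasingly used as general-purpose reasoners, but long inputs remain limited by a fixed context window. Recursive language models address this limit by externalizing the prompt and solving subproblems recursively. However, existing recursive language models rely on an open-ended read-eval-print loop in which the model generates arbitrary control code, which makes execution hard to verify, predict, and analyze. We propose $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime based on $λ$-calculus. It executes a library of pre-verified combinators and performs neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM offers formal guarantees that standard recursive language models lack, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard recursive language models in 29 of 36 model-task comparisons, improves average accuracy by up to 21.9 percentage points across model tiers, and reduces latency by up to 4.1x. These results indicate that typed symbolic control provides a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM is open-sourced for the community at https://github.com/lambda-calculus-LLM/lambda-RLM.

Abstract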

LLMs are increasingly used as general-purpose reasoners, but long inputs remain bottlenecked by a fixed context window. Recursive Language Models (RLMs) address this by externalising the prompt and recursively solving subproblems. Yet existing RLMs depend on an open-ended read-eval-print loop (REPL) in which the model generates arbitrary control code, making execution difficult to verify, predict, and analyse. We introduce $λ$-RLM, a framework for long-context reasoning that replaces free-form recursive code generation with a typed functional runtime grounded in $λ$-calculus. It executes a compact library of pre-verified combinators and uses neural inference only on bounded leaf subproblems, turning recursive reasoning into a structured functional program with explicit control flow. We show that $λ$-RLM admits formal guarantees absent from standard RLMs, including termination, closed-form cost bounds, controlled accuracy scaling with recursion depth, and an optimal partition rule under a simple cost model. Empirically, across four long-context reasoning tasks and nine base models, $λ$-RLM outperforms standard RLM in 29 of 36 model-task comparisons, improves average accuracy by up to +21.9 points across model tiers, and reduces latency by up to 4.1x. These results show that typed symbolic control yields a more reliable and efficient foundation for long-context reasoning than open-ended recursive code generation. The complete implementation of $λ$-RLM, is open-sourced for the community at: https://github.com/lambda-calculus-LLM/lambda-RLM.

Keywords: Large Language Models, Long-context reasoning, Recursive Language Models, λ-calculus, Context window, Formal guarantees, Inference efficiency, Structured functional programming
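
The idea in the abstract above, a fixed recursive control structure with model calls only on bounded leaves, can be illustrated with a small divide-and-combine function. This sketch is not the λ-RLM runtime or its combinator library. The names `fold_tree`, `count_errors`, and the parameters are hypothetical, and the leaf function is a plain Python stand-in for a model call.

```python
def fold_tree(words, leaf, combine, max_leaf=4, fanout=2):
    """Split `words` into at most `fanout` parts until each part fits in `max_leaf`.

    `leaf` is called only on bounded chunks and `combine` merges child results.
    Recursion terminates because every part is strictly shorter than its parent
    (this needs fanout >= 2 and max_leaf >= 1).
    """
    if len(words) <= max_leaf:
        return leaf(words)
    size = -(-len(words) // fanout)  # ceiling division
    parts = [words[i:i + size] for i in range(0, len(words), size)]
    return combine([fold_tree(p, leaf, combine, max_leaf, fanout) for p in parts])


leaf_calls = []


def count_errors(chunk):
    """Stand-in for a bounded model call: count ERROR tokens in a small chunk."""
    leaf_calls.append(len(chunk))
    return sum(1 for w in chunk if w == "ERROR")


log = ("ok ok ERROR ok " * 5).split()  # 20 tokens, 5 of them ERROR
total_errors = fold_tree(log, leaf=count_errors, combine=sum)
print(total_errors, len(leaf_calls), max(leaf_calls))
```

Because the splitting rule is fixed in ordinary code and not generated by the model, the number and size of leaf calls can be reasoned about before anything runs, which is the kind of property the abstract attributes to typed symbolic control.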

32. ❌ Spectral Alignment in Forward-Backward Representations via Temporal Abstraction

Authors: Seyed Mahdi B. Azad, Jasper Hoffmann, Iman Nematollahi, Hao Zhu, Abhinav Valada, Joschka Boedecker Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20103v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

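Scoring rationale: The paper studies representation learning in reinforcement learning, specifically forward-backward representations and temporal abstraction in continuous control environments, and has no direct connection to any of the provided large-model and deep-learning keywords. It does not cover language models, model training techniques, reasoning methods, agent systems, model optimization, or AI-for-science applications.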

!!! tip deepseek-chat TL;DR

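The paper studies how temporal abstraction can mitigate the spectral mismatch in forward-backward representation learning in continuous control environments, thereby improving the stability of long-horizon representation learning.

Abstract translation

The forward-backward (FB) representation framework offers a powerful tool for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, there is often a fundamental spectral mismatch between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, which makes accurate low-rank representation learning difficult. This paper analyzes temporal abstraction as a mechanism for mitigating the mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression lowers the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Experiments show that this alignment is a key factor in the stability of FB learning, especially at high discount factors, where bootstrapping is prone to error. Our findings establish temporal abstraction as a principled mechanism that shapes the spectral structure of the underlying Markov decision process (MDP) and enables effective long-horizon representations in continuous control.

Abstract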

Forward-backward (FB) representations provide a powerful framework for learning the successor representation (SR) in continuous spaces by enforcing a low-rank factorization. However, a fundamental spectral mismatch often exists between the high-rank transition dynamics of continuous environments and the low-rank bottleneck of the FB architecture, making accurate low-rank representation learning difficult. In this work, we analyze temporal abstraction as a mechanism to mitigate this mismatch. By characterizing the spectral properties of the transition operator, we show that temporal abstraction acts as a low-pass filter that suppresses high-frequency spectral components. This suppression reduces the effective rank of the induced SR while preserving a formal bound on the resulting value function error. Empirically, we show that this alignment is a key factor for stable FB learning, particularly at high discount factors where bootstrapping becomes error-prone. Our results identify temporal abstraction as a principled mechanism for shaping the spectral structure of the underlying MDP and enabling effective long-horizon representations in continuous control.

Keywords: forward-backward representations, temporal abstraction, spectral alignment, successor representation, continuous control, low-rank factorization, transition dynamics, value function error
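
The low-pass effect described in the abstract above can be checked numerically on a toy chain. The sketch below makes several assumptions that are not from the paper: temporal abstraction is modeled as taking k steps of the same transition matrix P at once, the discount is held fixed per abstract step, the environment is a lazy random walk on a ring (whose eigenvalues are known in closed form), and "effective rank" is counted as the number of spectral values within a factor of the largest.

```python
import math


def lazy_ring_eigenvalues(n):
    """Eigenvalues of a lazy random walk on an n-state ring
    (stay with probability 1/2, step left or right with probability 1/4 each)."""
    return [0.5 + 0.5 * math.cos(2 * math.pi * j / n) for j in range(n)]


def sr_spectrum(eigs, gamma, k):
    """Eigenvalues of the successor representation (I - gamma * P^k)^(-1).

    P is symmetric here, so P^k has eigenvalues lam**k and these values are
    also the singular values of the SR.
    """
    return [1.0 / (1.0 - gamma * lam ** k) for lam in eigs]


def effective_rank(spectrum, tol=0.1):
    """Count spectral values that are at least `tol` times the largest one."""
    top = max(spectrum)
    return sum(1 for s in spectrum if s >= tol * top)


eigs = lazy_ring_eigenvalues(50)
ranks = {k: effective_rank(sr_spectrum(eigs, gamma=0.99, k=k)) for k in (1, 2, 4, 8)}
print(ranks)
```

Raising eigenvalues below 1 to the k-th power shrinks them toward zero while the top eigenvalue stays at 1, so fewer directions carry significant weight in the SR as k grows. This is the toy analogue of the rank reduction the abstract describes.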

33. ❌ Pitfalls in Evaluating Interpretability Agents

Authors: Tal Haklay, Nikhil Prakash, Sana Pandey, Antonio Torralba, Aaron Mueller, Jacob Andreas, Tamar Rott Shaham, Yonatan Belinkov Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20101v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

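Scoring rationale: The paper's core topic is automated interpretability systems that use LLMs to build autonomous research agents (interpretability agents) for circuit analysis, so it is highly relevant to 'Large Language Models' and 'LLM Agents' (10 points each). It deals directly with interpretable AI (Mechanistic Interpretability), so that keyword also receives 10 points. Other keywords, such as MoE, quantization, inference acceleration, and alignment, are not mentioned in the abstract or are unrelated to the paper's topic, and receive 0 points.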

!!! tip deepseek-chat TL;DR

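The paper examines pitfalls in evaluating LLM-based autonomous interpretability agents. It finds that replication-based evaluation suffers from subjectivity, an obscured research process, and memorization, and it proposes an unsupervised intrinsic evaluation method based on functional interchangeability.

Abstract translation

Automated interpretability systems aim to reduce reliance on human labor and to extend analysis to increasingly large models and diverse tasks. Toward this goal, recent work uses large language models (LLMs) at varying levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift in turn requires evaluation methods to scale so that they keep pace with the growing volume and complexity of generated explanations. We explore this challenge in the context of automated circuit analysis, that is, explaining the roles that model components play when performing a specific task. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations on six circuit analysis tasks from the literature, the system appears competitive. Closer inspection, however, reveals several flaws in replication-based evaluation: human expert explanations can be subjective or incomplete, outcome-based comparison hides the research process, and LLM-based systems may reproduce published findings through memorization or informed guessing. To address some of these flaws, we propose an unsupervised intrinsic evaluation method based on the functional interchangeability of model components. Our work exposes fundamental challenges in evaluating complex automated interpretability systems and points out key limitations of replication-based evaluation.

Abstract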

Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this challenge in the context of automated circuit analysis – explaining the roles of model components when performing specific tasks. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations across six circuit analysis tasks in the literature, the system appears competitive. However, closer examination reveals several pitfalls of replication-based evaluation: human expert explanations can be subjective or incomplete, outcome-based comparisons obscure the research process, and LLM-based systems may reproduce published findings via memorization or informed guessing. To address some of these pitfalls, we propose an unsupervised intrinsic evaluation based on the functional interchangeability of model components. Our work demonstrates fundamental challenges in evaluating complex automated interpretability systems and reveals key limitations of replication-based evaluation.

Keywords: interpretability agents, large language models, automated circuit analysis, evaluation pitfalls, functional interchangeability, autonomous systems, model components, replication-based evaluation

34. ❌ An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

Authors: Yuming Feng, Christy Yang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20100v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	10.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

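Scoring rationale: The paper studies the interaction and parameterization of SFT, DPO, and LoRA on small language models, so it is highly relevant to 'SFT', 'DPO', and 'LoRA' (10 points each) and to 'Small Language Models' (10 points). It is moderately related to 'Alignment' (5 points) and indirectly related to 'Large Language Models' (5 points). No other keyword is covered (0 points).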

!!! tip deepseek-chat TL;DR

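This paper studies, on small language models, how supervised fine-tuning (SFT) interacts with Direct Preference Optimization (DPO) and how full fine-tuning (FFT) compares with low-rank adaptation (LoRA). It finds that FFT consistently outperforms LoRA, while DPO yields only limited gains over SFT.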

Abstract (Chinese translation)

直接偏好优化（DPO）在监督微调（SFT）后被广泛用于对齐语言模型，但其在小型骨干网络和有限数据下的实证行为尚未明确界定。我们系统比较了仅使用SFT、仅使用DPO以及分阶段SFT转DPO的训练方式，同时在GPT-2规模的解码器上对比了全参数微调（FFT）与低秩自适应（LoRA）方法，并评估了释义检测和莎士比亚十四行诗续写任务。DPO相较于强SFT基线仅带来微小且任务依赖性的提升，当偏好构建与监督目标高度一致时，无需预热启动即可达到与竞争性SFT相当的准确率。相比之下，参数化方式占据主导地位：在相同训练深度下，FFT始终优于LoRA，且LoRA在我们的硬件上并未减少实际训练时间。这些结果表明，在此小规模实验体系中，监督式全参数适应仍是性能提升的主要杠杆，而偏好优化与低秩自适应仅能提供有限的边际收益。

Abstract

Direct Preference Optimization (DPO) is widely used after supervised fine-tuning (SFT) to align language models, yet empirical behavior under small backbones and modest data is under-specified. We systematically compare SFT-only, DPO-only, and staged SFT-to-DPO training alongside full fine-tuning (FFT) versus LoRA on a GPT-2-scale decoder, evaluating paraphrase detection and Shakespearean sonnet continuation. DPO yields small, task-dependent gains over strong SFT and can match competitive SFT accuracy without a warm start when the preference construction closely parallels the supervised objective. In contrast, parameterization dominates: FFT consistently outperforms LoRA at matched training depth, and LoRA does not reduce wall-clock time on our hardware. These findings indicate that, in this small-scale regime, supervised full-parameter adaptation remains the primary performance lever, while preference optimization and low-rank adaptation provide limited marginal returns.

Keywords: Small Language Models, Supervised Fine-tuning, Direct Preference Optimization, LoRA, Parameterization, Full Fine-tuning, Empirical Study, GPT-2-scale
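
As a rough illustration of the preference objective that the abstract above compares against SFT, the sketch below computes the standard DPO loss for a single preference pair. It is not the authors' code: the function name, the default `beta`, and the use of summed sequence log-probabilities as plain floats are assumptions made for the example.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each response under the
    trainable policy and under the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Raising the chosen response and lowering the rejected one reduces the loss.
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

A DPO-only run, as described in the abstract, would minimize this loss starting from the pretrained weights, whereas the staged SFT-to-DPO setting would start from the SFT checkpoint and typically use it as the reference as well.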

35. ❌ Fine-tuning Timeseries Predictors Using Reinforcement Learning

Authors: Hugo Cazaux, Ralph Rudd, Hlynur Stefánsson, Sverrir Ólafsson, Eyjólfur Ingi Ásgeirsson Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20063v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

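Scoring rationale: The paper fine-tunes financial time-series forecasting models with reinforcement learning. It is only moderately related to 'Post-training OR Supervised Fine-tuning OR SFT' (5 points): it involves fine-tuning but does not explicitly mention supervised fine-tuning or SFT. None of the other keywords applies: the paper focuses on applying conventional machine learning to finance and does not touch on large models, MoE, quantization, inference acceleration, RAG, alignment, or other modern large-model techniques.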

!!! tip deepseek-chat TL;DR

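This paper studies the use of reinforcement learning algorithms to fine-tune financial time-series forecasting models. The results show improved performance after fine-tuning, along with transfer-learning properties.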

Abstract (Chinese translation)

本章介绍了三种用于微调金融预测模型的主要强化学习算法。我们提出了一种清晰的实施方案，将强化学习任务的损失反向传播至经监督学习训练的模型，并比较微调前后的性能表现。研究发现微调后模型性能有所提升，并展现出迁移学习特性，这印证了微调策略的优越性。我们还重点阐述了调参过程与实证结果，为从业者的后续实践提供参考。

Abstract

This chapter presents three major reinforcement learning algorithms used for fine-tuning financial forecasters. We propose a clear implementation plan for backpropagating the loss of a reinforcement learning task to a model trained using supervised learning, and compare the performance before and after the fine-tuning. We find an increase in performance after fine-tuning, and transfer learning properties to the models, indicating the benefits of fine-tuning. We also highlight the tuning process and empirical results for future implementation by practitioners.

Keywords: reinforcement learning, fine-tuning, timeseries prediction, financial forecasting, transfer learning, supervised learning, backpropagation, empirical results

36. ❌ Agentic Harness for Real-World Compilers

Authors: Yingwei Zheng, Cong Li, Shaohua Li, Yuqun Zhang, Zhendong Su Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20075v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

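Scoring rationale: The paper's core topic is applying LLM agents to compiler bug repair, and it develops the agentic harness llvm-autofix, so it is highly relevant to 'LLM Agents' (10 points). It notes that LLM agents need to understand and use compiler-specific tools, which relates to 'Tool Use' (5 points). It explicitly builds on LLMs as the underlying technology, so it is highly relevant to 'Large Language Models' (10 points). Other keywords such as MoE, SFT, RAG, and quantization are not covered by the paper (0 points).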

!!! tip deepseek-chat TL;DR

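Targeting the unique challenges of compiler bug repair, this paper introduces llvm-autofix, the first agentic harness for helping LLM agents understand and fix LLVM compiler bugs. Its minimal agent, llvm-autofix-mini, outperforms the state of the art by approximately 22%.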

Abstract (Chinese translation)

编译器对现代计算至关重要，但修复编译器漏洞却十分困难。尽管近期大语言模型（LLM）的进展使得自动化漏洞修复成为可能，但编译器漏洞因其复杂性、对深度跨领域专业知识的需求，以及稀疏且缺乏描述性的漏洞报告而带来独特挑战，这要求开发编译器专用的工具。为弥补这一差距，我们推出了 llvm-autofix，这是首个为协助LLM智能体理解和修复编译器漏洞而设计的智能体驱动框架。我们的研究聚焦于LLVM这一应用最广泛的编译器基础设施之一。llvm-autofix的核心包括便于智能体使用的LLVM工具、一个包含可复现LLVM漏洞的基准测试集 llvm-bench，以及一个为修复LLVM漏洞定制的精简智能体 llvm-autofix-mini。我们的评估表明，在处理编译器漏洞时，前沿模型的性能相比处理普通软件漏洞下降了60%。我们的精简智能体 llvm-autofix-mini 的表现也优于现有最佳方法约22%。这凸显了需要如我们框架这样的专用工具链，以弥合大语言模型与编译器工程之间的鸿沟。我们相信这项工作为提升大语言模型在编译器等复杂系统中的能力奠定了基础。GitHub: https://github.com/dtcxzyw/llvm-autofix

Abstract

Compilers are critical to modern computing, yet fixing compiler bugs is difficult. While recent large language model (LLM) advancements enable automated bug repair, compiler bugs pose unique challenges due to their complexity, deep cross-domain expertise requirements, and sparse, non-descriptive bug reports, necessitating compiler-specific tools. To bridge the gap, we introduce llvm-autofix, the first agentic harness designed to assist LLM agents in understanding and fixing compiler bugs. Our focus is on LLVM, one of the most widely used compiler infrastructures. Central to llvm-autofix are agent-friendly LLVM tools, a benchmark llvm-bench of reproducible LLVM bugs, and a tailored minimal agent llvm-autofix-mini for fixing LLVM bugs. Our evaluation demonstrates a performance decline of 60% in frontier models when tackling compiler bugs compared with common software bugs. Our minimal agent llvm-autofix-mini also outperforms the state-of-the-art by approximately 22%. This emphasizes the necessity for specialized harnesses like ours to close the loop between LLMs and compiler engineering. We believe this work establishes a foundation for advancing LLM capabilities in complex systems like compilers. GitHub: https://github.com/dtcxzyw/llvm-autofix

Keywords: LLM agents, compiler bug repair, LLVM, agentic harness, automated bug fixing, large language models, specialized tools, performance evaluation

37. ❌ The End of Rented Discovery: How AI Search Redistributes Power Between Hotels and Intermediaries

Authors: Peiying Zhu, Sidi Chang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20062v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

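Scoring rationale: The paper studies the citation patterns of an AI search engine (Google Gemini) in hotel recommendation, which is AI application research. All keywords, however, target technical aspects of large models such as underlying principles, training methods, inference optimization, and alignment techniques. The paper only analyzes the application-level behavior of an AI search system and does not examine any underlying model technology, training method, or optimization technique, so it is unrelated to every keyword.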

!!! tip deepseek-chat TL;DR

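This paper studies how an AI search engine (Google Gemini) cites different sources in hotel recommendation queries. It finds that experiential queries cite non-OTA sources more often than transactional queries do, which may weaken the dominance of commission-based intermediaries in hotel discovery.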

Abstract (Chinese translation)

当旅行者请求AI搜索引擎推荐酒店时，哪些来源会被引用——查询的表述方式是否会产生影响？我们审核了谷歌Gemini在东京156个酒店查询中生成的1,357条溯源引用，记录了一种系统性模式，并将其命名为“意图-来源鸿沟”。体验型查询（experiential queries）的引用中有55.9%来自非在线旅行社（non-OTA）来源，而交易型查询（transactional queries）的这一比例仅为30.8%——两者存在25.1个百分点的差距（$p < 5 \times 10^{-20}$）。该效应在日语查询中更为显著：体验型查询的非OTA引用占比达62.1%，而英语查询中为50.0%——这与日本非OTA内容生态更具多样性的现状相符。对于长期依赖向OTA支付费用以获取客源的酒店业而言，这一模式具有重要意义，因为它表明AI搜索可能使酒店发现过程不再完全由基于佣金的中介机构所控制。

Abstract

When a traveler asks an AI search engine to recommend a hotel, which sources get cited – and does query framing matter? We audit 1,357 grounding citations from Google Gemini across 156 hotel queries in Tokyo and document a systematic pattern we call the Intent-Source Divide. Experiential queries draw 55.9% of their citations from non-OTA sources, compared to 30.8% for transactional queries – a 25.1 percentage-point gap ($p < 5 \times 10^{-20}$). The effect is amplified in Japanese, where experiential queries draw 62.1% non-OTA citations compared to 50.0% in English – consistent with a more diverse Japanese non-OTA content ecosystem. For an industry in which hotels have long paid OTAs for demand acquisition, this pattern matters because it suggests that AI search may make hotel discovery less exclusively controlled by commission-based intermediaries.

Keywords: AI search, hotel recommendation, grounding citations, Intent-Source Divide, experiential queries, transactional queries, OTA, non-OTA sources
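
The abstract above reports its headline effect in percentage points, which is easy to confuse with a relative change. The short sketch below recomputes the gaps from the shares quoted in the abstract. The helper names are hypothetical, and no per-group citation counts are assumed, so the reported p-value is not reproduced here.

```python
def pp_gap(share_a, share_b):
    """Absolute difference between two percentage shares, in percentage points."""
    return round(share_a - share_b, 1)


def relative_ratio(share_a, share_b):
    """How many times larger share_a is than share_b."""
    return share_a / share_b


# Non-OTA citation shares quoted in the abstract (percent).
experiential, transactional = 55.9, 30.8
japanese_experiential, english_experiential = 62.1, 50.0

print(pp_gap(experiential, transactional))                  # 25.1
print(pp_gap(japanese_experiential, english_experiential))  # 12.1
print(relative_ratio(experiential, transactional))          # roughly 1.8
```

Read this way, experiential queries cite non-OTA sources roughly 1.8 times as often as transactional queries, which is the same finding as the 25.1 percentage-point gap expressed on a relative scale.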

38. ❌ Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

Authors: Wenjian Zhang, Kongcheng Zhang, Jiaxin Qi, Baisheng Lai, Jianqiang Huang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20046v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	10.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

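Scoring rationale: The paper's core topic is reinforcement learning optimization for LLMs. It proposes the HeRL framework, which uses failed trajectories as hindsight experience to guide exploration. It is highly relevant to 'Large Language Models' (10 points), falls within the scope of 'RLHF/RLAIF/DPO' (10 points), and involves 'In-context Learning' (10 points) and 'Self-Correction/Self-Improvement' (10 points). Other keywords such as MoE, SLMs, Scaling Laws, and PEFT are not covered (0 points).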

!!! tip deepseek-chat TL;DR

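To address inefficient exploration in reinforcement learning for LLMs, this paper proposes the HeRL framework, which uses hindsight experience to guide exploration. Experiments show marked performance gains and support for self-improvement at test time.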

Abstract (Chinese translation)

基于规则奖励的强化学习（Reinforcement Learning, RL）近期在提升大语言模型（Large Language Models, LLMs）的通用推理能力方面取得了显著进展，但仍受限于当前策略分布内的低效探索。实际上，RL优化可被视为将策略导向能最大化奖励的理想分布，而有效的探索应将努力与期望目标对齐。基于这一洞见，我们提出HeRL——一种后见经验引导的强化学习框架，通过明确告知LLM奖励中指定的期望行为，来引导有效的探索。具体而言，HeRL将失败轨迹及其未满足的规则视为后见经验，作为上下文指导，促使策略探索当前分布之外符合期望的响应。此外，我们引入额外奖励，以激励模型在此类指导下生成更具改进潜力的响应。HeRL使得模型能够从期望的高质量样本中有效学习，无需从头开始反复试错，理论上能获得更准确的期望梯度估计。在多个基准测试上的广泛实验表明，HeRL相比基线方法实现了更优的性能提升，并能在测试阶段进一步受益于经验引导的自我改进。我们的代码公开于https://github.com/sikelifei/HeRL。

Abstract

Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework to bootstrap effective exploration by explicitly telling LLMs the desired behaviors specified in rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high quality samples without repeated trial-and-error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience guided self-improvement at test time. Our code is available at https://github.com/sikelifei/HeRL.

Keywords: Reinforcement Learning, Large Language Models, Exploration, Hindsight Experience, In-context Guidance, Self-improvement, Rubric-based Rewards
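
The abstract above describes turning a failed trajectory and its unmet rubrics into in-context guidance. The sketch below shows one way such a prompt could be assembled. It is a hypothetical illustration rather than the HeRL implementation: the function name, argument names, and prompt wording are all assumptions, and the bonus reward mentioned in the abstract is not modeled.

```python
def build_hindsight_prompt(question, failed_response, unmet_rubrics):
    """Assemble an in-context prompt from a failed attempt and its unmet rubrics.

    Hypothetical helper for illustration only; the wording is an assumption.
    """
    rubric_lines = "\n".join(f"- {item}" for item in unmet_rubrics)
    return (
        f"Question:\n{question}\n\n"
        f"Previous attempt (did not satisfy every rubric item):\n{failed_response}\n\n"
        f"Rubric items that were not met:\n{rubric_lines}\n\n"
        "Write an improved answer that satisfies the rubric items above."
    )


prompt = build_hindsight_prompt(
    question="Explain why the sky is blue.",
    failed_response="Because it reflects the ocean.",
    unmet_rubrics=["Mentions Rayleigh scattering", "Explains wavelength dependence"],
)
print(prompt)
```

Responses sampled under a prompt like this come from outside the policy's unguided distribution, which is the exploration effect the abstract attributes to hindsight experience.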

39. ❌ DIAL-KG: Schema-Free Incremental Knowledge Graph Construction via Dynamic Schema Induction and Evolution-Intent Assessment

Authors: Weidong Bao, Yilin Wang, Ruyu Gao, Fangling Leng, Yubin Bao, Ge Yu Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20059v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

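Scoring rationale: DIAL-KG focuses on knowledge graph construction and proposes an incremental construction framework built around a Meta-Knowledge Base, involving dynamic schema induction, knowledge extraction, governance adjudication, and related techniques. All scoring keywords concern large models, deep learning techniques, or AI applications in science, whereas the paper does not mention large models, deep learning techniques, or AI applications in scientific fields such as bioinformatics, so every keyword receives a relevance of 0.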

!!! tip deepseek-chat TL;DR

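To address the limitations of conventional static knowledge graph construction in dynamic-data scenarios, this paper proposes the DIAL-KG framework, which achieves schema-free incremental knowledge graph construction through dynamic schema induction and evolution-intent assessment. Experiments show state-of-the-art performance in both graph quality and schema induction.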

Abstract (Chinese translation)

知识图谱（Knowledge Graphs, KGs）是搜索、问答与推荐等应用的基础。传统的知识图谱构建方法主要为静态方法，依赖于基于固定语料库和预定义模式（schema）的单步构建。然而，此类方法在数据动态到达的现实场景中并非最优，因为纳入新信息需要进行完整且计算成本高昂的图谱重构。此外，预定义模式限制了知识图谱构建的灵活性。为应对这些局限，我们提出了DIAL-KG，这是一个由元知识库（Meta-Knowledge Base, MKB）协调的增量式知识图谱构建闭环框架。该框架在一个三阶段循环中运行：（i）双轨提取，通过默认进行三元组生成，并在遇到复杂知识时切换至事件提取，以确保知识的完整性；（ii）治理裁决，确保所提取事实的保真度与时效性，以防止幻觉和知识陈旧；（iii）模式演化，从已验证的知识中归纳出新模式，以指导后续构建循环，并将本轮知识增量式地应用于现有知识图谱。大量实验表明，我们的框架在所构建图谱的质量和归纳出的模式质量上均达到了最先进的性能水平。

Abstract

Knowledge Graphs (KGs) are foundational to applications such as search, question answering, and recommendation. Conventional knowledge graph construction methods are predominantly static, relying on a single-step construction from a fixed corpus with a predefined schema. However, such methods are suboptimal for real-world scenarios where data arrives dynamically, as incorporating new information requires complete and computationally expensive graph reconstructions. Furthermore, predefined schemas hinder the flexibility of knowledge graph construction. To address these limitations, we introduce DIAL-KG, a closed-loop framework for incremental KG construction orchestrated by a Meta-Knowledge Base (MKB). The framework operates in a three-stage cycle: (i) Dual-Track Extraction, which ensures knowledge completeness by defaulting to triple generation and switching to event extraction for complex knowledge; (ii) Governance Adjudication, which ensures the fidelity and currency of extracted facts to prevent hallucinations and knowledge staleness; and (iii) Schema Evolution, in which new schemas are induced from validated knowledge to guide subsequent construction cycles, and knowledge from the current round is incrementally applied to the existing KG. Extensive experiments demonstrate that our framework achieves state-of-the-art (SOTA) performance in the quality of both the constructed graph and the induced schemas.

Keywords: Knowledge Graph Construction, Incremental Construction, Dynamic Schema Induction, Meta-Knowledge Base, Triple Generation, Event Extraction, Governance Adjudication, Schema Evolution

40. ❌ LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families

Authors: Jianan Chen, Xiaoxue Gao, Tatsuya Kawahara, Nancy F. Chen Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20042v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

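Scoring rationale: The paper explicitly studies the role of large language models (LLMs) in speech language models (SpeechLMs) and evaluates their performance on low-resource automatic speech recognition (ASR), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not cover the specific techniques or applications behind the other keywords, such as MoE, SLMs, training methods, inference techniques, agent systems, or compression, so those keywords score 0. Although the paper involves multilingual evaluation, it does not clearly fall within a scientific AI application area such as bioinformatics or cheminformatics, so 'AI for Science OR Bioinformatics OR Cheminformatics' also scores 0.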

!!! tip deepseek-chat TL;DR

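This paper proposes LoASR-Bench, a benchmark for evaluating large speech language models on automatic speech recognition for low-resource languages, and finds that existing models have limitations in handling real-world low-resource languages.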

Abstract (Chinese translation)

大语言模型（LLMs）的进步显著推动了语音语言模型（SpeechLMs）的发展，使其在高资源条件下的自动语音识别（Automatic Speech Recognition, ASR）中表现出强劲性能。然而，现有基准测试主要集中于高资源语言，导致对SpeechLMs在低资源语言中的ASR行为理解不足。这一空白至关重要，因为实用的ASR系统必须可靠地支持低资源语言，并能在不同语系间泛化，它直接阻碍了基于SpeechLM的ASR在现实多语言场景中的部署。因此，评估SpeechLMs在低资源语言上的表现，以确保其跨不同语系的泛化能力，显得尤为必要。为解决这一问题，我们提出了LoASR-Bench，这是一个综合性基准测试，旨在评估最新SpeechLMs跨多样语系的低资源自动语音识别（ASR）性能。LoASR-Bench涵盖9个语系的25种语言，同时包含拉丁与非拉丁文字，从而能够对当前SpeechLMs的ASR表现进行跨语言和跨文字的评估。实验结果凸显了最新SpeechLMs在处理现实世界低资源语言方面的局限性。

Abstract

Large language models (LLMs) have driven substantial advances in speech language models (SpeechLMs), yielding strong performance in automatic speech recognition (ASR) under high-resource conditions. However, existing benchmarks predominantly focus on high-resource languages, leaving the ASR behavior of SpeechLMs in low-resource languages insufficiently understood. This gap is critical, as practical ASR systems must reliably support low-resource languages and generalize across diverse language families, and it directly hinders the deployment of SpeechLM-based ASR in real-world multilingual scenarios. As a result, it is essential to evaluate SpeechLMs on low-resource languages to ensure their generalizability across different language families. To address this problem, we propose LoASR-Bench, a comprehensive benchmark designed to evaluate low-resource automatic speech recognition (ASR) of the latest SpeechLMs across diverse language families. LoASR-Bench comprises 25 languages from 9 language families, featuring both Latin and non-Latin scripts, enabling cross-linguistic and cross-script assessment of ASR performance of current SpeechLMs. Experimental results highlight the limitations of the latest SpeechLMs in handling real-world low-resource languages.

Keywords: Large Language Models, Speech Language Models, Automatic Speech Recognition, Low-resource Languages, Benchmark Evaluation, Multilingual Scenarios, Language Families, Cross-linguistic Assessment

41. ❌ CoverageBench: Evaluating Information Coverage across Tasks and Domains

Authors: Saron Samuel, Andrew Yates, Dawn Lawrie, Ian Soboroff, Trevor Adriaanse, Benjamin Van Durme, Eugene Yang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20034v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	10.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

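Scoring rationale: The paper focuses on information retrieval evaluation, specifically evaluating information coverage in retrieval-augmented generation (RAG) systems. It is highly relevant only to 'Retrieval-Augmented Generation OR RAG OR Retrieval-Generation' (10 points), since the abstract explicitly mentions RAG systems. The other keywords concern large-model principles, training methods, inference optimization, application areas, and so on, none of which the paper addresses, so they all score 0.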

!!! tip deepseek-chat TL;DR

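This paper proposes CoverageBench, a benchmark suite for evaluating the information coverage of retrieval algorithms, with particular attention to retrieval-augmented generation (RAG) systems, and releases datasets spanning multiple domains and tasks.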

Abstract (Chinese translation)

我们希望衡量一种特定检索算法的信息覆盖度，即搜索结果覆盖了多少可用相关信息的范围。信息覆盖度是检索的核心方面，尤其在检索系统与生成模型结合于检索增强生成（RAG）系统中时。用于特定检索的经典指标——精确率和召回率——会随着检索到越来越多的相关文档而给予系统更高评价。然而，由于特定测试集中的相关性是针对单个文档定义的，并未考虑其他可能包含相同信息的文档，因此高召回率足以但并非确保覆盖度的必要条件。其他指标如排名偏置精确率（RBP）、归一化折损累计增益（nDCG）和平均精确率均值（MAP）同样如此。围绕网络搜索中多样性排序概念开发的测试集包含了支持网络领域覆盖度概念的多个方面。在本研究中，我们基于现有测试集构建了一套用于评估信息覆盖度的测试集。这套测试集为研究人员提供了一个跨越多种类型和任务的统一测试平台。所有主题、信息单元、相关性标签和基线排名均已发布于Hugging Face Datasets平台，同时附有访问公开文档集的说明。

Abstract

We wish to measure the information coverage of an ad hoc retrieval algorithm, that is, how much of the range of available relevant information is covered by the search results. Information coverage is a central aspect for retrieval, especially when the retrieval system is integrated with generative models in a retrieval-augmented generation (RAG) system. The classic metrics for ad hoc retrieval, precision and recall, reward a system as more and more relevant documents are retrieved. However, since relevance in ad hoc test collections is defined for a document without any relation to other documents that might contain the same information, high recall is sufficient but not necessary to ensure coverage. The same is true for other metrics such as rank-biased precision (RBP), normalized discounted cumulative gain (nDCG), and mean average precision (MAP). Test collections developed around the notion of diversity ranking in web search incorporate multiple aspects that support a concept of coverage in the web domain. In this work, we construct a suite of collections for evaluating information coverage from existing collections. This suite offers researchers a unified testbed spanning multiple genres and tasks. All topics, nuggets, relevance labels, and baseline rankings are released on Hugging Face Datasets, along with instructions for accessing the publicly available document collections.

Keywords: information coverage, retrieval-augmented generation, RAG, ad hoc retrieval, evaluation benchmark, test collections, diversity ranking, Hugging Face Datasets

42. ❌ Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

Authors: Maximiliano Armesto, Christophe Kolb Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20028v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

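Scoring rationale: The paper studies the use of AI in team-level software delivery, specifically how the Chiron platform coordinates humans and AI agents to complete software modernization programs. The work mainly concerns workflow coordination among AI agents (hence some relevance to 'LLM Agents' and 'Multi-agent Systems'), but it does not examine LLM technical principles, training methods, inference optimization, alignment techniques, model compression, or similar technical details. The paper does not mention LLMs, MoE, scaling laws, pre-training, RLHF, RAG, attention optimization, reasoning methods, model interpretability, or other specific technical keywords.

!!! tip deepseek-chat TL;DR

Through a retrospective longitudinal field study of three real software modernization programs, the study finds that embedding AI in an orchestrated workflow (rather than deploying it as an isolated coding assistant) is associated with faster delivery, higher first-release coverage, and fewer validation-stage issues.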

Abstract

Evidence on AI in software engineering still leans heavily toward individual task completion, while evidence on team-level delivery remains scarce. We report a retrospective longitudinal field study of Chiron, an industrial platform that coordinates humans and AI agents across four delivery stages: analysis, planning, implementation, and validation. The study covers three real software modernization programs – a COBOL banking migration (~30k LOC), a large accounting modernization (~400k LOC), and a .NET/Angular mortgage modernization (~30k LOC) – observed across five delivery configurations: a traditional baseline and four successive platform versions (V1–V4). The benchmark separates observed outcomes (stage durations, task volumes, validation-stage issues, first-release coverage) from modeled outcomes (person-days and senior-equivalent effort under explicit staffing scenarios). Under baseline staffing assumptions, portfolio totals move from 36.0 to 9.3 summed project-weeks; modeled raw effort falls from 1080.0 to 232.5 person-days; modeled senior-equivalent effort falls from 1080.0 to 139.5 SEE-days; validation-stage issue load falls from 8.03 to 2.09 issues per 100 tasks; and first-release coverage rises from 77.0% to 90.5%. V3 and V4 add acceptance-criteria validation, repository-native review, and hybrid human-agent execution, simultaneously improving speed, coverage, and issue load. The evidence supports a central thesis: the largest gains appear when AI is embedded in an orchestrated workflow rather than deployed as an isolated coding assistant.
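To make the size of the reported changes easier to compare, the before/after figures quoted in the abstract can be converted to relative reductions. This is only arithmetic on the quoted numbers; the dictionary keys and the helper name are illustrative choices, not terminology from the paper beyond the metric names.

```python
# Before/after values as quoted in the abstract (baseline staffing assumptions).
reported = {
    "summed project-weeks": (36.0, 9.3),
    "modeled raw effort (person-days)": (1080.0, 232.5),
    "modeled senior-equivalent effort (SEE-days)": (1080.0, 139.5),
    "validation issues per 100 tasks": (8.03, 2.09),
}

def relative_reduction(before, after):
    """Relative reduction expressed as a fraction of the baseline value."""
    return (before - after) / before

reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}
for name, r in reductions.items():
    print(f"{name}: {r:.1%} lower than baseline")

# First-release coverage is already a percentage, so report the change in points.
coverage_gain_points = 90.5 - 77.0
print(f"first-release coverage: +{coverage_gain_points:.1f} percentage points")
```

On these figures the reductions come out at roughly 74%, 78%, 87%, and 74% respectively, and first-release coverage rises by 13.5 percentage points.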

Keywords: AI in software engineering, human-AI coordination, software modernization, delivery workflow, AI agents, field study, orchestrated workflow, team-level delivery

43. ❌ Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

Authors: Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20020v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

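Scoring rationale: The paper addresses an optimization problem in multimodal large language models (MLLMs) on OCR tasks and proposes Detached Skip-Links and R-Probe to deal with gradient interference in feature fusion. Its core is highly relevant to large language models (LLMs), because MLLMs extend LLMs and the paper explicitly refers to LLM layers and training. Other keywords such as MoE, SFT, RAG, and quantization are not covered, and the paper does not discuss AI applications in scientific domains, so 'AI for Science' also scores 0.

!!! tip deepseek-chat TL;DR

The paper addresses training instability in multimodal large language models on OCR tasks caused by gradient interference through skip-links. It proposes Detached Skip-Links to mitigate the interference and the R-Probe diagnostic to measure whether fine-grained detail is preserved, and reports improvements on multiple benchmarks.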

Abstract

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
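The asymmetric behavior described above (shallow features reused in the forward pass, no gradient through the skip branch) can be sketched with a scalar toy model. This is an illustration under stated assumptions, not the authors' implementation: the function `fuse` and its parameters are hypothetical, and the gradient is written out by hand. In a deep learning framework the same effect is commonly obtained with a stop-gradient operation such as `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.

```python
def fuse(x, w_deep, detach_skip):
    """Toy scalar fusion: output = deep_path(x) + skip(x).

    Returns the forward value and d(output)/dx. With detach_skip=True the
    skip branch still contributes to the forward value but is treated as a
    constant in the backward pass (stop-gradient).
    """
    deep = w_deep * x          # stands in for the deeper layers
    skip = x                   # shallow feature reused via the skip branch
    output = deep + skip
    grad_deep = w_deep         # gradient reaching x through the deep path
    grad_skip = 0.0 if detach_skip else 1.0  # gradient through the skip branch
    return output, grad_deep + grad_skip

attached_out, attached_grad = fuse(2.0, 0.5, detach_skip=False)
detached_out, detached_grad = fuse(2.0, 0.5, detach_skip=True)
print(attached_out, attached_grad)
print(detached_out, detached_grad)
```

The forward values are identical in both settings; only the gradient reaching the early feature changes, which is the property the abstract attributes to the detached design.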

Keywords: Multimodal Large Language Models, OCR, Skip-links, Gradient interference, Feature aggregation, Training stability, Visual tokens, Fine-grained details

44. ❌ Physics-Informed Long-Range Coulomb Correction for Machine-learning Hamiltonians

Authors: Yang Zhong, Xiwen Li, Xingao Gong, Hongjun Xiang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20007v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

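Scoring rationale: The paper focuses on machine-learning electronic Hamiltonians in computational materials science, in particular on improving the model with a physics-informed long-range Coulomb correction. Its core is an application of machine learning to scientific computing (specifically materials physics), so it is relevant only to the keyword 'AI for Science OR Bioinformatics OR Cheminformatics', with a relevance of 8 (it falls under AI for Science, but is not core bioinformatics or cheminformatics). All other keywords concern large language models (LLMs) and related techniques (training, alignment, reasoning, agents, and so on), whereas this paper studies a graph neural network (GNN) model for a specific scientific computation and involves no language models or general-purpose AI techniques, so their relevance is 0.

!!! tip deepseek-chat TL;DR

The paper addresses the omission of long-range Coulomb interactions in machine-learning electronic Hamiltonians by proposing a physics-informed long-range correction framework (HamGNN-LR), which achieves two- to threefold error reductions and improved transferability on polar crystals and heterostructures.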

Abstract

Machine-learning electronic Hamiltonians achieve orders-of-magnitude speedups over density-functional theory, yet current models omit long-range Coulomb interactions that govern physics in polar crystals and heterostructures. We derive closed-form long-range Hamiltonian matrix elements in a nonorthogonal atomic-orbital basis through variational decomposition of the electrostatic energy, deriving a variationally consistent mapping from the electron density matrix to effective atomic charges. We implement this framework in HamGNN-LR, a dual-channel architecture combining E(3)-equivariant message passing with reciprocal-space Ewald summation. Benchmarks demonstrate that physics-based long-range corrections are essential: purely data-driven attention mechanisms fail to capture macroscopic electrostatic potentials. Benchmarks on polar ZnO slabs, CdSe/ZnS heterostructures, and GaN/AlN superlattices show two- to threefold error reductions and robust transferability to systems far beyond training sizes, eliminating the characteristic staircase artifacts that plague short-range models in the presence of built-in electric fields.
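As a rough intuition for why a finite interaction cutoff struggles with Coulomb terms, the sketch below truncates the electrostatic sum for an infinite 1D chain of alternating unit charges. This is a textbook toy case chosen for illustration; it is neither Ewald summation nor the HamGNN-LR method, and the function name and cutoffs are assumptions.

```python
import math

def chain_potential(cutoff_neighbors):
    """Electrostatic sum at one site of an infinite 1D chain of alternating
    unit charges with unit spacing, truncated after `cutoff_neighbors`
    neighbors on each side (nearest neighbors counted as positive)."""
    return 2.0 * sum((-1) ** (n + 1) / n for n in range(1, cutoff_neighbors + 1))

exact = 2.0 * math.log(2.0)  # known closed form for the 1D alternating chain

# The truncation error shrinks only slowly as the cutoff grows.
for cutoff in (2, 10, 100, 1000):
    approx = chain_potential(cutoff)
    print(cutoff, approx, abs(approx - exact))
```

The error decays only about as fast as the inverse of the cutoff, so a model limited to a short interaction range leaves a noticeable residual, which is consistent with the motivation for an explicit long-range channel.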

Keywords: Machine-learning Hamiltonians, Long-range Coulomb interactions, Physics-informed correction, E(3)-equivariant message passing, Electrostatic energy, Polar crystals, Heterostructures, Transferability

45. ❌ Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

Authors: Yurun Yuan, Tengyang Xie Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19987v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

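Scoring rationale: The paper's core subject is reinforcement learning (RL) for LLM post-training, which directly involves the keywords 'Large Language Models', 'Post-training', and 'RLHF'; these are central to the paper (10 points each). Other keywords such as MoE, SLMs, scaling laws, RAG, and quantization have no direct connection to the RL post-training bottleneck and the Markov-state approach studied here (0 points).

!!! tip deepseek-chat TL;DR

The paper finds that current reinforcement learning methods for LLM post-training face a "capability ceiling" and proposes reintroducing explicit Markov states to break through it, reporting consistent gains on a suite of complex logic puzzles.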

Abstract

Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent “capability ceiling”: unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond “history-as-state” modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
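The contrast between "history-as-state" and a compact Markov state can be sketched on a toy puzzle. The three-light toggle puzzle and the function `markov_state` below are hypothetical and are not the paper's benchmark or its state estimator; they only show how many distinct action histories collapse onto a small set of states.

```python
from itertools import product

def markov_state(history, n_lights=3):
    """Fold an action history into a compact state: which lights are on."""
    lights = [False] * n_lights
    for action in history:          # each action toggles one light
        lights[action] = not lights[action]
    return tuple(lights)

# Two different histories that lead to the same puzzle configuration.
history_a = [0, 1, 1, 2]
history_b = [2, 0]
print(markov_state(history_a) == markov_state(history_b))

# Count all histories of length 0..4 versus the distinct states they reach.
histories = [h for length in range(5) for h in product(range(3), repeat=length)]
states = {markov_state(h) for h in histories}
print(len(histories), len(states))
```

The number of histories grows exponentially with length while the number of states stays bounded, so a policy or value function conditioned on the state sees far fewer distinct inputs than one conditioned on the raw history.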

Keywords: Large Language Models, Post-training, Reinforcement Learning, Markov States, Capability Ceiling, RLHF, Sample Complexity, Reasoning Capabilities

46. ❌ Promoting Critical Thinking With Domain-Specific Generative AI Provocations

Authors: Thomas Şerban von Davier, Hao-Ping Lee, Jodi Forlizzi, Sauvik Das Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19975v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

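Scoring rationale: The paper studies how generative AI (GenAI) can promote critical thinking through domain-specific provocations (such as questions asking for clarification and justification), and covers tool design in two domains (fine art interpretation and AI privacy). Relevance by keyword: (1) 'Large Language Models' etc.: the paper discusses GenAI and may involve LLMs, but does not name specific model techniques (5 points). (2) 'Chain of Thought' etc.: the paper is concerned with critical thinking, which relates to multi-step reasoning, but it does not use CoT terminology directly (5 points). (3) 'System 2 Thinking' etc.: the paper's core is promoting deep, slow critical thinking, which is highly relevant to System 2 thinking (8 points). (4) 'Self-Correction' etc.: the paper notes that AI-driven provocations can support self-reflection and improvement (5 points). The remaining keywords (MoE, SFT, RAG, and so on) concern specific technical details or domains (such as bioinformatics) that the paper does not cover (0 points).

!!! tip deepseek-chat TL;DR

The paper examines how domain-specific generative AI provocations (such as questions asking users for clarification and justification) can promote critical thinking. Drawing on observations of two prototype tools (ArtBot and Privy), it finds that interactions built on productive friction and user contribution can meaningfully support critical thinking, and that such support may need to adapt to users' preferences and levels of expertise.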

Abstract

The evidence on the effects of generative AI (GenAI) on critical thinking is mixed, with studies suggesting both potential harms and benefits depending on its implementation. Some argue that AI-driven provocations, such as questions asking for human clarification and justification, are beneficial for eliciting critical thinking. Drawing on our experience designing and evaluating two GenAI-powered tools for knowledge work, ArtBot in the domain of fine art interpretation and Privy in the domain of AI privacy, we reflect on how design decisions shape the form and effectiveness of such provocations. Our observations and user feedback suggest that domain-specific provocations, implemented through productive friction and interactions that depend on user contribution, can meaningfully support critical thinking. We present participant experiences with both prototypes and discuss how supporting critical thinking may require moving beyond static provocations toward approaches that adapt to user preferences and levels of expertise.

Keywords: generative AI, critical thinking, domain-specific provocations, productive friction, user contribution, AI design, knowledge work, human-AI interaction

47. ❌ X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving

Authors: Chaoda Zheng, Sean Li, Jinhao Deng, Zhennan Wang, Shijia Chen, Liqiang Xiao, Ziheng Chi, Hongbin Lin, Kangjie Chen, Boyang Wang, Yu Zhang, Xianming Liu Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19979v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

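Scoring rationale: X-World focuses on building world models for autonomous driving; its core is a controllable multi-camera video-generation world model that simulates future driving scenes. The paper is unrelated to the vast majority of keywords (LLM technical principles, training methods, inference optimization, alignment, agents, and so on) and is highly relevant only to 'World Models AND General World Models', because it explicitly builds and studies a generative world model for autonomous driving. None of the other keywords is mentioned or implied in the title or abstract.

!!! tip deepseek-chat TL;DR

To address the high cost and limited scenario coverage of real-world road testing in autonomous-driving evaluation, the paper proposes X-World, a controllable, ego-centric multi-camera generative world model that generates high-quality, consistent multi-view future video streams conditioned on a future action sequence, providing a practical foundation for scalable and reproducible evaluation.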

Abstract

Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision–language–action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.

Keywords: world models, autonomous driving, multi-camera video generation, action-conditioned simulation, controllable generation, scalable evaluation, end-to-end driving, vision-language-action policies

48. ❌ Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

Authors: Fazhong Liu, Zhuoyan Chen, Tu Lan, Haozhen Tan, Zhenyu Xu, Xiang Li, Guoxing Chen, Yan Meng, Haojin Zhu Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19974v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

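Scoring rationale: The paper studies security vulnerabilities in autonomous coding agents (the OpenClaw platform). Its core involves LLM-driven autonomous agents (LLM Agents / Autonomous Agents) and their tool-use capabilities (Tool Use / Function Calling), so the three keywords 'Large Language Models', 'LLM Agents', and 'Tool Use' are highly relevant (10 points each). The paper does not cover other technical innovations (MoE, quantization, inference acceleration, and so on) or scientific applications, so the remaining keywords score 0.

!!! tip deepseek-chat TL;DR

The paper exposes guidance injection, a stealthy attack vector in the autonomous coding agent platform OpenClaw. The authors build 26 malicious skills that achieve attack success rates of 16.0% to 64.2% on a realistic benchmark, and 94% of the malicious skills evade existing detectors.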

Abstract

Autonomous coding agents are increasingly integrated into software development workflows, offering capabilities that extend beyond code suggestion to active system interaction and environment management. OpenClaw, a representative platform in this emerging paradigm, introduces an extensible skill ecosystem that allows third-party developers to inject behavioral guidance through lifecycle hooks during agent initialization. While this design enhances automation and customization, it also opens a novel and unexplored attack surface. In this paper, we identify and systematically characterize guidance injection, a stealthy attack vector that embeds adversarial operational narratives into bootstrap guidance files. Unlike traditional prompt injection, which relies on explicit malicious instructions, guidance injection manipulates the agent’s reasoning context by framing harmful actions as routine best practices. These narratives are automatically incorporated into the agent’s interpretive framework and influence future task execution without raising suspicion. We construct 26 malicious skills spanning 13 attack categories including credential exfiltration, workspace destruction, privilege escalation, and persistent backdoor installation. We evaluate them using ORE-Bench, a realistic developer workspace benchmark we developed. Across 52 natural user prompts and six state-of-the-art LLM backends, our attacks achieve success rates from 16.0% to 64.2%, with the majority of malicious actions executed autonomously without user confirmation. Furthermore, 94% of our malicious skills evade detection by existing static and LLM-based scanners. Our findings reveal fundamental tensions in the design of autonomous agent ecosystems and underscore the urgent need for defenses based on capability isolation, runtime policy enforcement, and transparent guidance provenance.

Keywords: autonomous coding agents, guidance injection, OpenClaw, LLM backends, attack surface, malicious skills, stealthy attack, agent security

49. ❌ On the Ability of Transformers to Verify Plans

Authors: Yash Sarrof, Yupei Du, Katharina Stein, Alexander Koller, Sylvie Thiébaux, Michael Hahn Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19954v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

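Rationale: The paper studies the ability of Transformers to verify plans in AI planning tasks. It is a theoretical analysis of a foundation model architecture, but it involves neither innovation in the technical principles of large models nor a concrete application. All keywords target specific large-model techniques (training methods, optimization techniques, application frameworks, and so on) or applications in particular scientific fields, whereas this paper is limited to a theoretical analysis of the Transformer architecture and has no direct connection to those keywords.

!!! tip deepseek-chat TL;DR

The paper studies the ability of Transformer models to verify the correctness of plans. Through theoretical analysis and experiments, it identifies a class of classical planning domains, together with their structural properties, for which Transformers can learn to verify long plans.

Abstract (Chinese translation)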

Transformer在AI规划任务中表现出不稳定的成功，且对其泛化能力何时应被预期的理论理解仍较为有限。我们通过分析仅解码器模型验证给定规划是否正确解决特定规划实例的能力，为解决这一空白迈出了重要步伐。为分析测试时对象数量（即有效输入字母表）增长的一般情境，我们引入了C*-RASP——这是C-RASP的扩展，旨在为Transformer在序列长度与词汇量同步增长的情况下建立长度泛化保证。我们的研究结果识别出一大类经典规划领域，其中Transformer可被证明能够学习验证长规划，并揭示了显著影响长度可泛化解可学习性的结构特性。实证实验佐证了我们的理论。

Abstract

Transformers have shown inconsistent success in AI planning tasks, and theoretical understanding of when generalization should be expected has been limited. We take important steps towards addressing this gap by analyzing the ability of decoder-only models to verify whether a given plan correctly solves a given planning instance. To analyse the general setting where the number of objects – and thus the effective input alphabet – grows at test time, we introduce C*-RASP, an extension of C-RASP designed to establish length generalization guarantees for transformers under the simultaneous growth in sequence length and vocabulary size. Our results identify a large class of classical planning domains for which transformers can provably learn to verify long plans, and structural properties that significantly affects the learnability of length generalizable solutions. Empirical experiments corroborate our theory.

Keywords: Transformers, AI planning, plan verification, length generalization, C*-RASP, decoder-only models, theoretical analysis, generalization guarantees
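
To make the task in the abstract above concrete, here is a minimal sketch of what "verifying a plan" means in a STRIPS-style classical planning setting. This is the symbolic reference check, not the paper's Transformer or its C*-RASP construction. The `Action` representation and the toy blocks-world instance are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    """A ground STRIPS action (hypothetical representation for illustration)."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    delete_effects: frozenset

def verify_plan(initial_state, goal, plan):
    """Return True if applying `plan` from `initial_state` reaches `goal`."""
    state = set(initial_state)
    for action in plan:
        # Every precondition must hold in the current state.
        if not action.preconditions <= state:
            return False
        # Apply effects: remove deleted facts, then add new ones.
        state = (state - action.delete_effects) | action.add_effects
    return set(goal) <= state

# Toy instance: pick block a up from the table and stack it on block b.
pick_a = Action("pick(a)",
                frozenset({"ontable(a)", "clear(a)", "handempty"}),
                frozenset({"holding(a)"}),
                frozenset({"ontable(a)", "clear(a)", "handempty"}))
stack_a_b = Action("stack(a,b)",
                   frozenset({"holding(a)", "clear(b)"}),
                   frozenset({"on(a,b)", "clear(a)", "handempty"}),
                   frozenset({"holding(a)", "clear(b)"}))

init = {"ontable(a)", "ontable(b)", "clear(a)", "clear(b)", "handempty"}
goal = {"on(a,b)"}
```

Calling `verify_plan(init, goal, [pick_a, stack_a_b])` accepts the plan, while `[stack_a_b]` alone is rejected because `holding(a)` does not hold initially. The setting in the abstract is harder because both the plan length and the number of objects can grow at test time.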

50. ❌ Graph2TS: Structure-Controlled Time Series Generation via Quantile-Graph VAEs

Authors: Shaoshuai Du, Joze M. Rozanec, Andy Pimentel, Ana-Lucia Varbanescu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19970v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

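Rationale: The paper “Graph2TS: Structure-Controlled Time Series Generation via Quantile-Graph VAEs” focuses on time series generation and proposes a quantile-graph conditioned variational autoencoder that generates time series from structural graphs. Its core topics are time series modeling, generative models (VAEs), and graph representations, which are unrelated to the vast majority of the keywords (large-model techniques, training methods, inference optimization, alignment, agents, and so on). The only loosely related keyword is “AI for Science OR Bioinformatics OR Cheminformatics”: the paper runs experiments on biomedical signals such as ECG and EEG, which counts as an application of AI to science and bioinformatics, but this is not its main focus, so it receives a relevance of 5 (some connection). No other keyword is covered, so each of them receives 0.

!!! tip deepseek-chat TL;DR

The paper addresses the difficulty of balancing the preservation of global temporal structure against the modeling of local stochastic variation when generating highly volatile, weakly periodic time series. It proposes Graph2TS, a quantile-graph conditioned variational autoencoder that performs cross-modal generation from structural graphs to time series, and demonstrates its advantages in distributional fidelity, temporal alignment, and representativeness on multiple datasets.

Abstract (Chinese translation)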

尽管近期生成模型能够生成具有相近边缘分布的时间序列，但其往往面临保持全局时间结构与建模随机局部变异之间的根本性张力，对于周期性弱或不规则的高度波动信号尤为如此。在此类场景中直接进行分布匹配可能放大噪声或抑制有意义的时间模式。本研究提出一种时间序列生成的结构-残差视角，将时序数据视为结构骨架与随机残差动态的组合，从而推动将全局组织与样本级变异性分离。基于这一洞见，我们采用基于分位数的转移图来表征时间序列结构，该表示法能紧凑地捕捉全局分布依赖与时间依赖关系。在此基础上，我们提出Graph2TS——一种基于分位数图的条件变分自编码器，可实现从结构图到时序数据的跨模态生成。通过以结构而非标签或元数据作为生成条件，该模型在保持全局时间组织的同时实现了可控的随机变异。在太阳黑子、电力负荷、心电图及脑电图等多类数据集上的实验表明，相较于基于扩散模型与生成对抗网络的基线方法，本模型在分布保真度、时间对齐度和表征代表性方面均有提升，凸显出结构控制与跨模态生成作为时间序列建模方向的潜力。

Abstract

Although recent generative models can produce time series with close marginal distributions, they often face a fundamental tension between preserving global temporal structure and modeling stochastic local variations, particularly for highly volatile signals with weak or irregular periodicity. Direct distribution matching in such settings can amplify noise or suppress meaningful temporal patterns. In this work, we propose a structure-residual perspective on time-series generation, viewing temporal data as the combination of a structural backbone and stochastic residual dynamics, thereby motivating the separation of global organization from sample-level variability. Based on this insight, we represent time-series structure using a quantile-based transition graph that compactly captures global distributional and temporal dependencies. Building on this representation, we propose Graph2TS, a quantile-graph conditioned variational autoencoder that performs cross-modal generation from structural graphs to time series. By conditioning generation on structure rather than labels or metadata, the model preserves global temporal organization while enabling controlled stochastic variation. Experiments on diverse datasets, including sunspot, electricity load, ECG, and EEG signals, demonstrate improved distributional fidelity, temporal alignment, and representativeness compared to diffusion- and GAN-based baselines, highlighting structure-controlled and cross-modal generation as a promising direction for time-series modeling.

Keywords: time series generation, structure-controlled generation, quantile-graph, variational autoencoder, cross-modal generation, temporal structure, stochastic variation, Graph2TS
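
The abstract above represents time-series structure with a quantile-based transition graph. One plausible reading, sketched below, is to discretize values into quantile bins and count transitions between consecutive bins. The paper's exact construction is not given here, so the nearest-rank bin edges, the row normalization, and the function name `quantile_transition_graph` are assumptions.

```python
def quantile_transition_graph(series, n_bins=4):
    """Row-normalised transition matrix between quantile bins of `series`."""
    ranked = sorted(series)
    n = len(ranked)
    # Interior bin edges taken at empirical quantiles (simple nearest-rank choice).
    edges = [ranked[(k * n) // n_bins] for k in range(1, n_bins)]

    def bin_of(x):
        # Index of the first edge that x falls below; the last bin otherwise.
        for i, edge in enumerate(edges):
            if x < edge:
                return i
        return n_bins - 1

    states = [bin_of(x) for x in series]
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1  # one edge per consecutive pair of time steps

    graph = []
    for row in counts:
        total = sum(row)
        graph.append([c / total if total else 0.0 for c in row])
    return graph
```

For a repeating ramp such as `[1, 2, 3, 4, 1, 2, 3, 4]`, every bin moves deterministically to the next one, so each row of the matrix has a single entry equal to 1.0. A noisier signal spreads mass across each row, which is the kind of global structure a generator could be conditioned on.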

51. ❌ HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction

Authors: Ruicheng Yuan, Zhenxuan Zhang, Anbang Wang, Liwei Hu, Xiangqian Hua, Yaya Peng, Jiawei Luo, Guang Yang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19957v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

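Rationale: HiPath focuses on applying vision-language models (VLMs) to pathology, with structured pathology report prediction at its core, so it is an application of AI to biomedicine (pathology). Among all keywords, only “AI for Science OR Bioinformatics OR Cheminformatics” is directly relevant, because pathology is a subfield of bioinformatics and biomedical AI. The other keywords mainly concern the technical principles, training methods, inference optimization, and agent systems of large language models (LLMs), whereas this paper uses a vision-language model (built on UNI2 and Qwen3) and does not touch LLM-specific techniques such as MoE, Scaling Laws, RLHF, PEFT, RAG, CoT, or Agents, nor general deep learning topics such as model compression, inference acceleration, or interpretability. Only the last keyword therefore receives a relevance of 10 (highly relevant); the rest receive 0 (unrelated).

!!! tip deepseek-chat TL;DR

This paper proposes HiPath, a lightweight vision-language model framework that generates structured diagnostic reports from pathology images. It achieves high accuracy and safety on real-world data and shows good cross-hospital generalization.

Abstract (Chinese translation)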

病理报告是结构化、多粒度的文档，其编码了针对一个或多个解剖部位的诊断结论、组织学分级及辅助检测结果；然而现有的病理视觉语言模型（VLMs）将此类输出简化为扁平标签或自由文本。本文提出HiPath，一个基于冻结UNI2与Qwen3主干构建的轻量级VLM框架，该框架将结构化报告预测作为其核心训练目标。三个总计1500万参数的可训练模块分别处理问题的互补层面：用于多图像视觉编码的层级化切片聚合器（Hierarchical Patch Aggregator, HiPA）、通过最优传输实现跨模态对齐的层级化对比学习（Hierarchical Contrastive Learning, HiCL），以及用于结构化诊断生成的基于槽位的掩码诊断预测（Slot-based Masked Diagnosis Prediction, Slot-MDP）。在来自三家医院的74.9万例真实世界中文病理病例数据上训练后，HiPath在严格准确率上达到68.9%，临床可接受准确率达74.7%，安全率为97.3%，在相同冻结主干下优于所有基线模型。跨医院评估验证了其泛化能力，严格准确率仅下降3.4个百分点，同时保持97.1%的安全率。

Abstract

Pathology reports are structured, multi-granular documents encoding diagnostic conclusions, histological grades, and ancillary test results across one or more anatomical sites; yet existing pathology vision-language models (VLMs) reduce this output to a flat label or free-form text. We present HiPath, a lightweight VLM framework built on frozen UNI2 and Qwen3 backbones that treats structured report prediction as its primary training objective. Three trainable modules totalling 15M parameters address complementary aspects of the problem: a Hierarchical Patch Aggregator (HiPA) for multi-image visual encoding, Hierarchical Contrastive Learning (HiCL) for cross-modal alignment via optimal transport, and Slot-based Masked Diagnosis Prediction (Slot-MDP) for structured diagnosis generation. Trained on 749K real-world Chinese pathology cases from three hospitals, HiPath achieves 68.9% strict and 74.7% clinically acceptable accuracy with a 97.3% safety rate, outperforming all baselines under the same frozen backbone. Cross-hospital evaluation confirms generalisation with only a 3.4pp drop in strict accuracy while maintaining 97.1% safety.

Keywords: pathology report prediction, vision-language model, hierarchical alignment, structured diagnosis generation, multi-granular documents, clinical accuracy, cross-hospital evaluation, frozen backbone

52. ❌ RAM: Recover Any 3D Human Motion in-the-Wild

Authors: Sen Jia, Ning Zhu, Jinqin Zhong, Jiale Zhou, Huaping Zhang, Jenq-Neng Hwang, Lei Li Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19929v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

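Rationale: The paper “RAM: Recover Any 3D Human Motion in-the-Wild” belongs to computer vision, specifically 3D human motion capture, and covers motion tracking, pose estimation, and reconstruction. Although it uses deep learning components (such as the Temporal HMR module), its core content is unrelated to the provided keywords, which center on large language models, training methods, inference techniques, alignment, compression, agents, and so on. It does not mention language models, training paradigms, inference acceleration, model compression, or any specific AI for Science application (such as bioinformatics). All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes RAM, a generalizable markerless 3D human motion capture method. Using a motion-aware semantic tracker, a memory-augmented Temporal HMR module, and a lightweight predictor, it substantially improves tracking stability and 3D accuracy in in-the-wild multi-person scenes.

Abstract (Chinese translation)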

RAM采用了一种结合自适应卡尔曼滤波的运动感知语义跟踪器，以在严重遮挡和动态交互下实现鲁棒的身份关联。记忆增强的时序人体网格恢复模块通过注入时空先验信息，进一步提升了人体运动重建能力，实现了一致且平滑的运动估计。此外，轻量级的预测器模块能够预测未来姿态以保持重建的连续性，而门控融合器则自适应地融合重建特征与预测特征，确保了结果的连贯性与鲁棒性。在PoseTrack和3DPW等真实场景多人基准测试上的实验表明，RAM在零样本跟踪稳定性和三维精度方面均显著超越了以往的先进方法，为无标记野外三维人体运动捕捉提供了一个可泛化的范式。

Abstract

RAM incorporates a motion-aware semantic tracker with adaptive Kalman filtering to achieve robust identity association under severe occlusions and dynamic interactions. A memory-augmented Temporal HMR module further enhances human motion reconstruction by injecting spatio-temporal priors for consistent and smooth motion estimation. Moreover, a lightweight Predictor module forecasts future poses to maintain reconstruction continuity, while a gated combiner adaptively fuses reconstructed and predicted features to ensure coherence and robustness. Experiments on in-the-wild multi-person benchmarks such as PoseTrack and 3DPW, demonstrate that RAM substantially outperforms previous state-of-the-art in both Zero-shot tracking stability and 3D accuracy, offering a generalizable paradigm for markerless 3D human motion capture in-the-wild.

Keywords: 3D human motion capture, in-the-wild, motion-aware semantic tracker, Temporal HMR, adaptive Kalman filtering, multi-person tracking, pose estimation, markerless motion reconstruction
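
The abstract above describes a gated combiner that adaptively fuses reconstructed and predicted features. A common form of such gating, sketched below, is an element-wise convex blend controlled by a sigmoid gate. This is only an illustration of the general technique: in a real model the gate logits would come from a learned layer, and the function name and signature here are assumptions, not RAM's implementation.

```python
import math

def gated_fuse(reconstructed, predicted, gate_logits):
    """Blend two feature vectors element-wise with a sigmoid gate."""
    fused = []
    for r, p, z in zip(reconstructed, predicted, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))  # gate value in (0, 1)
        # g near 1 trusts the reconstruction; g near 0 trusts the prediction.
        fused.append(g * r + (1.0 - g) * p)
    return fused
```

With a gate logit of 0 the two sources are averaged. A strongly negative logit falls back on the predicted feature, which is the behavior one would want when the current frame is occluded and the reconstruction is unreliable.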

53. ❌ Span-Level Machine Translation Meta-Evaluation

Authors: Stefano Perrella, Eric Morales Agostinho, Hugo Zaragoza Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19921v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

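Rationale: The paper studies meta-evaluation methods for automatic machine translation (MT) evaluation, in particular how to assess error-detection capability. Although it involves natural language processing (NLP) and AI applications, it concentrates on the design and comparison of evaluation metrics (precision, recall, F-score) and on a new meta-evaluation strategy (MPP). It does not address the technical principles of large models (LLMs), training methods (pre-training, fine-tuning, alignment), inference optimization, agent systems, model compression, or other keyword topics, nor concrete applications in scientific fields such as bioinformatics. No keyword is relevant, so the score is 0.

!!! tip deepseek-chat TL;DR

The paper studies how to reliably evaluate the performance of automatic MT error detection systems. It proposes a robust meta-evaluation strategy called “match with partial overlap and partial credit” (MPP) and uses it to assess the current state of the art in MT error detection.

Abstract (Chinese translation)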

近年来，机器翻译（Machine Translation, MT）与自动机器翻译评估技术取得了显著进步，催生了众多创新应用。自动评估技术已从生成单一质量分数，发展到能够精确定位翻译错误并为其分配错误类别与严重性等级。然而，对于执行错误检测的自动评估工具，如何可靠地衡量其评估能力仍不明确，因为现有文献中尚未建立成熟的方法。本研究探讨了跨度级精确率、召回率与F分数的不同实现方式，结果表明看似相似的方法可能产生差异显著的排名，且某些广泛使用的技术并不适用于评估机器翻译错误检测。我们提出采用“允许部分重叠与部分赋分的匹配”（match with partial overlap and partial credit, MPP）结合微平均作为稳健的元评估策略，并公开其实现代码。最后，我们运用MPP方法对当前机器翻译错误检测的最新技术水平进行了评估。

Abstract

Machine Translation (MT) and automatic MT evaluation have improved dramatically in recent years, enabling numerous novel applications. Automatic evaluation techniques have evolved from producing scalar quality scores to precisely locating translation errors and assigning them error categories and severity levels. However, it remains unclear how to reliably measure the evaluation capabilities of auto-evaluators that do error detection, as no established technique exists in the literature. This work investigates different implementations of span-level precision, recall, and F-score, showing that seemingly similar approaches can yield substantially different rankings, and that certain widely-used techniques are unsuitable for evaluating MT error detection. We propose “match with partial overlap and partial credit” (MPP) with micro-averaging as a robust meta-evaluation strategy and release code for its use publicly. Finally, we use MPP to assess the state of the art in MT error detection.

Keywords: Machine Translation, MT evaluation, error detection, meta-evaluation, precision recall F-score, span-level evaluation, MPP, automatic evaluation
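
The abstract above contrasts different implementations of span-level precision, recall, and F-score. The sketch below shows one way to combine partial overlap, partial credit, and micro-averaging. It is not the authors' released MPP code: the greedy one-to-one matching, the credit rule (overlap divided by span length), and all function names are assumptions. Spans are half-open character offsets `(start, end)`.

```python
def overlap(a, b):
    """Number of shared characters between half-open spans a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def match_spans(pred, gold):
    """Greedy one-to-one matching, largest overlaps first."""
    pairs = sorted(
        ((overlap(p, g), i, j)
         for i, p in enumerate(pred) for j, g in enumerate(gold)),
        reverse=True,
    )
    used_p, used_g, matches = set(), set(), []
    for ov, i, j in pairs:
        if ov > 0 and i not in used_p and j not in used_g:
            used_p.add(i)
            used_g.add(j)
            matches.append((pred[i], gold[j], ov))
    return matches

def micro_prf(segments):
    """segments: list of (predicted_spans, gold_spans), one pair per translation."""
    p_credit = r_credit = 0.0
    n_pred = n_gold = 0
    for pred, gold in segments:
        n_pred += len(pred)
        n_gold += len(gold)
        for p, g, ov in match_spans(pred, gold):
            p_credit += ov / (p[1] - p[0])  # share of the predicted span that is correct
            r_credit += ov / (g[1] - g[0])  # share of the gold span that was found
    precision = p_credit / n_pred if n_pred else 0.0
    recall = r_credit / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Micro-averaging pools credit over all segments before dividing, so segments with many error spans weigh more than segments with few. Switching to exact-match credit or to macro-averaging can reorder the systems being compared, which is the sensitivity the abstract points to.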

54. ❌ Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery

Authors: Jizhou Han, Chenhao Ding, Yuhang He, Qiang Wang, Shaokun Wang, SongLin Dong, Yihong Gong Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19918v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

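Rationale: The paper studies Generalized Category Discovery (GCD) and proposes an Analogical Textual Concept Generator (ATCG) module, which generates textual concepts for unlabeled samples by analogizing from labeled knowledge to new observations and fuses them with visual features to improve category separation. It mainly concerns computer vision, multi-modal learning, and category discovery, and does not explicitly mention or use large language models, innovations in deep learning principles, or any of the specific techniques named in the scoring keywords. All keywords relate directly to large models, deep learning principles, or applications of AI to science, whereas this paper focuses on visual-textual multi-modal learning for category discovery and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper targets the brittle boundaries that visual-only methods produce on fine-grained, look-alike categories in Generalized Category Discovery. It proposes an Analogical Textual Concept Generator and fuses visual features with analogically generated textual concepts to improve recognition of both known and novel categories, with significant gains on multiple benchmarks.

Abstract (Chinese translation)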

广义类别发现（GCD）旨在从未标注数据中发掘新类别，同时保持对已知类别的识别能力。然而，当前主流的纯视觉方法以及监督学习与发现过程之间的松散耦合，往往在细粒度、外观相似的类别上产生脆弱的分类边界。我们提出了类比文本概念生成器（ATCG），这是一个即插即用模块，能够从已标注知识中类比推理新观察样本，为未标注样本生成文本概念。将这些类比文本概念与视觉特征融合后，发现过程转化为视觉-文本推理任务，从而将先验知识迁移至新数据并增强类别区分度。ATCG可适配参数化与聚类式两类GCD框架，无需改变其整体设计。在六个基准测试中，ATCG持续提升了整体、已知类别及新类别的识别性能，并在细粒度数据上取得了最显著的改进。代码已开源：https://github.com/zhou-9527/AnaLogical-GCD。

Abstract

Generalized Category Discovery (GCD) seeks to uncover novel categories in unlabeled data while preserving recognition of known categories, yet prevailing visual-only pipelines and the loose coupling between supervised learning and discovery often yield brittle boundaries on fine-grained, look-alike categories. We introduce the Analogical Textual Concept Generator (ATCG), a plug-and-play module that analogizes from labeled knowledge to new observations, forming textual concepts for unlabeled samples. Fusing these analogical textual concepts with visual features turns discovery into a visual-textual reasoning process, transferring prior knowledge to novel data and sharpening category separation. ATCG attaches to both parametric and clustering style GCD pipelines and requires no changes to their overall design. Across six benchmarks, ATCG consistently improves overall, known-class, and novel-class performance, with the largest gains on fine-grained data. Our code is available at: https://github.com/zhou-9527/AnaLogical-GCD.

Keywords: Generalized Category Discovery, Analogical Textual Concept Generator, Visual-Textual Reasoning, Fine-grained Categories, Novel Category Discovery, Multi-modal Learning, Plug-and-play Module

55. ❌ Revealing Domain-Spatiality Patterns for Configuration Tuning: Domain Knowledge Meets Fitness Landscapes

Authors: Yulong Ye, Hongyuan Liang, Chao Jiang, Miqing Li, Tao Chen Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19897v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

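Rationale: The paper studies software configuration tuning, using Fitness Landscape Analysis (FLA) combined with domain knowledge to explain why tuners succeed or fail. Its subject belongs to software engineering and system optimization and is unrelated to all scoring keywords, which focus on large models, deep learning techniques, and their applications. It involves no large-model technique, AI for Science application, or related methodological innovation.

!!! tip deepseek-chat TL;DR

The paper proposes Domland, a methodology that combines fitness landscape analysis with domain knowledge to explain why tuners succeed or fail in software configuration tuning. Its case study finds that configuration landscapes are system-specific, that core options have a stronger influence on tuning difficulty, and that workload effects vary by system.

Abstract (Chinese translation)

为提升性能而进行的配置调优在质量保障中至关重要。然而，由于可配置系统的黑盒特性，调优工具的有效性长期以来一直存在谜团。先前的研究主要采用静态领域分析（如静态污点分析），其往往缺乏普适性；或采用动态数据分析（如基准性能分析），但可解释性有限。在本工作中，我们引入适应度景观分析（Fitness Landscape Analysis, FLA）作为领域知识与调优难度之间的桥梁。我们提出了Domland，一种双管齐下的方法，它协同整合从FLA获得的空间信息与领域驱动分析，以系统性地捕捉配置调优案例的隐藏特征，从而解释调优工具为何及如何成功或失败。这有助于更好地理解和情境化调优工具的行为，并为调优工具的设计提供依据。为评估Domland，我们对九个软件系统和93个工作负载进行了案例研究，从中揭示了若干关键发现：（1）配置景观本质上是系统特定的，没有单一的领域因素（如系统领域、编程语言或资源密集度）能持续塑造其结构；（2）与资源选项（如x264的cpu-independent）相比，控制主要功能流的核心选项（如x264的pic-struct）对景观崎岖度（即调优难度）的影响更强；（3）工作负载对景观结构的影响并非统一与其类型或规模绑定。二者均会导致景观变化，但其影响程度取决于具体系统。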

Abstract

Configuration tuning for better performance is crucial in quality assurance. Yet, there has long been a mystery on tuners’ effectiveness, due to the black-box nature of configurable systems. Prior efforts predominantly adopt static domain analysis (e.g., static taint analysis), which often lacks generalizability, or dynamic data analysis (e.g., benchmarking performance analysis), limiting explainability. In this work, we embrace Fitness Landscape Analysis (FLA) as a bridge between domain knowledge and difficulty of the tuning. We propose Domland, a two-pronged methodology that synergizes the spatial information obtained from FLA and domain-driven analysis to systematically capture the hidden characteristics of configuration tuning cases, explaining how and why a tuner might succeed or fail. This helps to better interpret and contextualize the behavior of tuners and inform tuner design. To evaluate Domland, we conduct a case study of nine software systems and 93 workloads, from which we reveal several key findings: (1) configuration landscapes are inherently system-specific, with no single domain factor (e.g., system area, programming language, or resource intensity) consistently shaping their structure; (2) the core options (e.g., pic-struct of x264), which control the main functional flows, exert a stronger influence on landscape ruggedness (i.e. the difficulty of tuning) compared to resource options (e.g., cpu-independent of x264); (3) Workload effects on landscape structure are not uniformly tied to type or scale. Both contribute to landscape variations, but their impact is system-dependent.

Keywords: Configuration Tuning, Fitness Landscape Analysis, Domain Knowledge, Tuner Effectiveness, Software Systems, Workload Analysis, System-specific Landscapes, Core Options
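
The abstract above relates landscape ruggedness to tuning difficulty. A widely used ruggedness indicator in fitness landscape analysis is the lag-1 autocorrelation of fitness values along a walk through neighbouring configurations: values near 1 suggest a smooth landscape, and low or negative values suggest a rugged one. The sketch below shows that generic measure. Whether Domland uses this exact metric is not stated in the abstract, so treat it as an assumption.

```python
def lag1_autocorrelation(fitness_walk):
    """Lag-1 autocorrelation of fitness values sampled along a walk."""
    n = len(fitness_walk)
    mean = sum(fitness_walk) / n
    var = sum((f - mean) ** 2 for f in fitness_walk)
    if var == 0:
        return 1.0  # a flat walk is treated as perfectly smooth
    cov = sum(
        (fitness_walk[i] - mean) * (fitness_walk[i + 1] - mean)
        for i in range(n - 1)
    )
    return cov / var
```

A steadily increasing walk such as `[1, 2, 3, 4, 5, 6, 7, 8]` gives a positive value, while an oscillating walk such as `[1, 8, 1, 8, 1, 8, 1, 8]` gives a negative one. In practice the walk would be a sequence of measured performance values for configurations that differ by one option at a time.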

56. ❌ Utility-Guided Agent Orchestration for Efficient LLM Tool Use

Authors: Boyan Liu, Gongming Zhao, Hongli Xu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19896v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

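Rationale: The core topic of the paper is the orchestration policy of LLM agents, which directly involves LLM agents and tool use, so these two keywords are highly relevant (10). The paper mentions multi-step reasoning methods such as ReAct, which gives it some connection to Chain of Thought (5). Other keywords such as MoE, SLMs, Scaling Laws, Pre-training, RAG, and Quantization are not mentioned in the abstract and are unrelated (0).

!!! tip deepseek-chat TL;DR

The paper studies the trade-off between quality and execution cost in tool-using LLM agents. It proposes a utility-based orchestration policy framework that controls agent behavior by balancing estimated gain, step cost, uncertainty, and redundancy; experiments show that explicit orchestration signals substantially affect agent behavior.

Abstract (Chinese translation)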

使用工具的大型语言模型（LLM）智能体常常面临答案质量与执行成本之间的根本性矛盾。固定工作流程稳定但缺乏灵活性，而自由形式的多步推理方法（如ReAct）虽可能提升任务表现，却往往以过多的工具调用、更长的执行轨迹、更高的令牌消耗以及增加的延迟为代价。本文中，我们将智能体编排视为一个明确的决策问题进行研究，而非完全交由提示层面的行为决定。我们提出了一种基于效用引导的编排策略，该策略通过平衡预估收益、步骤成本、不确定性和冗余度，在响应、检索、工具调用、验证和停止等动作中进行选择。我们的目标并非宣称获得普适性的最佳任务表现，而是为研究使用工具的LLM智能体在质量与成本间的权衡提供一个可控且可分析的策略框架。在直接回答、阈值控制、固定工作流、ReAct以及多种策略变体的实验中，明确的编排信号显著影响了智能体行为。关于成本定义、工作流公平性和冗余控制的进一步分析表明，轻量级的效用设计能够为智能体控制提供一个合理且实用的机制。

Abstract

Tool-using large language model (LLM) agents often face a fundamental tension between answer quality and execution cost. Fixed workflows are stable but inflexible, while free-form multi-step reasoning methods such as ReAct may improve task performance at the expense of excessive tool calls, longer trajectories, higher token consumption, and increased latency. In this paper, we study agent orchestration as an explicit decision problem rather than leaving it entirely to prompt-level behavior. We propose a utility-guided orchestration policy that selects among actions such as respond, retrieve, tool call, verify, and stop by balancing estimated gain, step cost, uncertainty, and redundancy. Our goal is not to claim universally best task performance, but to provide a controllable and analyzable policy framework for studying quality-cost trade-offs in tool-using LLM agents. Experiments across direct answering, threshold control, fixed workflows, ReAct, and several policy variants show that explicit orchestration signals substantially affect agent behavior. Additional analyses on cost definitions, workflow fairness, and redundancy control further demonstrate that lightweight utility design can provide a defensible and practical mechanism for agent control.

Keywords: LLM agents, tool use, agent orchestration, utility-guided policy, quality-cost trade-off, ReAct, multi-step reasoning, execution cost
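The abstract describes a policy that chooses among actions by balancing estimated gain, step cost, uncertainty, and redundancy, but it does not give the utility function. The sketch below is one possible reading, not the authors' formulation: the linear form, the weights `w_cost`, `w_unc`, `w_red`, the `Candidate` class, and all numeric estimates are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate action with hypothetical per-step estimates."""
    action: str         # "respond", "retrieve", "tool_call", "verify" or "stop"
    gain: float         # estimated improvement in answer quality
    cost: float         # step cost (tokens, latency, tool fees, ...)
    uncertainty: float  # how unsure the gain estimate is
    redundancy: float   # overlap with steps already taken


def utility(c: Candidate, w_cost: float = 1.0, w_unc: float = 0.5,
            w_red: float = 0.5) -> float:
    # Assumed linear form: reward expected gain, penalise the other terms.
    return c.gain - w_cost * c.cost - w_unc * c.uncertainty - w_red * c.redundancy


def select_action(candidates: list[Candidate], **weights: float) -> Candidate:
    # Pick the candidate with the highest utility under the given weights.
    return max(candidates, key=lambda c: utility(c, **weights))


# Illustrative numbers only; they are not taken from the paper.
candidates = [
    Candidate("respond", gain=0.4, cost=0.1, uncertainty=0.3, redundancy=0.0),
    Candidate("retrieve", gain=0.7, cost=0.3, uncertainty=0.2, redundancy=0.1),
    Candidate("tool_call", gain=0.8, cost=0.6, uncertainty=0.2, redundancy=0.4),
]
print(select_action(candidates).action)
print(select_action(candidates, w_cost=3.0).action)
```

With the default weights the sketch picks `retrieve`; raising `w_cost` to 3.0 makes the cheap `respond` action win, which illustrates how an explicit cost signal can shift agent behavior.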

57. ❌ Integrating Meta-Features with Knowledge Graph Embeddings for Meta-Learning

Authors: Antonis Klironomos, Ioannis Dasoulas, Francesco Periti, Mohamed Gad-Elrab, Heiko Paulheim, Anastasia Dimou, Evgeny Kharlamov Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19888v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

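Scoring rationale: The paper studies knowledge graph embedding methods for meta-learning, focusing on consolidating and exploiting machine learning experiment records for pipeline performance estimation and dataset similarity estimation. It does not involve large models, deep learning principles, AI applications in science, or any of the specific techniques named in the scoring keywords (LLMs, MoE, RLHF, RAG, and so on). The keywords all concern large-model techniques, training methods, inference optimization, alignment, compression, or scientific applications, whereas this paper belongs to classical machine-learning meta-learning and is unrelated to them.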

!!! tip deepseek-chat TL;DR

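This paper proposes KGmetaSP, a knowledge-graph-embedding method that uses existing experiment data to improve pipeline performance estimation and dataset similarity estimation in meta-learning, and validates it on a large-scale OpenML benchmark.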

摘要翻译

网络上可获取的大量机器学习记录为元学习提供了重要机遇，即利用历史实验来提升性能。其中两项关键的元学习任务是：流水线性能估计（PPE，用于预测流水线在目标数据集上的性能）和基于数据集性能的相似性估计（DPSE，用于识别具有相似性能模式的数据集）。现有方法主要依赖数据集元特征（例如实例数量、类别熵等）对数据集进行数值化表示，以近似处理这些元学习任务。然而，这些方法往往忽略了可用的丰富历史实验结果和流水线元数据，从而限制了其捕捉能够揭示性能相似性模式的数据集-流水线交互的能力。本研究提出KGmetaSP，一种基于知识图谱嵌入的方法，该方法利用现有实验数据来捕捉这些交互，并同时改进PPE和DPSE。我们将数据集和流水线在一个统一的知识图谱（KG）中进行表示，并推导出嵌入表示，这些嵌入支持用于PPE的与流水线无关的元模型，以及用于DPSE的基于距离的检索。为验证我们的方法，我们构建了一个包含144,177个OpenML实验的大规模基准测试，从而实现了丰富的跨数据集评估。KGmetaSP能够使用单一的、与流水线无关的元模型实现精确的PPE，并在DPSE上超越了基线方法。所提出的KGmetaSP方法、知识图谱及基准测试均已公开发布，为元学习设立了新的参考基准，并展示了将开放实验数据整合到统一知识图谱中如何推动该领域的发展。

摘要 (Abstract)

The vast collection of machine learning records available on the web presents a significant opportunity for meta-learning, where past experiments are leveraged to improve performance. Two crucial meta-learning tasks are pipeline performance estimation (PPE), which predicts pipeline performance on target datasets, and dataset performance-based similarity estimation (DPSE), which identifies datasets with similar performance patterns. Existing approaches primarily rely on dataset meta-features (e.g., number of instances, class entropy, etc.) to represent datasets numerically and approximate these meta-learning tasks. However, these approaches often overlook the wealth of past experimental results and pipeline metadata available. This limits their ability to capture dataset - pipeline interactions that reveal performance similarity patterns. In this work, we propose KGmetaSP, a knowledge-graph-embeddings approach that leverages existing experiment data to capture these interactions and improve both PPE and DPSE. We represent datasets and pipelines within a unified knowledge graph (KG) and derive embeddings that support pipeline-agnostic meta-models for PPE and distance-based retrieval for DPSE. To validate our approach, we construct a large-scale benchmark comprising 144,177 OpenML experiments, enabling a rich cross-dataset evaluation. KGmetaSP enables accurate PPE using a single pipeline-agnostic meta-model and improves DPSE over baselines. The proposed KGmetaSP, KG, and benchmark are released, establishing a new reference point for meta-learning and demonstrating how consolidating open experiment data into a unified KG advances the field.

Keywords: meta-learning, knowledge graph embeddings, pipeline performance estimation, dataset similarity, OpenML experiments, KGmetaSP, machine learning records, cross-dataset evaluation
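The abstract states that DPSE is handled by distance-based retrieval over dataset embeddings. A minimal sketch of that retrieval step follows. The cosine distance, the function names, and the toy two-dimensional embeddings are assumptions; the abstract does not name the metric, and real embeddings would come from the knowledge graph.

```python
import math


def cosine_distance(u, v):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_similar(query, embeddings, k=2):
    # Rank every other dataset by its distance to the query embedding.
    q = embeddings[query]
    others = [name for name in embeddings if name != query]
    others.sort(key=lambda name: cosine_distance(q, embeddings[name]))
    return others[:k]


# Hypothetical dataset embeddings, for illustration only.
embeddings = {
    "dataset_a": [1.0, 0.0],
    "dataset_b": [0.9, 0.1],
    "dataset_c": [0.0, 1.0],
}
print(most_similar("dataset_a", embeddings, k=1))
```

Here `dataset_b` is returned as the nearest neighbour of `dataset_a` because their vectors point in almost the same direction.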

58. ❌ What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

Authors: Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19880v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

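Scoring rationale: The paper's core contribution is SCRL, a Test-Time Reinforcement Learning (TTRL) framework that aims to improve LLM reasoning at test time. Highly relevant keywords: 1) 'Large Language Models' (the paper explicitly studies LLMs); 2) 'Chain of Thought' and 'System 2 Thinking' (the paper focuses on strengthening reasoning); 3) 'Self-Correction' (SCRL corrects erroneous trajectories through its negative pseudo-labeling mechanism); 4) 'Hallucination Mitigation' (it addresses the reinforcement of errors caused by incorrect consensus). Other keywords, such as MoE, quantization, and AI for science, are not addressed and are rated 0.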

!!! tip deepseek-chat TL;DR

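This paper addresses the problem that majority-vote consensus in test-time reinforcement learning can reinforce incorrect trajectories. It proposes SCRL, a Selective-Complementary Reinforcement Learning framework that combines strict consensus filtering with an uncertainty-based negative supervision mechanism, substantially improving LLM performance on multiple reasoning benchmarks while remaining robust.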

摘要翻译

测试时强化学习（Test-Time Reinforcement Learning, TTRL）使大型语言模型（LLMs）能够通过从多数投票共识中推导伪奖励，从而在未标注的测试流上增强推理能力。然而，现有的TTRL方法仅依赖于正向伪标签策略。这种依赖在答案分布高度分散的挑战性场景下变得脆弱，导致弱共识无意中将错误轨迹作为监督信号进行强化。本文提出SCRL（选择性-互补强化学习），一种鲁棒的测试时强化学习框架，能有效缓解标签噪声放大问题。SCRL设计了选择性正向伪标签策略，通过严格的共识标准过滤不可靠的多数结果。互补地，SCRL引入了熵门控负向伪标签策略——这是TTRL中首个负向监督机制——基于生成不确定性可靠地修剪错误轨迹。在多个推理基准上的大量实验表明，SCRL相较于基线方法实现了显著提升，同时在受限的推演预算下保持了鲁棒的泛化能力和训练稳定性。我们的代码发布于https://github.com/Jasper-Yan/SCRL。

摘要 (Abstract)

Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.

Keywords: Test-Time Reinforcement Learning, Large Language Models, Reasoning Capabilities, Majority Voting Consensus, Selective-Complementary Reinforcement Learning, Negative Pseudo-Labeling, Hallucination Mitigation, Self-Correction
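The two labeling ideas in the abstract, accepting a majority answer only under strict consensus and assigning negative pseudo-labels behind an entropy gate, can be sketched roughly as follows. The abstract does not give the actual rules, so everything here is an assumption: the thresholds `pos_threshold`, `neg_share`, and `max_entropy` are hypothetical, and the gate uses the entropy of the empirical answer distribution as a stand-in for the paper's "generation uncertainty".

```python
import math
from collections import Counter


def answer_entropy(answers):
    # Shannon entropy (in nats) of the empirical answer distribution.
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in Counter(answers).values())


def pseudo_labels(answers, pos_threshold=0.6, neg_share=0.1, max_entropy=1.0):
    counts = Counter(answers)
    n = len(answers)
    top, top_count = counts.most_common(1)[0]
    # Selective positive label: keep the majority answer only under strong consensus.
    positive = top if top_count / n >= pos_threshold else None
    # Gated negative labels: mark rare answers as wrong only when the gate is open.
    negatives = set()
    if answer_entropy(answers) <= max_entropy:
        negatives = {a for a, c in counts.items()
                     if c / n <= neg_share and a != positive}
    return positive, negatives


confident = ["42"] * 7 + ["41"] * 2 + ["7"]   # clear majority
dispersed = ["1", "2", "3", "4", "5", "1"]    # weak consensus
print(pseudo_labels(confident))
print(pseudo_labels(dispersed))
```

In the first rollout set the sketch accepts "42" as the positive label and marks "7" as negative; in the dispersed set it emits no labels at all, which is the kind of weak-consensus case the paper says should not be reinforced.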

59. ❌ Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them

Authors: Michael Hubbertz, Qi Han, Tobias Meisen Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19852v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

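Scoring rationale: The paper focuses on deep learning-based online mapping for autonomous driving, studying its generalization failure modes, evaluation methods, and dataset design. All scoring keywords relate directly to large language models (LLMs) and associated techniques (MoE, SFT, RAG, quantization, inference acceleration, and so on) or to AI for Science (bioinformatics, cheminformatics). The paper does not involve LLMs, large-model techniques, or scientific AI applications; it addresses the generalization of deep learning models purely within computer vision and autonomous driving, so every keyword has a relevance of 0.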

!!! tip deepseek-chat TL;DR

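This paper studies how deep learning-based online mapping models fail to generalize to unfamiliar environments. It proposes a framework that identifies and measures failure modes by disentangling memorization from overfitting to map geometry, and develops a dataset balancing strategy based on map geometry diversity; experiments show that the approach yields more reliable model evaluation and better performance.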

摘要翻译

基于深度学习的在线建图已成为自动驾驶的基石，但这些模型往往难以泛化至熟悉环境之外的场景。我们提出了一个框架，通过解耦两种效应来识别和衡量其底层失效模式：对输入特征的记忆效应以及对已知地图几何结构的过拟合。我们提出了基于评估子集的度量方法，这些子集控制了训练场景与验证场景之间的地理邻近性和几何相似性。我们引入了基于弗雷歇距离的重建统计量，无需阈值调优即可捕捉每个元素的形状保真度，并定义了互补的失效模式评分：定位过拟合评分（用于量化当地理线索消失时的性能下降）和地图几何过拟合评分（用于衡量场景几何新颖性增加时的性能退化）。除了模型分析，我们还研究了数据集偏差，并贡献了地图几何感知的诊断工具：用于训练集的最小生成树（MST）多样性度量，以及用于量化数据划分间几何相似性的对称覆盖度量。基于这些工具，我们提出了一种基于MST的稀疏化策略，该策略能在缩小训练集规模的同时减少冗余、改善平衡性并提升性能。在nuScenes和Argoverse 2数据集上对多个前沿模型进行的实验，为泛化能力提供了更可信的评估，并表明具有地图几何多样性且平衡的训练集能够带来性能提升。我们的研究结果启示了面向可部署在线建图的失效模式感知评估协议以及以地图几何为核心的数据集设计。

摘要 (Abstract)

Deep learning-based online mapping has emerged as a cornerstone of autonomous driving, yet these models frequently fail to generalize beyond familiar environments. We propose a framework to identify and measure the underlying failure modes by disentangling two effects: Memorization of input features and overfitting to known map geometries. We propose measures based on evaluation subsets that control for geographical proximity and geometric similarity between training and validation scenes. We introduce Fréchet distance-based reconstruction statistics that capture per-element shape fidelity without threshold tuning, and define complementary failure-mode scores: a localization overfitting score quantifying the performance drop when geographic cues disappear, and a map geometry overfitting score measuring degradation as scenes become geometrically novel. Beyond models, we analyze dataset biases and contribute map geometry-aware diagnostics: A minimum-spanning-tree (MST) diversity measure for training sets and a symmetric coverage measure to quantify geometric similarity between splits. Leveraging these, we formulate an MST-based sparsification strategy that reduces redundancy and improves balancing and performance while shrinking training size. Experiments on nuScenes and Argoverse 2 across multiple state-of-the-art models yield more trustworthy assessment of generalization and show that map geometry-diverse and balanced training sets lead to improved performance. Our results motivate failure-mode-aware protocols and map geometry-centric dataset design for deployable online mapping.

Keywords: deep learning-based online mapping, failure modes, generalization, map geometry overfitting, dataset bias, Fréchet distance, minimum-spanning-tree diversity, autonomous driving
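The abstract relies on Fréchet distance to score how faithfully a predicted map element reproduces the shape of its ground truth. As a rough illustration, the sketch below computes the discrete Fréchet distance between two polylines with the standard dynamic-programming recurrence. Whether the paper uses the discrete or the continuous variant is not stated, and the example polylines are made up.

```python
import math


def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p and q (lists of points)."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]  # ca[i][j]: best coupling cost up to (i, j)
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance along p, along q, or along both; keep the cheapest path.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


# Hypothetical lane boundary and a prediction shifted sideways by one unit.
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
prediction = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(discrete_frechet(ground_truth, prediction))
```

Because the metric returns a distance in the map's own units, it can be aggregated per element without choosing a match threshold, in line with the "without threshold tuning" claim in the abstract.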

60. ❌ Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue

Authors: Riccardo Scantamburlo, Mauro Mezzanzana, Giacomo Buonanno, Francesco Bertolotti Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19849v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

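Scoring rationale: The paper's core topic is distinguishing LLM-generated text from human dialogue. It proposes a lightweight metric based on differences in semantic distribution, which falls under LLM behavior analysis and detection. It is highly relevant to "Large Language Models" (10 points) because it directly studies the characteristics of LLM-generated text. It is partially relevant to "Mechanistic Interpretability OR Explainable AI" (5 points) because the proposed semantic delta metric aims to provide an interpretable statistical feature for understanding how LLM and human dialogue differ. The other keywords cover model architecture, training methods, inference optimization, and application domains, none of which the paper addresses, so they are rated 0.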

!!! tip deepseek-chat TL;DR

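This paper studies how to distinguish human dialogue from LLM-generated dialogue. It proposes "semantic delta", a lightweight metric based on the distribution of semantic categories, and finds that AI-generated text shows stronger thematic concentration and a more rigid topic structure than human dialogue.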

摘要翻译

大型语言模型（LLM）的对话方式与我们相似吗？这一问题引起了众多学者的兴趣，并在从教育到学术界的多个领域具有重要意义。本研究提出了一种可解释的统计特征，用于区分人类书写和LLM生成的对话。我们引入了一种基于语义类别分布推导的轻量级度量指标。利用Empath词汇分析框架，每个文本被映射为一组主题强度分数。我们将“语义差值”定义为对话中两个最主导类别强度之间的差异，并假设LLM输出比人类对话表现出更强的主题集中性。为验证这一假设，我们从多种LLM配置生成对话数据，并将其与多样化的人类语料（包括剧本对话、文学作品和在线讨论）进行比较。对所得语义差值分布进行韦尔奇t检验的结果表明，AI生成的文本始终产生比人类文本更高的语义差值，意味着其话题结构更为僵化；而人类对话则展现出更广泛且更均衡的语义分布。本研究所提出的零样本度量指标并非旨在取代现有检测技术，而是提供一种计算成本低廉的补充信号，可集成至组合检测系统中。这些发现亦有助于从更广泛的实证角度理解LLM的行为模仿能力，并表明主题分布构成一个可量化的维度，当前模型在该维度上尚未达到人类对话的动态性水平。

摘要 (Abstract)

Do LLMs talk like us? This question intrigues a multitude of scholars, and it is relevant in many fields, from education to academia. This work presents an interpretable statistical feature for distinguishing human-written and LLM-generated dialogue. We introduce a lightweight metric derived from the distribution of semantic categories. Using the Empath lexical analysis framework, each text is mapped to a set of thematic intensity scores. We define semantic delta as the difference between the two most dominant category intensities within a dialogue, hypothesizing that LLM outputs exhibit stronger thematic concentration than human discourse. To evaluate this hypothesis, conversational data were generated from multiple LLM configurations and compared against heterogeneous human corpora, including scripted dialogue, literary works, and online discussions. A Welch t-test was applied to the resulting distributions of semantic delta values. Results show that AI-generated texts consistently produce higher deltas than human texts, indicating a more rigid topic structure, whereas human dialogue displays a broader and more balanced semantic spread. Rather than replacing existing detection techniques, the proposed zero-shot metric provides a computationally inexpensive complementary signal that can be integrated into ensemble detection systems. These findings also contribute to the broader empirical understanding of LLM behavioural mimicry and suggest that thematic distribution constitutes a quantifiable dimension along which current models fall short of human conversational dynamics.

Keywords: LLM dialogue detection, semantic delta, interpretable metric, thematic concentration, human vs AI text, Empath framework, zero-shot detection, behavioral mimicry
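The metric defined in the abstract is simple enough to sketch directly: semantic delta is the gap between the two largest category intensities of a dialogue, and the two groups of deltas are compared with a Welch t-test. In the sketch below the category scores are hypothetical hand-written dictionaries standing in for the output of a lexical analyzer such as Empath, and only the t statistic is computed (no p-value).

```python
import math
import statistics


def semantic_delta(scores):
    # Gap between the two most dominant category intensities.
    top_two = sorted(scores.values(), reverse=True)[:2]
    if len(top_two) < 2:
        raise ValueError("need at least two category scores")
    return top_two[0] - top_two[1]


def welch_t(a, b):
    # Welch's t statistic for two samples with possibly unequal variances.
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b))


# Hypothetical category intensities for one dialogue.
dialogue_scores = {"travel": 0.5, "food": 0.2, "weather": 0.1}
print(semantic_delta(dialogue_scores))

# Hypothetical delta samples for two corpora.
llm_deltas = [2.0, 4.0, 6.0]
human_deltas = [1.0, 2.0, 3.0]
print(welch_t(llm_deltas, human_deltas))
```

A positive t statistic here means the first group has the larger mean delta, which is the direction the paper reports for AI-generated text.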

61. ❌ Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

Authors: Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19831v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

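Scoring rationale: The paper's core contributions are: 1) the first use of hand gestures for prosody control in speech synthesis, an innovation in cross-modal applications of large models (LLM relevance: 10 points); 2) a multimodal Mixture-of-Experts architecture, a concrete application of MoE techniques (MoE relevance: 10 points). Other keywords, such as SLMs, Scaling Laws, and Alignment, are neither mentioned in the abstract nor relevant, so they are rated 0.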

!!! tip deepseek-chat TL;DR

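This paper studies how hand gestures can modulate the prosody of synthesized speech. It proposes Gesture2Speech, a novel multimodal Mixture-of-Experts framework that fuses linguistic content with gesture features to align gestures and speech prosody in time, and it outperforms existing baselines on the PATS dataset.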

摘要翻译

人类交流无缝整合了语音与身体动作，其中手势自然地补充了语音韵律以表达意图、情感和强调。尽管近期文本转语音系统已开始融入面部表情或唇部运动等多模态线索，但手势在塑造韵律方面的作用在很大程度上仍未得到充分探索。我们提出了一种新颖的多模态文本转语音框架——Gesture2Speech，该框架利用视觉手势线索来调制合成语音的韵律。受自信且富有表现力的说话者会将手势与语音韵律协调配合这一观察的启发，我们引入了一种多模态混合专家架构，该架构在专用的风格提取模块中动态融合语言内容和手势特征。融合后的表征条件化一个基于大语言模型的语音解码器，从而实现与手部运动在时间上对齐的韵律调制。我们进一步设计了一种手势-语音对齐损失函数，显式建模二者之间的时间对应关系，以确保手势与韵律轮廓之间的细粒度同步性。在PATS数据集上的评估表明，Gesture2Speech在语音自然度和手势-语音同步性方面均优于现有先进基线方法。据我们所知，这是首个在神经语音合成中利用手势线索进行韵律控制的研究工作。演示样本可在 https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/ 获取。

摘要 (Abstract)

Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/

Keywords: Gesture2Speech, multimodal TTS, hand gestures, prosody modulation, Mixture-of-Experts, LLM-based speech decoder, gesture-speech alignment, neural speech synthesis

62. ❌ FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization

Authors: Haijian Lu, Wei Wang, Jing Liu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19828v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

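Scoring rationale: The paper proposes the FormalEvolve framework, which uses LLM-driven mutation and crossover to generate diverse candidate formalizations. This is an application of large models to a scientific domain (mathematical autoformalization), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). The work falls under AI for Science, applied specifically to formal verification of mathematics, so it is also highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10 points). Other keywords, such as MoE, SFT, RAG, and inference acceleration, are either not mentioned in the abstract or unrelated to the paper's core content, so they are rated 0.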

!!! tip deepseek-chat TL;DR

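This paper studies how FormalEvolve, a neuro-symbolic evolutionary search framework, uses LLMs to generate diverse and prover-effective mathematical autoformalizations, reaching semantic hit rates (SH@100) of 58.0% on CombiBench and 84.9% on ProofNet and improving downstream proving performance on CombiBench.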

摘要翻译

自动形式化旨在将自然语言数学转化为可编译、可机器验证的语句。然而，语义一致性并不意味着证明器有效性：即使语义一致的形式化结果，在证明搜索成本和成功率上也可能存在显著差异。本研究将自动形式化定义为在有限预算下对语义一致的形式化方案进行测试时搜索的过程，并提出FormalEvolve——一个基于编译门控的神经符号进化框架。该框架通过大语言模型驱动的变异与交叉操作（辅以有界补丁修复）生成多样化候选形式化，同时利用符号化的抽象语法树（AST）重写操作进一步注入结构多样性。在CombiBench和ProofNet数据集上，在严格限定生成器调用次数T=100的条件下，FormalEvolve分别达到58.0%和84.9%的语义命中率（SH@100），并降低了语义成功案例在跨问题上的集中度（基尼系数更低）。在固定证明器资源条件下，FormalEvolve还提升了在CombiBench上的下游证明性能。代码将公开发布。

摘要 (Abstract)

Autoformalization aims to translate natural-language mathematics into compilable, machine-checkable statements. However, semantic consistency does not imply prover effectiveness: even semantically consistent formalizations can differ substantially in proof-search cost and success rate. In this work, we formulate autoformalization as a budgeted, test-time search for semantically consistent repertoires, and propose FormalEvolve, a compilation-gated neuro-symbolic evolutionary framework. FormalEvolve generates diverse candidates via LLM-driven mutation and crossover with bounded patch repair, while symbolic Abstract Syntax Tree (AST) rewrite operations further inject structural diversity. On CombiBench and ProofNet, under a strict generator-call budget of T = 100, FormalEvolve reaches semantic hit rates (SH@100) of 58.0% and 84.9%, and reduces cross-problem concentration of semantic successes (lower Gini). Under a fixed prover budget, FormalEvolve also improves downstream proving performance on CombiBench. Code will be released publicly.

Keywords: Autoformalization, Neuro-Symbolic Evolutionary Search, LLM-driven mutation, Abstract Syntax Tree, Semantic consistency, Prover effectiveness, FormalEvolve, ProofNet
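The abstract reports two summary statistics: a semantic hit rate (SH@100) and a Gini coefficient measuring how concentrated the semantic successes are across problems. The sketch below shows how such numbers could be computed from per-problem success counts. The exact definitions used in the paper are not given in the abstract, so reading the hit rate as "fraction of problems with at least one success" is an assumption, and the counts are invented.

```python
def hit_rate(successes_per_problem):
    # Assumed reading of SH@k: share of problems with at least one success.
    hits = sum(1 for s in successes_per_problem if s > 0)
    return hits / len(successes_per_problem)


def gini(values):
    # Gini coefficient of non-negative values: 0 means perfectly even.
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical success counts for four problems under the same call budget.
even = [1, 1, 1, 1]
concentrated = [0, 0, 0, 4]
print(hit_rate(even), gini(even))
print(hit_rate(concentrated), gini(concentrated))
```

Both toy runs contain four successes in total, but the concentrated one solves a single problem four times; its higher Gini and lower hit rate show why the paper treats a lower Gini as a sign of broader coverage.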

63. ❌ FrameNet Semantic Role Classification by Analogy

Authors: Van-Duy Ngo, Stergos Afantenos, Emiliano Lorini, Miguel Couceiro Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19825v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

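Scoring rationale: The paper studies an analogy-based method for FrameNet semantic role classification. It trains a lightweight artificial neural network for binary classification and recovers semantic roles through random sampling and analogical transfer. The work concentrates on semantic role labeling and analogical reasoning in natural language processing and does not involve large models, innovations in deep learning principles, or applications of large models to other domains. All scoring keywords concern large-model techniques, training methods, inference optimization, or application domains, whereas this paper studies a specific method for a traditional NLP task and is unrelated to them.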

!!! tip deepseek-chat TL;DR

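This paper proposes an analogy-based method for FrameNet semantic role classification. By recasting semantic role classification as a binary classification problem and training a lightweight neural network, it surpasses previous state-of-the-art results while remaining computationally efficient and frugal.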

摘要翻译

本文采用一种关系视角下的类比方法，应用于框架语义网（FrameNet）中的语义角色分类任务。我们将类比定义为框架触发词单元（LUs）与框架元素（FEs）对笛卡尔积上的形式关系，并利用此定义构建了一个新的数据集。该二元关系中的每个实例，若其框架元素共享相同的语义角色，则标记为有效类比实例，否则标记为无效。这一形式化方式使我们能够将语义角色分类转化为二分类问题，并训练一个轻量级人工神经网络（ANN）。该网络仅需极少参数即可实现快速收敛。与传统方法不同，在训练过程中我们未向神经网络引入任何语义角色信息。在推理阶段，我们通过随机采样与类比迁移，计算给定框架内所有候选语义角色的概率分布，从而还原语义角色。该方法在保持计算高效性与经济性的同时，超越了以往的最优性能。

摘要 (Abstract)

In this paper, we adopt a relational view of analogies applied to Semantic Role Classification in FrameNet. We define analogies as formal relations over the Cartesian product of frame evoking lexical units (LUs) and frame element (FEs) pairs, which we use to construct a new dataset. Each element of this binary relation is labelled as a valid analogical instance if the frame elements share the same semantic role, or as invalid otherwise. This formulation allows us to transform Semantic Role Classification into binary classification and train a lightweight Artificial Neural Network (ANN) that exhibits rapid convergence with minimal parameters. Unconventionally, no Semantic Role information is introduced to the neural network during training. We recover semantic roles during inference by computing probability distributions over candidates of all semantic roles within a given frame through random sampling and analogical transfer. This approach allows us to surpass previous state-of-the-art results while maintaining computational efficiency and frugality.

Keywords: FrameNet, Semantic Role Classification, Analogy, Artificial Neural Network, Binary Classification, Analogical Transfer, Computational Efficiency
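The inference step described in the abstract, recovering a role by sampling labelled pairs and transferring their roles through a binary analogy classifier, might look roughly like the sketch below. This is a guess at the overall shape, not the authors' procedure: `scorer` is a hypothetical stand-in for the trained network, the per-role averaging is an assumed aggregation rule, and the example pairs are invented.

```python
import random
from collections import defaultdict


def recover_role(target, labelled, scorer, n_samples=50, seed=0):
    """Return a probability distribution over roles for one target pair."""
    rng = random.Random(seed)
    sample = rng.sample(labelled, min(n_samples, len(labelled)))
    scores = defaultdict(list)
    for reference, role in sample:
        # scorer estimates how likely target and reference form a valid analogy.
        scores[role].append(scorer(target, reference))
    means = {role: sum(v) / len(v) for role, v in scores.items()}
    total = sum(means.values())
    return {role: m / total for role, m in means.items()} if total else {}


def toy_scorer(target, reference):
    # Toy stand-in for the binary classifier: compares the frame element text.
    return 0.9 if target[1] == reference[1] else 0.1


# Hypothetical (lexical unit, frame element) pairs with known roles.
labelled = [
    (("buy", "the customer"), "Buyer"),
    (("sell", "the shop"), "Seller"),
    (("purchase", "the customer"), "Buyer"),
]
distribution = recover_role(("acquire", "the customer"), labelled, toy_scorer)
print(max(distribution, key=distribution.get))
```

The classifier itself never sees role labels; the roles enter only through the labelled reference pairs, which matches the abstract's remark that no semantic role information is introduced during training.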

64. ❌ Learning Hierarchical Orthogonal Prototypes for Generalized Few-Shot 3D Point Cloud Segmentation

Authors: Yifei Zhao, Fanyu Zhao, Zhongyuan Zhang, Shengtang Wu, Yixuan Lin, Yinsheng Li Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19788v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

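Rationale: The paper addresses 3D point cloud segmentation, a computer vision task, and proposes the HOP3D framework to handle the stability-plasticity trade-off in generalized few-shot learning. Its core techniques are hierarchical orthogonal prototype learning, entropy-based regularization, and decoupling at the gradient and representation levels. Every keyword in the list targets large language models (LLMs), deep learning technical principles, or specific AI application areas such as bioinformatics, whereas this paper belongs to geometric deep learning within computer vision and has no direct link to those topics. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper proposes the HOP3D framework, which uses hierarchical orthogonal prototype learning and an entropy-based regularizer to address the stability-plasticity trade-off in generalized few-shot 3D point cloud segmentation (adapting to novel classes without hurting base-class performance), and reports better results than existing methods on ScanNet200 and ScanNet++.

Abstract (Chinese translation)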

广义少样本三维点云分割的目标是在仅利用少量标注样本的情况下适应新类别，同时保持对基类别的强性能表现，但由于固有的稳定性-可塑性权衡，这一任务仍具挑战性：适应新类别可能会干扰共享表征并导致基类别遗忘。本文提出HOP3D，一个通过基于熵的少样本正则化器学习层次化正交原型的统一框架，能够在实现稳健新类别适应的同时不降低基类别性能。HOP3D引入了层次化正交化机制，在梯度和表征层面解耦基类别与新类别的学习过程，有效缓解了基-新类别间的相互干扰。为进一步增强稀疏监督下的适应能力，我们融入了一种基于熵的正则化器，该正则化器利用预测不确定性来优化原型学习并促进平衡的预测结果。在ScanNet200和ScanNet++数据集上的大量实验表明，HOP3D在1样本和5样本设置下均持续优于现有先进基线方法。代码发布于https://fdueblab-hop3d.github.io/。

Abstract

Generalized few-shot 3D point cloud segmentation aims to adapt to novel classes from only a few annotations while maintaining strong performance on base classes, but this remains challenging due to the inherent stability-plasticity trade-off: adapting to novel classes can interfere with shared representations and cause base-class forgetting. We present HOP3D, a unified framework that learns hierarchical orthogonal prototypes with an entropy-based few-shot regularizer to enable robust novel-class adaptation without degrading base-class performance. HOP3D introduces hierarchical orthogonalization that decouples base and novel learning at both the gradient and representation levels, effectively mitigating base-novel interference. To further enhance adaptation under sparse supervision, we incorporate an entropy-based regularizer that leverages predictive uncertainty to refine prototype learning and promote balanced predictions. Extensive experiments on ScanNet200 and ScanNet++ demonstrate that HOP3D consistently outperforms state-of-the-art baselines under both 1-shot and 5-shot settings. The code is available at https://fdueblab-hop3d.github.io/.

Keywords: 3D point cloud segmentation, few-shot learning, hierarchical orthogonal prototypes, stability-plasticity trade-off, base-novel interference, entropy-based regularizer, generalized few-shot learning, HOP3D
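The abstract does not say how HOP3D decouples base and novel learning at the gradient level, so the following is only a generic sketch of one common way to do it: projecting a novel-class gradient onto the orthogonal complement of a base-class subspace. It also shows the Shannon entropy that an entropy-based regularizer would typically build on. The helper names `project_out` and `entropy`, and the assumption that the base subspace is given as an orthonormal basis, are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def project_out(grad: Sequence[float], basis: Sequence[Sequence[float]]) -> List[float]:
    """Remove from `grad` its components along an orthonormal `basis`.

    Assumption: `basis` spans the directions that matter for base classes,
    so the returned gradient does not move the model along them.
    """
    out = list(grad)
    for b in basis:
        coeff = dot(out, b)
        out = [g - coeff * bi for g, bi in zip(out, b)]
    return out


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


base_basis = [[1.0, 0.0, 0.0]]          # hypothetical base-class direction
novel_grad = [3.0, 4.0, 5.0]            # hypothetical novel-class gradient
safe_grad = project_out(novel_grad, base_basis)
print(safe_grad, dot(safe_grad, base_basis[0]))
print(entropy([0.5, 0.5]), entropy([1.0, 0.0]))
```

After projection the update has no component along the protected direction, which is the intuition behind limiting base-novel interference; the paper's actual hierarchical scheme may differ substantially.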

65. ❌ Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive

Authors: Robin Spanier, Thorsten Hoeser, John Truckenbrodt, Felix Bachofer, Claudia Kuenzer Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19801v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

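Rationale: The paper uses Sentinel-1 satellite data and deep learning to detect and monitor offshore oil and gas platforms across space and time, which places it in Earth observation and remote sensing. Because it explicitly mentions "deep learning-based object detection", it has some connection to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5): AI for Science is a broad category that covers AI applied to scientific fields, including Earth science and remote sensing. The paper does not touch on large language models (LLMs), model architectures (e.g., MoE), training methods (e.g., pre-training, fine-tuning, alignment), inference optimization, agent systems, or any other topic directly tied to large-model technology. All other keywords are unrelated to the paper and receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The study uses Sentinel-1 satellite data and deep learning to automatically monitor offshore oil and gas platform dynamics in the North Sea, the Gulf of Mexico, and the Persian Gulf from 2017 to 2025. It identifies 3,728 platforms in 2025 and analyzes changes in platform counts, installation and decommissioning trends, and structural change in the offshore sector.

Abstract (Chinese translation)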

海上基础设施（包括石油和天然气平台）对海洋空间的日益利用，凸显了持续、可扩展监测的必要性。海上开发具有经济、环境和监管方面的影响，但由于海域难以进入且空间范围广阔，系统化监测仍然困难。本研究提出了一种基于免费地球观测数据的海上油气平台时空自动探测方法。利用Sentinel-1存档数据和基于深度学习的物体检测技术，为2017年至2025年期间三个主要产区——北海、墨西哥湾和波斯湾——创建了连续的平台位置季度时间序列。此外，还提取了平台尺寸、水深、距海岸距离、所属国家以及安装和退役日期等信息。2025年共识别出3,728个海上平台，其中北海356个，墨西哥湾1,641个，波斯湾1,731个。尽管波斯湾的平台数量在2024年前持续增长，但墨西哥湾和北海的平台数量在2018-2020年间有所下降。同时，平台动态变化显著：超过2,700个平台被安装或迁移至新址，而类似数量的平台被退役或迁移。此外，寿命较短平台数量的增加，表明海上行业正经历结构性变化，这与自升式平台或钻井船等移动式海上单元的重要性日益增长相关。研究结果凸显了免费地球观测数据和深度学习在持续、长期监测海洋基础设施方面的潜力。所生成的数据集已公开，为海上监测、海洋规划以及海上能源行业转型分析提供了基础。

Abstract

The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations for three major production regions: the North Sea, the Gulf of Mexico, and the Persian Gulf, was created for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. 3,728 offshore platforms were identified in 2025, 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018-2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlighted the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.

Keywords: offshore oil and gas platforms, Sentinel-1, deep learning, object detection, spatiotemporal monitoring, Earth observation, marine infrastructure, platform dynamics

66. ❌ MOSS-TTSD: Text to Spoken Dialogue Generation

Authors: Yuqian Zhang, Donghua Yu, Zhengyuan Lin, Botian Jiang, Mingshu Chen, Yaozhou Jiang, Yiwei Zhao, Yiyang Zhang, Yucheng Yuan, Hanfu Chen, Kexin Huang, Jun Zhan, Cheng Chang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19739v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	5.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

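Rationale: MOSS-TTSD focuses on spoken dialogue generation, which belongs to speech synthesis rather than to innovation in large language models or deep learning technical principles. The "enhanced long-context modeling" mentioned in the abstract has some connection to "Context Window Extension OR Long Context LLMs", since both involve long-context processing. However, the paper models context for spoken dialogue rather than extending an LLM's context window, so it receives a relevance of 5. All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

The paper presents MOSS-TTSD, a model that addresses insufficient dialogue context modeling in long-form, multi-speaker spoken dialogue generation. It supports multiple languages and zero-shot voice cloning, and surpasses open-source and proprietary baselines in both objective and subjective evaluation.

Abstract (Chinese translation)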

口语对话生成在播客、动态解说与娱乐内容等应用中至关重要，但与单句文本转语音（TTS）相比面临显著挑战。其核心需求包括准确的说话人轮换、跨轮次声学一致性以及长时程稳定性，而现有模型常因缺乏对话上下文建模能力难以满足这些要求。为弥补这一不足，我们提出了MOSS-TTSD——一种为多语言、富有表现力的多方对话语音设计的口语对话合成模型。通过增强的长上下文建模能力，MOSS-TTSD能够根据带有明确说话人标签的对话脚本生成长时程口语对话，支持长达60分钟的单次合成、最多5位说话人的多方对话，以及基于短参考音频片段的零样本语音克隆。该模型支持包括英语和汉语在内的多种主流语言，并适配于多个长时程应用场景。此外，针对现有评估方法的局限性，我们提出了TTSD-eval：一种基于强制对齐的客观评估框架，可在不依赖说话人日志工具的情况下，量化说话人归属准确率与说话人相似度。主客观评估结果均表明，MOSS-TTSD在对话合成任务上超越了现有开源与商业基线模型。

Abstract

Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.

Keywords: spoken dialogue generation, text-to-speech, long-context modeling, multi-party dialogue, zero-shot voice cloning, TTSD-eval, speaker attribution, multi-language support

67. ❌ Uncertainty-aware Prototype Learning with Variational Inference for Few-shot Point Cloud Segmentation

Authors: Yifei Zhao, Fanyu Zhao, Yinsheng Li Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19757v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

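Rationale: The paper studies few-shot semantic segmentation of 3D point clouds and proposes an uncertainty-aware prototype learning method based on variational inference. Although it falls within computer vision and deep learning, every scoring keyword targets large language models (LLMs) and related techniques (e.g., MoE, RLHF, RAG, quantization) or specific applications of AI in science (e.g., bioinformatics). The paper involves no language models, large-model technical principles, or scientific applications of AI, so all keyword relevance scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes an uncertainty-aware prototype learning method that uses variational inference to address the prototype uncertainty caused by scarce supervision in few-shot 3D point cloud semantic segmentation. It achieves state-of-the-art performance on the ScanNet and S3DIS benchmarks while providing reliable uncertainty estimates.

Abstract (Chinese translation)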

少样本三维语义分割旨在仅用少量标注的支持样本为查询点云生成精确的语义掩码。现有的基于原型的方法通常从支持集中构建紧凑且确定性的原型来指导查询分割。然而，这种刚性表示无法捕捉由稀缺监督引入的内在不确定性，这往往导致鲁棒性下降和泛化能力有限。在本工作中，我们提出了UPL（Uncertainty-aware Prototype Learning，不确定性感知原型学习），这是一种概率方法，旨在将不确定性建模融入少样本三维分割的原型学习中。我们的框架引入了两个关键组件。首先，UPL设计了一个双流原型优化模块，通过联合利用来自支持和查询样本的有限信息来丰富原型表示。其次，我们将原型学习构建为一个变分推断问题，将类别原型视为潜在变量。这种概率化表述实现了显式的不确定性建模，提供了鲁棒且可解释的掩码预测。在广泛使用的ScanNet和S3DIS基准数据集上进行的大量实验表明，我们的UPL在不同设置下均取得了稳定领先的性能，同时提供了可靠的不确定性估计。代码发布于https://fdueblab-upl.github.io/。

Abstract

Few-shot 3D semantic segmentation aims to generate accurate semantic masks for query point clouds with only a few annotated support examples. Existing prototype-based methods typically construct compact and deterministic prototypes from the support set to guide query segmentation. However, such rigid representations are unable to capture the intrinsic uncertainty introduced by scarce supervision, which often results in degraded robustness and limited generalization. In this work, we propose UPL (Uncertainty-aware Prototype Learning), a probabilistic approach designed to incorporate uncertainty modeling into prototype learning for few-shot 3D segmentation. Our framework introduces two key components. First, UPL introduces a dual-stream prototype refinement module that enriches prototype representations by jointly leveraging limited information from both support and query samples. Second, we formulate prototype learning as a variational inference problem, regarding class prototypes as latent variables. This probabilistic formulation enables explicit uncertainty modeling, providing robust and interpretable mask predictions. Extensive experiments on the widely used ScanNet and S3DIS benchmarks show that our UPL achieves consistent state-of-the-art performance under different settings while providing reliable uncertainty estimation. The code is available at https://fdueblab-upl.github.io/.

Keywords: few-shot 3D semantic segmentation, point cloud segmentation, uncertainty-aware prototype learning, variational inference, prototype refinement, probabilistic approach, ScanNet benchmark, S3DIS benchmark
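Treating class prototypes as latent variables in a variational inference problem usually means learning a distribution over each prototype instead of a single vector. The sketch below assumes a diagonal Gaussian posterior with a standard normal prior, which is a common choice but is not stated in the abstract; the function names are hypothetical. It shows the two standard ingredients: a reparameterized sample and the closed-form KL term of the evidence lower bound.

```python
import math
import random
from typing import List, Sequence


def sample_prototype(mu: Sequence[float], log_var: Sequence[float],
                     rng: random.Random) -> List[float]:
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.2, -0.1], [-1.0, -1.0]   # hypothetical posterior parameters
draws = [sample_prototype(mu, log_var, rng) for _ in range(5)]
print(draws)
print(kl_to_standard_normal(mu, log_var))
```

Drawing several prototypes and comparing the resulting mask predictions gives a simple spread-based uncertainty estimate; whether UPL estimates uncertainty this way is not specified in the abstract.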

68. ❌ Embodied Science: Closing the Discovery Loop with Agentic Embodied AI

Authors: Xiang Zhuang, Chenyi Zhou, Kehua Feng, Zhihui Zhu, Yunfan Gao, Yijie Zhong, Yichi Zhang, Junjie Huang, Keyan Ding, Lei Bai, Haofen Wang, Qiang Zhang, Huajun Chen Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19782v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	8.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

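Rationale: The paper proposes the "embodied science" paradigm, which uses the Perception-Language-Action-Discovery (PLAD) framework to tightly couple agentic reasoning with physical execution and close the loop of scientific discovery. Core relevant keywords: 1) "LLM Agents/Autonomous Agents/Agentic Workflow" (10): the paper's central topic is embodied agents for scientific discovery; 2) "AI for Science/Bioinformatics/Cheminformatics" (10): it explicitly targets the life and chemical sciences; 3) "Chain of Thought/CoT Reasoning/Multi-step Reasoning" (8): agents reason over scientific knowledge; 4) "System 2 Thinking/Slow Thinking/In-depth Reasoning" (8): it emphasizes in-depth reasoning; 5) "Self-Correction/Self-Improvement/Self-Reflection" (8): agents internalize outcomes to drive subsequent exploration; 6) "Tool Use/Function Calling/API Tool Use" (8): it involves executing physical interventions; 7) "Large Language Models/LLMs/Foundation Models" (5): these may underlie the knowledge reasoning but are not explicitly mentioned. The remaining keywords are unrelated to the paper's embodied-science framework, physical-execution loop, or technical details.

!!! tip deepseek-chat TL;DR

The paper proposes the "embodied science" paradigm, which tightly couples agentic reasoning with physical execution through the Perception-Language-Action-Discovery (PLAD) framework, and offers a roadmap toward closed-loop autonomous discovery systems in the life and chemical sciences.

Abstract (Chinese translation)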

人工智能在预测科学属性方面展现出卓越能力，但科学发现本质上仍是由实验周期主导的、涉及物理过程的长期探索。当前大多数计算方法与这一现实存在偏差，它们将科学发现视为孤立、特定任务的预测，而非与物理世界的持续交互。本文提出“具身科学”这一新范式，将科学发现重新定义为智能体推理与物理执行紧密耦合的闭环过程。我们提出统一的感知-语言-行动-发现框架，在该框架中，具身智能体感知实验环境、基于科学知识进行推理、执行物理干预，并通过内化实验结果驱动后续探索。通过将计算推理建立在可靠的物理反馈基础上，该方法弥合了数字预测与实证验证之间的鸿沟，为生命科学与化学领域的自主发现系统提供了发展路径。

Abstract

Artificial intelligence has demonstrated remarkable capability in predicting scientific properties, yet scientific discovery remains an inherently physical, long-horizon pursuit governed by experimental cycles. Most current computational approaches are misaligned with this reality, framing discovery as isolated, task-specific predictions rather than continuous interaction with the physical world. Here, we argue for embodied science, a paradigm that reframes scientific discovery as a closed loop tightly coupling agentic reasoning with physical execution. We propose a unified Perception-Language-Action-Discovery (PLAD) framework, wherein embodied agents perceive experimental environments, reason over scientific knowledge, execute physical interventions, and internalize outcomes to drive subsequent exploration. By grounding computational reasoning in robust physical feedback, this approach bridges the gap between digital prediction and empirical validation, offering a roadmap for autonomous discovery systems in the life and chemical sciences.

Keywords: Embodied Science, Agentic Embodied AI, Perception-Language-Action-Discovery (PLAD), Scientific Discovery, Autonomous Discovery Systems, Physical Execution, Agentic Reasoning, Life and Chemical Sciences

69. ❌ FedRG: Unleashing the Representation Geometry for Federated Learning with Noisy Clients

Authors: Tian Wen, Zhiqin Yang, Yonggang Zhang, Xuefeng Jiang, Hao Peng, Yuwei Wang, Bo Han Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19722v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

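Rationale: The paper focuses on handling noisy labels in federated learning and proposes FedRG, a method based on representation geometry. Its core techniques are self-supervised learning, a von Mises-Fisher mixture model, feature-space analysis, and a noise absorption matrix, which place it in distributed machine learning and federated learning. Every scoring keyword centers on large models (LLMs) and related techniques (e.g., MoE, Scaling Laws, RLHF, RAG, quantization, inference acceleration), large-model applications (e.g., AI for Science), or specific large-model capabilities (e.g., chain of thought, tool use). The paper involves no large-model technique, principle, or application and does not discuss novel uses of deep learning in science, so it is unrelated to every keyword and receives 0 on each.

!!! tip deepseek-chat TL;DR

The paper targets the performance degradation caused by noisy labels in client data in federated learning and proposes FedRG, a method built on the principle of representation geometry priority. FedRG creates label-agnostic spherical representations through self-supervision and fits a vMF mixture model to identify noisy samples. Experiments show that it significantly outperforms existing methods under data heterogeneity with noisy clients.

Abstract (Chinese translation)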

联邦学习（FL）在分布式场景中因不可避免地存在噪声标注而面临性能下降问题。现有方法通过利用损失值来区分数据集中的噪声样本以进行标签校正，已取得一定进展。然而，在异构场景下，依赖标量损失进行噪声样本识别对联邦学习而言缺乏可靠性。本文从表征视角重新审视这一范式，提出基于表征几何的联邦学习方法（FedRG，即 Federated under Representation Geometry），该方法遵循“表征几何优先”原则来识别噪声标签。首先，FedRG通过自监督学习创建与标签无关的球面表征。随后，利用先前识别的干净样本，在该几何结构上迭代拟合球面冯·米塞斯-费舍尔（vMF）混合模型以捕捉语义簇。这一几何证据与语义标签软映射机制相结合，推导出无标签空间与标注标签条件下的特征空间之间的分布差异，从而鲁棒地识别噪声样本，并利用新分离的干净数据集更新vMF混合模型。最后，我们在噪声标签上引入额外的个性化噪声吸收矩阵以实现鲁棒优化。大量实验结果表明，在不同噪声客户端场景下，FedRG在数据异构的联邦学习中显著优于现有先进方法。

Abstract

Federated learning (FL) suffers from performance degradation due to the inevitable presence of noisy annotations in distributed scenarios. Existing approaches have advanced in distinguishing noisy samples from the dataset for label correction by leveraging loss values. However, noisy-sample recognition relying on scalar loss lacks reliability for FL under heterogeneous scenarios. In this paper, we rethink this paradigm from a representation perspective and propose FedRG (Federated under Representation Geometry), which follows the principle of "representation geometry priority" to recognize noisy labels. Firstly, FedRG creates label-agnostic spherical representations by using self-supervision. It then iteratively fits a spherical von Mises-Fisher (vMF) mixture model to this geometry using previously identified clean samples to capture semantic clusters. This geometric evidence is integrated with a semantic-label soft mapping mechanism to derive a distribution divergence between the label-free and annotated label-conditioned feature space, which robustly identifies noisy samples and updates the vMF mixture model with the newly separated clean dataset. Lastly, we employ an additional personalized noise absorption matrix on noisy labels to achieve robust optimization. Extensive experimental results demonstrate that FedRG significantly outperforms state-of-the-art methods for FL with data heterogeneity under diverse noisy-client scenarios.

Keywords: Federated Learning, Noisy Labels, Representation Geometry, Self-supervision, von Mises-Fisher Mixture Model, Data Heterogeneity, Label Correction, Robust Optimization
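To make the vMF mixture step concrete, the sketch below computes cluster responsibilities for a unit-norm feature and flags a sample whose annotated label receives little support. It is a simplified illustration, not FedRG itself: it assumes one concentration `kappa` shared by all components (so the vMF normalizing constant cancels), and the helper names and the threshold `tau` are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Project a feature vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def vmf_responsibilities(x: Sequence[float], means: Sequence[Sequence[float]],
                         weights: Sequence[float], kappa: float) -> List[float]:
    """Posterior p(k | x) under a vMF mixture with a shared concentration.

    With a shared kappa, p(k | x) is a softmax over
    log(weight_k) + kappa * <mean_k, x>.
    """
    logits = [math.log(w) + kappa * sum(m_i * x_i for m_i, x_i in zip(m, x))
              for m, w in zip(means, weights)]
    top = max(logits)                      # subtract the max for stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def looks_noisy(resp: Sequence[float], label: int, tau: float = 0.2) -> bool:
    """Flag a sample whose annotated cluster has low responsibility."""
    return resp[label] < tau


means = [[1.0, 0.0], [0.0, 1.0]]           # hypothetical cluster directions
x = normalize([1.0, 0.1])
resp = vmf_responsibilities(x, means, [0.5, 0.5], kappa=10.0)
print(resp, looks_noisy(resp, label=1))
```

The paper combines this geometric evidence with a semantic-label soft mapping and a distribution divergence; the simple threshold here only stands in for that decision rule.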

70. ❌ Stepwise: Neuro-Symbolic Proof Search for Automated Systems Verification

Authors: Baoding He, Zenan Li, Wei Sun, Yuan Yao, Taolue Chen, Xiaoxing Ma, Zhendong Su Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19715v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

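Rationale: The paper develops a neuro-symbolic proof generation framework that applies LLMs to automated proof search for systems-level verification projects. Highly relevant keywords: 1) "Large Language Models" (the paper explicitly uses LLMs for proof search); 2) "Post-training/Supervised Fine-tuning" (the paper fine-tunes LLMs on datasets of proof state-step pairs); 3) "Chain of Thought/Multi-step Reasoning" (the framework performs a best-first tree search, which involves multi-step reasoning); 4) "System 2 Thinking/In-depth Reasoning" (proof search requires deep, systematic reasoning). Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are unrelated; the paper does not cover these techniques.

!!! tip deepseek-chat TL;DR

The paper proposes a neuro-symbolic proof generation framework that combines LLMs with interactive theorem proving tools to automate proof search for systems-level verification projects. On the seL4 benchmark it proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer.

Abstract (Chinese translation)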

通过交互式定理证明的形式化验证正日益用于确保关键系统的正确性，但大型证明脚本的构建仍高度依赖人工，限制了可扩展性。大型语言模型（LLMs）尤其在数学推理方面的进展，使其与软件验证的结合前景日益广阔。本文提出一种神经符号证明生成框架，旨在为系统级验证项目实现证明搜索的自动化。该框架在证明状态上进行最佳优先树搜索，并反复查询LLM以获取下一个候选证明步骤。在神经层面，我们利用证明状态-步骤配对数据集对LLM进行微调；在符号层面，我们整合了一系列交互式定理证明（ITP）工具，用于修复被拒绝的步骤、筛选和排序证明状态，并在搜索进展停滞时自动完成子目标。这种协同作用实现了数据高效的LLM适应和基于语义的搜索空间剪枝。我们在一个新的Isabelle REPL上实现了该框架，该REPL暴露了细粒度的证明状态和自动化工具，并在FVEL seL4基准测试及其他Isabelle开发项目上进行了评估。在seL4上，该系统证明了高达77.6%的定理，显著超越了先前基于LLM的方法以及独立的Sledgehammer工具，同时解决了更多多步骤证明。在进一步基准测试中的结果展示了强大的泛化能力，为可扩展的自动化软件验证指明了一条可行路径。

Abstract

Formal verification via interactive theorem proving is increasingly used to ensure the correctness of critical systems, yet constructing large proof scripts remains highly manual and limits scalability. Advances in large language models (LLMs), especially in mathematical reasoning, make their integration into software verification increasingly promising. This paper introduces a neuro-symbolic proof generation framework designed to automate proof search for systems-level verification projects. The framework performs a best-first tree search over proof states, repeatedly querying an LLM for the next candidate proof step. On the neural side, we fine-tune LLMs using datasets of proof state-step pairs; on the symbolic side, we incorporate a range of ITP tools to repair rejected steps, filter and rank proof states, and automatically discharge subgoals when search progress stalls. This synergy enables data-efficient LLM adaptation and semantics-informed pruning of the search space. We implement the framework on a new Isabelle REPL that exposes fine-grained proof states and automation tools, and evaluate it on the FVEL seL4 benchmark and additional Isabelle developments. On seL4, the system proves up to 77.6% of the theorems, substantially surpassing previous LLM-based approaches and standalone Sledgehammer, while solving significantly more multi-step proofs. Results across further benchmarks demonstrate strong generalization, indicating a viable path toward scalable automated software verification.

Keywords: neuro-symbolic proof generation, large language models, automated proof search, systems verification, interactive theorem proving, best-first tree search, fine-tuning, Isabelle REPL
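The search loop described in the abstract (best-first search over proof states, asking a model for candidate next steps) can be sketched generically. This is a minimal illustration under stated assumptions, not the paper's system: `propose_steps` is a hypothetical stub standing in for the fine-tuned LLM plus the proof checker, states are ranked by cumulative step cost, and the toy "prover" below works on integers instead of Isabelle proof states.

```python
import heapq
from itertools import count
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Step = Tuple[float, str, Hashable]  # (cost, step text, resulting state)


def best_first_search(
    initial_state: Hashable,
    propose_steps: Callable[[Hashable], Iterable[Step]],
    is_proved: Callable[[Hashable], bool],
    max_expansions: int = 100,
) -> Optional[List[str]]:
    """Return the steps leading to a proved state, or None if none is found."""
    tie = count()  # tie-breaker so the heap never compares states or paths
    frontier = [(0.0, next(tie), initial_state, [])]
    seen = {initial_state}
    for _expansion in range(max_expansions):
        if not frontier:
            return None
        cost, _, state, path = heapq.heappop(frontier)  # cheapest state first
        if is_proved(state):
            return path
        for step_cost, step, nxt in propose_steps(state):
            if nxt not in seen:  # skip states that are already queued
                seen.add(nxt)
                heapq.heappush(
                    frontier, (cost + step_cost, next(tie), nxt, path + [step])
                )
    return None


def toy_propose(n: int) -> Iterable[Step]:
    """Toy stand-in for the LLM: shrink an integer 'goal' towards 0."""
    if n > 0:
        yield (1.0, "dec", n - 1)
        if n % 2 == 0:
            yield (0.5, "halve", n // 2)


proof = best_first_search(8, toy_propose, lambda n: n == 0)
print(proof)
```

In the paper's setting the symbolic side would additionally repair rejected steps, filter and rank states, and discharge subgoals automatically; those hooks would sit inside `propose_steps` or around the push onto the frontier.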

71. ❌ AIGQ: An End-to-End Hybrid Generative Architecture for E-commerce Query Recommendation

Authors: Jingcao Xu, Jianyun Zou, Renkai Yang, Zili Geng, Qiang Liu, Haihong Tang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19710v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

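Scoring rationale: The paper proposes the AIGQ architecture for e-commerce query recommendation. Its core innovations include IL-SFT (list-level supervised fine-tuning) and IL-GRPO (a policy optimization algorithm), so it is highly relevant to "Post-training OR Supervised Fine-tuning OR SFT" (10 points). The paper builds on a generative framework and involves applying large models, so it is somewhat related to "Large Language Models OR LLMs OR Foundation Models" (8 points). Other keywords such as MoE, SLMs, Scaling Laws, RAG, and CoT are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

To address the shallow semantics and poor cold-start performance of traditional query recommendation on e-commerce platforms, this work proposes AIGQ, the first end-to-end generative framework for the task. With techniques such as IL-SFT and IL-GRPO, it delivers substantial gains in key business metrics on Taobao.

Abstract (Chinese translation)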

预搜索查询推荐（在淘宝首页常被称为HintQ）在意图捕捉和需求发现中起着至关重要的作用，然而传统方法因依赖基于ID的匹配和共点击启发式规则，普遍存在语义理解浅层、冷启动性能差和意外发现率低的问题。为克服这些挑战，我们提出了AIGQ（AI-Generated Query架构），这是首个面向HintQ场景的端到端生成式框架。AIGQ建立在三项核心创新之上，涵盖训练范式、策略优化与部署架构。首先，我们提出兴趣感知列表监督微调（IL-SFT），这是一种列表级监督学习方法，通过会话感知的行为聚合与兴趣引导的重排序策略构建训练样本，以精准建模细粒度的用户意图。相应地，我们设计了兴趣感知列表组相对策略优化（IL-GRPO），这是一种具有双组件奖励机制的新型策略梯度算法，可联合优化单个查询的相关性与全局列表特性，并通过在线点击率（CTR）排序模型的模型奖励进行增强。为满足严格的实时性与低延迟部署要求，我们进一步开发了混合离线-在线架构，包含用于近线个性化用户到查询生成的AIGQ-Direct，以及推理增强变体AIGQ-Think——该组件生成触发词到查询的映射以丰富兴趣多样性。在淘宝平台上进行的广泛离线评估与大规模在线A/B实验表明，AIGQ在平台效能和用户参与度等关键业务指标上均实现了持续显著的提升。

Abstract

Pre-search query recommendation, widely known as HintQ on Taobao’s homepage, plays a vital role in intent capture and demand discovery, yet traditional methods suffer from shallow semantics, poor cold-start performance and low serendipity due to reliance on ID-based matching and co-click heuristics. To overcome these challenges, we propose AIGQ (AI-Generated Query architecture), the first end-to-end generative framework for HintQ scenario. AIGQ is built upon three core innovations spanning training paradigm, policy optimization and deployment architecture. First, we propose Interest-Aware List Supervised Fine-Tuning (IL-SFT), a list-level supervised learning approach that constructs training samples through session-aware behavior aggregation and interest-guided re-ranking strategy to faithfully model nuanced user intent. Accordingly, we design Interest-aware List Group Relative Policy Optimization (IL-GRPO), a novel policy gradient algorithm with a dual-component reward mechanism that jointly optimizes individual query relevance and global list properties, enhanced by a model-based reward from the online click-through rate (CTR) ranking model. To deploy under strict real-time and low-latency requirements, we further develop a hybrid offline-online architecture comprising AIGQ-Direct for nearline personalized user-to-query generation and AIGQ-Think, a reasoning-enhanced variant that produces trigger-to-query mappings to enrich interest diversity. Extensive offline evaluations and large-scale online A/B experiments on Taobao demonstrate that AIGQ consistently delivers substantial improvements in key business metrics across platform effectiveness and user engagement.

Keywords: query recommendation, generative framework, supervised fine-tuning, policy optimization, e-commerce, user intent modeling, hybrid architecture, A/B testing
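The group-relative idea behind GRPO-style training can be sketched as follows: several candidate query lists are sampled for the same user context, each list is scored, and every reward is normalized against the group's mean and standard deviation. The two-part reward (mean per-query relevance plus a list-level diversity score), the equal weights, and the use of the population standard deviation are assumptions for illustration, not the IL-GRPO reward defined in the paper.

```python
from statistics import mean, pstdev


def list_reward(relevance, diversity, w_item=0.5, w_list=0.5):
    """Hypothetical dual-component reward: per-query relevance plus a list-level term."""
    return w_item * mean(relevance) + w_list * diversity


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled query lists for one user: (per-query relevance scores, diversity score).
candidates = [
    ([0.9, 0.8], 0.2),  # relevant but repetitive
    ([0.6, 0.7], 0.9),  # slightly less relevant, much more diverse
    ([0.3, 0.2], 0.1),  # weak on both components
]
rewards = [list_reward(rel, div) for rel, div in candidates]
advantages = group_relative_advantages(rewards)
best = max(range(len(candidates)), key=lambda i: advantages[i])
print(best)  # index of the list that the policy update would favor most
```

Because advantages are computed relative to the group, no separate value network is needed; a list is reinforced only to the extent that it beats its siblings sampled for the same context.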

72. ❌ GoAgent: Group-of-Agents Communication Topology Generation for LLM-based Multi-Agent Systems

Authors: Hongjiang Chen, Xin Zheng, Yixin Liu, Pengfei Jiao, Shiyuan Li, Huan Liu, Zhidong Zhao, Ziqi Xu, Ibrahim Khalil, Shirui Pan Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19677v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

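Scoring rationale: The paper's core topic is communication topology generation in LLM-based multi-agent systems. It is highly relevant (10 points each) to "Large Language Models OR LLMs OR Foundation Models", "LLM Agents OR Autonomous Agents OR Agentic Workflow", and "Multi-agent Systems OR Agent Coordination", since these are the paper's underlying technology and central objects of study. Other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for science applications, are not covered in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

This paper addresses communication topology generation in LLM-based multi-agent systems and proposes GoAgent, which builds the communication graph from collaborative groups as atomic units and introduces a conditional information bottleneck to compress communication. Across six benchmarks it achieves 93.84% average accuracy while cutting token consumption by about 17%.

Abstract (Chinese translation)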

基于大语言模型（LLM）的多智能体系统（MAS）在解决复杂任务方面展现出卓越能力，但其效能高度依赖于协调智能体交互的底层通信拓扑结构。在这些系统中，成功解决问题通常需要针对特定任务构建群组结构，以分解并攻克子任务。然而，现有方法大多以节点为中心生成通信拓扑，使得群组结构仅从局部连接决策中隐式浮现，而非显式建模，这往往导致协调效率低下并产生不必要的通信开销。为应对这一局限，我们提出GoAgent（Group-of-Agents），一种将协作群组视为MAS构建原子单元的显式通信拓扑生成方法。具体而言，GoAgent首先通过大语言模型枚举与任务相关的候选群组，随后以自回归方式将这些群组作为原子单元进行选择与连接，以构建最终通信图，从而同时捕捉群组内聚性与群组间协调性。为缓解拓扑扩展过程中固有的通信冗余与噪声传播问题，我们进一步引入条件信息瓶颈（CIB）目标，以压缩群组间通信，在保留任务相关信号的同时过滤冗余的历史噪声。在六个基准测试上的大量实验表明，GoAgent以93.84%的平均准确率实现了最先进的性能，同时降低了约17%的令牌消耗。

Abstract

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated exceptional capabilities in solving complex tasks, yet their effectiveness depends heavily on the underlying communication topology that coordinates agent interactions. Within these systems, successful problem-solving often necessitates task-specific group structures to divide and conquer subtasks. However, most existing approaches generate communication topologies in a node-centric manner, leaving group structures to emerge implicitly from local connectivity decisions rather than modeling them explicitly, often leading to suboptimal coordination and unnecessary communication overhead. To address this limitation, we propose GoAgent (Group-of-Agents), a communication topology generation method that explicitly treats collaborative groups as the atomic units of MAS construction. Specifically, GoAgent first enumerates task-relevant candidate groups through an LLM and then autoregressively selects and connects these groups as atomic units to construct the final communication graph, jointly capturing intra-group cohesion and inter-group coordination. To mitigate communication redundancy and noise propagation inherent in expanding topologies, we further introduce a conditional information bottleneck (CIB) objective that compresses inter-group communication, preserving task-relevant signals while filtering out redundant historical noise. Extensive experiments on six benchmarks demonstrate the state-of-the-art performance of GoAgent with 93.84% average accuracy while reducing token consumption by about 17%.

Keywords: Large Language Models, Multi-agent Systems, Communication Topology, Group-of-Agents, Conditional Information Bottleneck, Agent Coordination, LLM-based MAS, Token Efficiency
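To make the contrast with node-centric construction concrete, the sketch below assembles a communication graph from whole groups rather than from individual edge decisions. It is only an illustration under stated assumptions: the group names, the agents, the rule that members of a group talk to each other in both directions, and the rule that the last member of one group hands off to the first member of the next are all hypothetical, and the learned autoregressive group selection and the CIB objective from the paper are not modeled.

```python
from itertools import combinations


def build_group_graph(groups, group_links):
    """Return directed agent-to-agent edges built from groups used as atomic units.

    groups: dict mapping a group name to an ordered list of agent names.
    group_links: ordered (source_group, target_group) pairs.
    """
    edges = set()
    # Intra-group cohesion: every pair inside a group communicates both ways.
    for members in groups.values():
        for a, b in combinations(members, 2):
            edges.add((a, b))
            edges.add((b, a))
    # Inter-group coordination: one hand-off edge per linked pair of groups
    # (assumption: last member of the source sends to first member of the target).
    for src, dst in group_links:
        edges.add((groups[src][-1], groups[dst][0]))
    return edges


groups = {"plan": ["planner", "critic"], "code": ["coder", "tester"]}
edges = build_group_graph(groups, [("plan", "code")])
print(sorted(edges))
```

With two groups of two agents and one link, the graph has five directed edges instead of the twelve a fully connected four-agent topology would need, which is the kind of communication saving explicit group structure aims for.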

73. ❌ A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19685v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

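Scoring rationale: The paper studies planning for LLM agents on long-horizon tasks, proposing an online planning framework based on subgoal decomposition and an RL training framework based on milestone rewards. It is highly relevant to "LLM Agents" (10 points) because it focuses on improving LLM-based agents, and highly relevant to "Large Language Models" (10 points) because the work builds on LLMs such as Gemini and Gemma3-12B. It is somewhat related to "Chain of Thought" and "System 2 Thinking" (5 points each) because it involves multi-step reasoning and deliberate planning, and somewhat related to "Tool Use" (5 points) because the agent executes actions in digital environments. Other keywords such as MoE, quantization, and RAG are not covered (0 points).

!!! tip deepseek-chat TL;DR

To address the tendency of LLM agents to lose track during long-horizon tasks, this paper proposes an online planning framework based on subgoal decomposition and MiRA, a reinforcement learning training framework based on milestone rewards. Together they substantially raise agent success rates on tasks such as web navigation.

Abstract (Chinese translation)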

基于大语言模型（LLM）的智能体已成为数字环境（包括移动界面、操作系统和网络浏览器）中强大的自主控制器。以网络导航为例，其需要处理动态内容和长序列操作，因此尤其具有挑战性。现有基于LLM的智能体在长程规划方面主要面临两大困难。在在线执行过程中，随着新信息的不断涌入，智能体常常迷失方向，缺乏一条清晰且适应性的通往最终目标的路径。这一问题在强化学习（RL）微调阶段进一步加剧：稀疏且延迟的奖励信号使得智能体难以识别哪些动作促成了成功，阻碍了其在长周期任务中保持连贯推理。为应对这些挑战，我们提出两项贡献。首先，我们引入一种智能体框架，该框架利用专有模型通过子目标分解进行在线规划。其次，我们提出了MiRA（基于里程碑的强化学习增强智能体），这是一个使用密集、基于里程碑的奖励信号的强化学习训练框架。实时规划机制将Gemini等专有模型在WebArena-Lite基准测试上的成功率（SR）绝对提升了约10%。同时，将MiRA应用于开源的Gemma3-12B模型，使其成功率从6.4%提升至43.0%。这一表现超越了GPT-4-Turbo（17.6%）和GPT-4o（13.9%）等专有系统，也超越了此前开源的先进模型WebRL（38.4%）。总体而言，我们的研究结果表明，将显式的推理时规划与基于里程碑的奖励相结合，能显著提升智能体的长程任务能力，为构建更鲁棒、更通用的自主系统铺平了道路。

Abstract

Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves proprietary models such as Gemini by approximately a 10% absolute increase in success rate (SR) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent’s long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.

Keywords: LLM agents, long-horizon planning, subgoal decomposition, reinforcement learning, milestone-based rewards, Web navigation, autonomous systems, real-time planning
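The difference between a sparse terminal reward and dense milestone-based rewards can be shown with a short sketch. The milestone predicates, the bonus of 0.2 per milestone, the terminal reward of 1.0, and the page names are hypothetical placeholders; the abstract does not specify how MiRA defines or weights its milestones.

```python
def milestone_rewards(trajectory, milestones, success,
                      milestone_bonus=0.2, final_reward=1.0):
    """Assign a per-step reward: a bonus when the next ordered milestone is reached,
    plus the terminal reward on the last step of a successful episode."""
    rewards = []
    next_idx = 0  # milestones must be reached in order
    for state in trajectory:
        r = 0.0
        while next_idx < len(milestones) and milestones[next_idx](state):
            r += milestone_bonus
            next_idx += 1
        rewards.append(r)
    if rewards and success:
        rewards[-1] += final_reward
    return rewards


# Toy web-navigation episode: each state is just the name of the current page.
trajectory = ["home", "search", "product", "cart", "checkout"]
milestones = [lambda s: s == "search", lambda s: s == "cart"]

dense = milestone_rewards(trajectory, milestones, success=True)
sparse = milestone_rewards(trajectory, [], success=True)
print(dense)   # [0.0, 0.2, 0.0, 0.2, 1.0]
print(sparse)  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

In the sparse case the agent receives no signal until the final step, so credit assignment over five actions rests on a single number; the dense variant marks intermediate progress at the steps where it happens.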

74. ❌ ATHENA: Adaptive Test-Time Steering for Improving Count Fidelity in Diffusion Models

Authors: Mohammad Shahab Sepehri, Asal Mehradfar, Berk Tinaz, Salman Avestimehr, Mahdi Soltanolkotabi Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19676v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance
“Large Language Models” OR “LLMs” OR “Foundation Models”	1.0	0.0/10
“Mixture of Experts” OR “MoE” OR “Sparse Models”	1.0	0.0/10
“Small Language Models” OR “SLMs” OR “On-device AI”	1.0	0.0/10
“Scaling Laws” AND “Data Quality”	1.0	0.0/10
“Pre-training” OR “Continual Pre-training” OR “Domain Adaptation”	1.0	0.0/10
“Post-training” OR “Supervised Fine-tuning” OR “SFT”	1.0	0.0/10
“Instruction Tuning” OR “Alignment” OR “Value Alignment”	1.0	0.0/10
“RLHF” OR “RLAIF” OR “Direct Preference Optimization” OR “DPO”	1.0	0.0/10
“PEFT” OR “LoRA” OR “Parameter-efficient Fine-tuning”	1.0	0.0/10
“Retrieval-Augmented Generation” OR “RAG” OR “Retrieval-Generation”	1.0	0.0/10
“Context Window Extension” OR “Long Context LLMs”	1.0	0.0/10
“KV Cache Compression” OR “Linear Attention” OR “FlashAttention”	1.0	0.0/10
“Chain of Thought” OR “CoT Reasoning” OR “Multi-step Reasoning”	1.0	0.0/10
“System 2 Thinking” OR “Slow Thinking” OR “In-depth Reasoning”	1.0	0.0/10
“Monte Carlo Tree Search” OR “MCTS” AND “LLM”	1.0	0.0/10
“Self-Correction” OR “Self-Improvement” OR “Self-Reflection”	1.0	0.0/10
“LLM Agents” OR “Autonomous Agents” OR “Agentic Workflow”	1.0	0.0/10
“Tool Use” OR “Function Calling” OR “API Tool Use”	1.0	0.0/10
“Multi-agent Systems” OR “Agent Coordination”	1.0	0.0/10
“Quantization” OR “Model Compression” OR “Low-bit Weights”	1.0	0.0/10
“Speculative Decoding” OR “Inference Acceleration”	1.0	0.0/10
“Hallucination Mitigation” OR “Factuality” OR “Truthfulness”	1.0	0.0/10
“Mechanistic Interpretability” OR “Explainable AI”	1.0	0.0/10
“World Models” AND “General World Models”	1.0	0.0/10
“Model Merging” OR “Model Soups” OR “Weight Averaging”	1.0	0.0/10
“In-context Learning” OR “Many-shot Learning”	1.0	0.0/10
“AI for Science” OR “Bioinformatics” OR “Cheminformatics”	1.0	0.0/10

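Scoring rationale: The paper focuses on object count fidelity in text-to-image generation with diffusion models and proposes ATHENA, a test-time adaptive steering framework. All of the scoring keywords explicitly target large language models (LLMs) and related techniques (MoE, SFT, RLHF, RAG, CoT, agents, and so on), whereas this paper studies diffusion models, a different branch of generative AI. Diffusion models and LLMs overlap somewhat in architecture, training methods, and applications (both are generative models, for example), but the paper does not involve any LLM-specific technique, principle, or application. All keywords are therefore unrelated to the paper's content and score 0.

!!! tip deepseek-chat TL;DR

This paper targets the systematic failures of text-to-image diffusion models when prompts specify an object count. It proposes ATHENA, a test-time adaptive steering framework that requires no architecture changes or retraining and improves object count fidelity through early noise correction.

Abstract (Chinese translation)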

文本到图像扩散模型在视觉保真度方面表现出色，但在提示指定明确物体数量时却意外地存在系统性数值控制失效问题。为应对这一局限，我们提出了ATHENA——一种与模型无关、可在测试时自适应调整的引导框架，它无需修改模型架构或重新训练即可提升物体数量准确性。ATHENA利用采样过程中的中间表征来估计物体数量，并在去噪过程早期施加基于数量感知的噪声校正，从而在结构错误难以修正前引导生成轨迹。我们提出了ATHENA的三种渐进式进阶变体，它们以额外计算量为代价提升数值精度，其范围从基于静态提示的引导到动态调整的数量感知控制。在现有基准测试集及我们新构建的视觉与语义复杂度数据集上的实验表明，ATHENA能持续提升数量准确性，尤其在目标数量较大时效果显著，同时在多种扩散模型骨干网络上保持了良好的精度-运行时权衡关系。

Abstract

Text-to-image diffusion models achieve high visual fidelity but surprisingly exhibit systematic failures in numerical control when prompts specify explicit object counts. To address this limitation, we introduce ATHENA, a model-agnostic, test-time adaptive steering framework that improves object count fidelity without modifying model architectures or requiring retraining. ATHENA leverages intermediate representations during sampling to estimate object counts and applies count-aware noise corrections early in the denoising process, steering the generation trajectory before structural errors become difficult to revise. We present three progressively more advanced variants of ATHENA that trade additional computation for improved numerical accuracy, ranging from static prompt-based steering to dynamically adjusted count-aware control. Experiments on established benchmarks and a new visually and semantically complex dataset show that ATHENA consistently improves count fidelity, particularly at higher target counts, while maintaining favorable accuracy-runtime trade-offs across multiple diffusion backbones.

Keywords: Diffusion Models, Text-to-Image Generation, Object Count Fidelity, Test-Time Adaptation, Noise Correction, Denoising Process, Model-Agnostic Framework, Adaptive Steering

Authors: Zhijian Gong, Tianren Yao, Wenjia Dong, Xueyuan Xu Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19667v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

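Scoring rationale: The paper studies visual reconstruction jointly guided by EEG signals and text, an application of AI to neuroscience and biomedical imaging. All keywords relate directly to the principles, training methods, inference optimization, or agent systems of large models, whereas this paper contributes no innovation in large models or deep learning principles; it only uses deep learning for feature extraction and image generation from a specific modality (EEG). The only relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the work applies AI to a scientific field (neuroscience), but that is not its core contribution, so it receives 5 points (somewhat related). All other keywords are unrelated and score 0.

!!! tip deepseek-chat TL;DR

Existing EEG visual reconstruction methods rely heavily on text/image alignment and lose spatial and chromatic detail as a result. This paper proposes a Joint-Modal Visual Reconstruction (JMVR) framework that treats EEG and text as independent modalities and uses multi-scale EEG encoding, achieving state-of-the-art performance on the THINGS-EEG dataset with markedly better spatial structure and chromatic fidelity.

Abstract (Chinese translation)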

人类视觉重建旨在基于受试者提供的描述及对应神经信号，重构细粒度视觉刺激。作为广泛采用的模态，脑电图（Electroencephalography, EEG）能够捕捉丰富的视觉认知信息，涵盖场景中复杂的空间关系与色彩细节。然而，现有方法深度依赖于对齐框架，强制将EEG特征与文本或图像语义表征进行匹配。这种依赖性可能压缩EEG中蕴含的丰富空间与色彩细节，导致仅能实现条件式图像生成，而非高保真视觉重建。为突破此局限，本文提出一种新颖的联合模态视觉重建（Joint-Modal Visual Reconstruction, JMVR）框架。该框架将EEG与文本视为独立模态进行联合学习，以保留EEG特有的信息用于重建。进一步采用多尺度EEG编码策略，同步捕获细粒度和粗粒度特征，并结合图像增强技术以提升感知细节的恢复能力。在THINGS-EEG数据集上的大量实验表明，JMVR相较于六种基线方法实现了最先进的性能，尤其在空间结构建模与色彩保真度方面展现出卓越能力。

Abstract

Human visual reconstruction aims to reconstruct fine-grained visual stimuli based on subject-provided descriptions and corresponding neural signals. As a widely adopted modality, Electroencephalography (EEG) captures rich visual cognition information, encompassing complex spatial relationships and chromatic details within scenes. However, current approaches are deeply coupled with an alignment framework that forces EEG features to align with text or image semantic representation. The dependency may condense the rich spatial and chromatic details in EEG that achieved mere conditioned image generation rather than high-fidelity visual reconstruction. To address this limitation, we propose a novel Joint-Modal Visual Reconstruction (JMVR) framework. It treats EEG and text as independent modalities for joint learning to preserve EEG-specific information for reconstruction. It further employs a multi-scale EEG encoding strategy to capture both fine- and coarse-grained features, alongside image augmentation to enhance the recovery of perceptual details. Extensive experiments on the THINGS-EEG dataset demonstrate that JMVR achieves SOTA performance against six baseline methods, specifically exhibiting superior capabilities in modeling spatial structure and chromatic fidelity.

Keywords: EEG-based visual reconstruction, Joint-modal learning, Multi-scale EEG encoding, High-fidelity visual reconstruction, THINGS-EEG dataset, Spatial structure modeling, Chromatic fidelity, Image augmentation

Authors: Renhong Huang, Ning Tang, Jiarong Xu, Yuxuan Cao, Qingqian Tu, Sheng Guo, Bo Zheng, Huiyuan Liu, Yang Yang Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19649v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

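Scoring rationale: The paper's core is an LLM-based agent social simulation sandbox. It uses LLMs directly as the underlying technology (10 points), refines user agents with SFT and DPO (10 points each), builds LLM agents for social simulation (10 points), and involves multi-agent coordination (5 points). Other keywords such as MoE, SLMs, and RAG are not mentioned in the abstract and are scored 0.

!!! tip deepseek-chat TL;DR

This paper proposes PolicySim, an LLM-based social simulation sandbox that combines user agents refined with SFT and DPO with an adaptive intervention module to proactively evaluate and optimize platform intervention policies.

Abstract (Chinese translation)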

社交平台作为信息交换的核心枢纽，用户行为与平台干预共同塑造着舆论走向。然而，推荐算法和内容过滤等干预策略可能无意中加剧信息茧房和观点极化，带来显著的社会风险。因此，主动评估此类策略的影响至关重要。现有方法主要依赖被动的在线A/B测试，风险往往在部署后才被察觉，导致风险识别滞后且成本高昂。基于大语言模型（LLM）的社会模拟为部署前评估提供了有前景的替代方案，但现有方法在真实模拟平台干预机制以及纳入平台反馈方面仍存在不足。构建可操作的框架以评估和优化平台策略，必须弥补这些差距。为此，我们提出PolicySim，一个基于LLM的社会模拟沙箱，用于主动评估和优化干预策略。PolicySim通过两个核心组件模拟用户行为与平台干预之间的双向动态：(1) 一个通过监督微调（SFT）和直接偏好优化（DPO）精炼的用户智能体模块，以实现贴合特定平台的行为真实性；(2) 一个自适应干预模块，采用结合消息传递的情境赌博机来捕捉动态网络结构。实验表明，PolicySim能够在微观和宏观层面准确模拟平台生态系统，并支持有效的干预策略优化。

Abstract

Social platforms serve as central hubs for information exchange, where user behaviors and platform interventions jointly shape opinions. However, intervention policies like recommendation and content filtering, can unintentionally amplify echo chambers and polarization, posing significant societal risks. Proactively evaluating the impact of such policies is therefore crucial. Existing approaches primarily rely on reactive online A/B testing, where risks are identified only after deployment, making risk identification delayed and costly. LLM-based social simulations offer a promising pre-deployment alternative, but current methods fall short in realistically modeling platform interventions and incorporating feedback from the platform. Bridging these gaps is essential for building actionable frameworks to assess and optimize platform policies. To this end, we propose PolicySim, an LLM-based social simulation sandbox for the proactive assessment and optimization of intervention policies. PolicySim models the bidirectional dynamics between user behavior and platform interventions through two key components: (1) a user agent module refined via supervised fine-tuning (SFT) and direct preference optimization (DPO) to achieve platform-specific behavioral realism; and (2) an adaptive intervention module that employs a contextual bandit with message passing to capture dynamic network structures. Experiments show that PolicySim can accurately simulate platform ecosystems at both micro and macro levels and support effective intervention policy.

Keywords: LLM-based agent, social simulation, policy optimization, supervised fine-tuning, direct preference optimization, adaptive intervention, platform ecosystems, proactive assessment
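The abstract names direct preference optimization (DPO) as one of the two steps used to refine the user agents but does not spell out the objective. As a reference point, the sketch below computes the standard per-pair DPO loss, `-log(sigmoid(beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))))`, from scalar sequence log-probabilities. The numeric values and `beta=0.1` are placeholders, and how PolicySim constructs its preference pairs is not covered here.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities under policy and reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the margin is zero and the loss is log(2).
loss_same = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raises the chosen reply and lowers the rejected one: the loss drops.
loss_better = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(loss_same, loss_better)
```

The loss depends only on how far the policy has moved from the reference in favor of the chosen reply, which is why no explicit reward model appears in the computation.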

77. ❌ The Residual Stream Is All You Need: On the Redundancy of the KV Cache in Transformer Inference

Authors: Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, Heying Zhang Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19664v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	10.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

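Scoring rationale: The paper's core contribution is showing that the KV cache is redundant in transformer inference and proposing an inference scheme, KV-Direct, that reduces memory use by recomputing KV pairs from the residual stream. It is therefore highly relevant to "KV Cache Compression OR Linear Attention OR FlashAttention" (10 points), which is its core technical innovation. It is relevant to "Large Language Models OR LLMs OR Foundation Models" (8 points) because the paper tests multiple models (135M to 4B parameters). It is relevant to "Speculative Decoding OR Inference Acceleration" (8 points) because KV-Direct aims to speed up inference and reduce memory through recomputation, improving inference efficiency. Other keywords such as MoE, SFT, RAG, and CoT are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

This paper proves that the KV cache in transformer inference is entirely redundant and proposes KV-Direct, an inference scheme that recomputes KV pairs from the residual stream to substantially reduce memory use while keeping outputs exactly identical.

Abstract (Chinese translation)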

键值（KV）缓存被广泛视为Transformer推理中的核心状态，大量研究工作致力于设计策略以压缩、淘汰或近似其条目。我们证明该状态完全冗余：每一层的键和值都是残差流的确定性投影，从每个词元的单一残差向量重新计算它们不会产生任何重构误差——不是近似，而是比特级完全一致。我们在来自四种架构家族的六个模型（参数量1.35亿至40亿）中验证了这一点。通过逐层跨任务残差修补，修补后与原始输出分布之间的D_KL = 0，证实残差流满足马尔可夫性质，且是唯一的信息承载状态。完全移除缓存并从头重新计算，在所有测试模型上采用贪心解码时均产生词元完全一致的输出。基于此结果，我们提出了KV-Direct——一种有界内存推理方案，它检查点残差向量（在Gemma 3-4B上每词元5 KB）而非完整KV对（136 KB），并按需重新计算键和值。在超过20轮对话中，KV-Direct将峰值内存控制在42 MB，而标准缓存增长至超过103 MB。与五种淘汰基线方法（H2O、StreamingLLM、SnapKV、TOVA、仅窗口缓存）相比，KV-Direct在所有缓存预算下均保持100%词元匹配；所有基线方法则下降至5-28%。逐操作延迟分析表明，在中等批次大小下，重新计算比读取缓存张量快达5倍。代码发布于https://github.com/Kaleemullahqasim/KV-Direct。

Abstract

The key-value (KV) cache is widely treated as essential state in transformer inference, and a large body of work engineers policies to compress, evict, or approximate its entries. We prove that this state is entirely redundant: keys and values at every layer are deterministic projections of the residual stream, and recomputing them from a single residual vector per token incurs exactly zero reconstruction error, not approximately, but bit-identically. We verify this across six models from four architecture families (135M to 4B parameters). Cross-task residual patching at every layer produces D_KL = 0 between patched and original output distributions, confirming that the residual stream satisfies a Markov property and is the sole information-carrying state. Removing the cache entirely and recomputing from scratch yields token-identical output under greedy decoding on all models tested. We build on this result with KV-Direct, a bounded-memory inference scheme that checkpoints residual vectors (5 KB per token on Gemma 3-4B) instead of full KV pairs (136 KB), recomputing keys and values on demand. Over 20 conversation turns, KV-Direct holds peak memory at 42 MB while the standard cache grows past 103 MB. Against five eviction baselines (H2O, StreamingLLM, SnapKV, TOVA, window-only), KV-Direct maintains 100% token match at every cache budget; all baselines degrade to 5-28%. A per-operation latency analysis shows recomputation runs up to 5x faster than reading cached tensors at moderate batch sizes. Code is available at https://github.com/Kaleemullahqasim/KV-Direct.

Keywords: KV cache, transformer inference, residual stream, memory efficiency, KV-Direct, recomputation, bounded-memory inference, token-identical output
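The central claim, that keys and values are deterministic projections of the residual stream and can therefore be rebuilt instead of stored, can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the KV-Direct implementation: the weights and residual vectors are made-up numbers, and layer normalization, positional encoding, and multi-head structure are omitted. The per-token storage figures at the end are the ones quoted in the abstract for Gemma 3-4B.

```python
def project(vec, weight):
    """Row vector (length d) times a d x k matrix stored as a list of rows."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]


# Hypothetical toy weights (d = 3, k = 2) and one residual vector per token.
W_K = [[1.0, 0.0], [0.5, -1.0], [0.0, 2.0]]
W_V = [[0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
residuals = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]

# Standard approach: compute K and V once and keep them in a cache.
kv_cache = [(project(h, W_K), project(h, W_V)) for h in residuals]

# Recomputation: keep only the residuals and rebuild K and V when needed.
recomputed = [(project(h, W_K), project(h, W_V)) for h in residuals]
identical = recomputed == kv_cache  # same inputs and arithmetic, same result

# Per-token storage figures quoted in the abstract for Gemma 3-4B.
residual_kb, kv_kb = 5, 136
ratio = kv_kb / residual_kb
print(identical, f"{ratio:.1f}x")  # True 27.2x
```

The trade is memory for compute: each checkpointed residual is about 27 times smaller than the KV pairs it replaces, at the cost of redoing the projections whenever those keys and values are needed.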

78. ❌ OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework

Authors: Weixuan Zeng, Pengcheng Wei, Huaiqing Wang, Boheng Zhang, Jia Sun, Dewen Fan, Lin HE, Long Chen, Qianqian Gan, Fan Yang, Tingting Gao Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19643v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

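Scoring rationale: The paper focuses on virtual try-on in computer vision, using a Diffusion Transformer for the VTON/VTOFF tasks, and is unrelated to all scoring keywords (which target the technical principles, training methods, inference optimization, and applications of large language models). It does not touch on language models, MoE, scaling laws, training methods, alignment, reasoning techniques, agent systems, model compression, hallucination mitigation, interpretability, AI for science, or related topics.

!!! tip deepseek-chat TL;DR

The paper proposes OmniDiT, a Diffusion Transformer-based framework that unifies virtual try-on and try-off, builds a large-scale dataset through a self-evolving data pipeline, and introduces Shifted Window Attention to reduce computational complexity, achieving strong performance in complex scenes.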

摘要翻译

尽管虚拟试穿（VTON）与虚拟脱衣（VTOFF）技术发展迅速，现有VTON方法仍面临细粒度细节保留、复杂场景泛化能力、流程复杂及高效推理等方面的挑战。为解决这些问题，我们提出了OmniDiT——一个基于扩散Transformer的全能虚拟试穿框架，它将试穿与脱衣任务统一整合至单一模型中。具体而言，我们首先构建了一个自演进的数据处理流程以持续生成数据，并创建了大规模VTON数据集Omni-TryOn，其中包含超过38万组多样化、高质量的服装-模特-试穿图像对及精细文本描述。随后，我们采用令牌拼接策略并设计了自适应位置编码，以有效融合多重参考条件。为缓解长序列计算瓶颈，我们首次将移位窗口注意力机制引入扩散模型，从而实现线性计算复杂度。为改善局部窗口注意力导致的性能下降，我们采用多时间步预测与对齐损失函数以提升生成保真度。实验表明，在各种复杂场景下，我们的方法在无模型VTON与VTOFF任务中均取得最优性能，并在基于模型的VTON任务中达到与当前SOTA方法相当的效果。

摘要 (Abstract)

Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods face challenges with fine-grained detail preservation, generalization to complex scenes, complicated pipeline, and efficient inference. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which combines try-on and try-off tasks into one unified model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large VTON dataset Omni-TryOn, which contains over 380k diverse and high-quality garment-model-tryon image pairs and detailed text prompts. Then, we employ the token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, thus achieving a linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple timestep prediction and an alignment loss to improve generation fidelity. Experiments reveal that, under various complex scenes, our method achieves the best performance in both the model-free VTON and VTOFF tasks and a performance comparable to current SOTA methods in the model-based VTON task.

Keywords: Virtual Try-On, Diffusion Transformer, OmniDiT, Shifted Window Attention, VTON, VTOFF, Omni-TryOn dataset, linear complexity
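
The linear-complexity claim for window attention can be checked with simple counting. The sketch below is a simplified 1-D illustration, not OmniDiT's implementation: it assumes a flat token sequence, a fixed window size, and a cyclic shift, whereas image models typically use 2-D windows with masking. All function names are hypothetical.

```python
def global_attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention: quadratic in length."""
    return num_tokens * num_tokens

def window_attention_pairs(num_tokens, window):
    """Pairs scored when tokens only attend within fixed-size windows."""
    if num_tokens % window != 0:
        raise ValueError("num_tokens must be a multiple of window")
    return (num_tokens // window) * window * window  # equals num_tokens * window

def shifted_windows(num_tokens, window, shift):
    """Token indices per window after a cyclic shift of the sequence."""
    order = [(i + shift) % num_tokens for i in range(num_tokens)]
    return [order[start:start + window] for start in range(0, num_tokens, window)]

n, w = 4096, 64
print(global_attention_pairs(n), window_attention_pairs(n, w))
print(shifted_windows(8, 4, 0))  # unshifted layer
print(shifted_windows(8, 4, 2))  # shifted layer mixes tokens across window borders
```

For a fixed window size the windowed cost grows linearly with sequence length, and alternating shifted and unshifted layers lets information cross window boundaries.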

79. ❌ MetaCues: Enabling Critical Engagement with Generative AI for Information Seeking and Sensemaking

Authors: Anjali Singh, Karan Taneja, Zhitong Guan, Soo Young Rieh Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19634v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

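Scoring rationale: The paper studies user-interaction design for generative AI search tools, in particular how the MetaCues tool uses metacognitive cues to promote critical engagement and information verification. This relates to LLM applications (keyword 1, 5 points) and touches on critical thinking (keyword 14, 5 points), self-reflection (keyword 16, 5 points), factuality verification (keyword 22, 5 points), and explainable AI (keyword 23, 5 points). However, the paper does not involve innovation in the technical principles of large models (such as MoE, quantization, or inference acceleration), nor domain-specific scientific applications, so most technical keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies how the metacognitive cues provided by the MetaCues tool can improve users' critical engagement with generative AI search tools; experiments show that the tool increases users' confidence in their attitudinal judgments about the search topic and promotes broader inquiry.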

摘要翻译

生成式人工智能（GenAI）搜索工具在信息检索中的应用日益广泛，但其设计往往倾向于鼓励认知卸载，这可能导致用户被动参与、选择性注意及信息同质化。有效使用此类工具需要元认知参与，包括构建优质提示词、验证AI输出结果以及对信息进行批判性介入。我们开发了MetaCues——一种基于生成式人工智能的新型交互式信息检索工具，该工具在提供AI生成回答的同时嵌入元认知提示，并配备笔记界面以引导用户的检索过程及相关学习。通过一项在线研究（N = 146），我们在两个需要参与者探索多元视角以形成知情判断的广泛搜索主题中，将MetaCues与无提示的基础工具进行了对比。关于参与者检索行为的初步结果表明，MetaCues能提升用户对搜索主题态度判断的自信心，并促进更广泛的探究行为；其中后一种效应主要出现在争议性较低且参与者相对陌生的主题中。基于此，我们提出了未来对搜索交互行为与探究模式进行定性研究的方向。

摘要 (Abstract)

Generative AI (GenAI) search tools are increasingly used for information seeking, yet their design tends to encourage cognitive offloading, which may lead to passive engagement, selective attention, and informational homogenization. Effective use requires metacognitive engagement to craft good prompts, verify AI outputs, and critically engage with information. We developed MetaCues, a novel GenAI-based interactive tool for information seeking that delivers metacognitive cues alongside AI responses and a note-taking interface to guide users’ search and associated learning. Through an online study (N = 146), we compared MetaCues to a baseline tool without cues, across two broad search topics that required participants to explore diverse perspectives in order to make informed judgments. Preliminary findings regarding participants’ search behavior show that MetaCues leads to increased confidence in attitudinal judgments about the search topic as well as broader inquiry, with the latter effect emerging primarily for the topic that was less controversial and with which participants had relatively less familiarity. Accordingly, we outline directions for future qualitative exploration of search interactions and inquiry patterns.

Keywords: Generative AI, information seeking, metacognitive engagement, critical engagement, AI search tools, user interaction design, cognitive offloading, note-taking interface

80. ❌ HyEvo: Self-Evolving Hybrid Agentic Workflows for Efficient Reasoning

Authors: Beibei Xu, Yutong Ye, Chuyun Shen, Yingbo Zhou, Cheng Chen, Mingsong Chen Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19639v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	15.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	10.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	8.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

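Scoring rationale: The paper's core topic is HyEvo, a framework for automatically generating LLM-driven agentic workflows, which mixes probabilistic LLM nodes with deterministic code nodes to improve reasoning efficiency and performance. Highly relevant keywords: LLM Agents/Autonomous Agents/Agentic Workflow (core topic, 15 points), Large Language Models/LLMs/Foundation Models (underlying technology, 10 points), and Tool Use/Function Calling/API Tool Use (code-node execution, 10 points). Moderately relevant keywords: Chain of Thought/CoT Reasoning/Multi-step Reasoning (reasoning over complex tasks, 8 points), System 2 Thinking/Slow Thinking/In-depth Reasoning (in-depth reasoning, 8 points), Self-Correction/Self-Improvement/Self-Reflection (iterative refinement mechanism, 8 points), and Speculative Decoding/Inference Acceleration (reduced inference latency, 8 points). The remaining keywords are unrelated to or not covered by the paper and score 0.

!!! tip deepseek-chat TL;DR

To address the inefficiency and underperformance of existing LLM-only agentic workflows, the paper proposes HyEvo, which combines heterogeneous atomic synthesis of LLM nodes and code nodes with an evolutionary strategy. It significantly outperforms existing methods on multiple reasoning and coding benchmarks while substantially reducing inference cost and execution latency.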

摘要翻译

尽管智能体工作流在解决复杂任务方面展现出巨大潜力，但现有的自动化生成方法仍效率低下且性能不足，因为它们依赖于预定义的操作符库和同质的纯大语言模型工作流——其中所有任务级计算均通过概率推理完成。为应对这些局限，我们提出HyEvo框架，一种基于异构原子合成的自动化工作流生成方法。HyEvo将用于语义推理的概率型大语言模型节点与基于规则执行的确定性代码节点相结合，将可预测操作从大语言模型推理中卸载，从而降低推理成本与执行延迟。为高效探索混合搜索空间，HyEvo采用大语言模型驱动的多岛进化策略，配合“先反思后生成”机制，通过执行反馈迭代优化工作流拓扑结构与节点逻辑。综合实验表明，HyEvo在多种推理与代码生成基准测试中均持续优于现有方法，相比当前最先进的开源基线，其推理成本与执行延迟最高可分别降低19倍与16倍。

摘要 (Abstract)

Although agentic workflows have demonstrated strong potential for solving complex tasks, existing automated generation methods remain inefficient and underperform, as they rely on predefined operator libraries and homogeneous LLM-only workflows in which all task-level computation is performed through probabilistic inference. To address these limitations, we propose HyEvo, an automated workflow-generation framework that leverages heterogeneous atomic synthesis. HyEvo integrates probabilistic LLM nodes for semantic reasoning with deterministic code nodes for rule-based execution, offloading predictable operations from LLM inference and reducing inference cost and execution latency. To efficiently navigate the hybrid search space, HyEvo employs an LLM-driven multi-island evolutionary strategy with a reflect-then-generate mechanism, iteratively refining both workflow topology and node logic via execution feedback. Comprehensive experiments show that HyEvo consistently outperforms existing methods across diverse reasoning and coding benchmarks, while reducing inference cost and execution latency by up to 19× and 16×, respectively, compared to the state-of-the-art open-source baseline.

Keywords: Agentic Workflow, LLM Agents, Hybrid Workflow, Automated Workflow Generation, Reasoning Efficiency, Inference Cost Reduction, Evolutionary Strategy, Heterogeneous Atomic Synthesis
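
The idea of offloading predictable operations from LLM nodes to deterministic code nodes can be sketched as a tiny two-node workflow. This is an illustration under stated assumptions, not HyEvo's implementation: the `Node` class, `run_workflow`, and the stubbed `fake_llm_extract` (which stands in for a real model call) are all hypothetical, and the evolutionary search over workflows is not shown.

```python
import ast
import operator
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Node:
    name: str
    kind: str  # "llm" (probabilistic) or "code" (deterministic)
    fn: Callable[[str], str]

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_arithmetic(expr: str) -> str:
    """Code node: evaluate a plain arithmetic expression without calling an LLM."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return str(walk(ast.parse(expr, mode="eval").body))

def fake_llm_extract(task: str) -> str:
    """Stub for an LLM node that turns a word problem into an expression."""
    return "12 * (3 + 4)"

def run_workflow(nodes: List[Node], task: str) -> Tuple[str, int]:
    """Run nodes in sequence and count how many LLM calls were needed."""
    state, llm_calls = task, 0
    for node in nodes:
        if node.kind == "llm":
            llm_calls += 1
        state = node.fn(state)
    return state, llm_calls

hybrid = [Node("extract", "llm", fake_llm_extract),
          Node("compute", "code", eval_arithmetic)]
answer, calls = run_workflow(hybrid, "12 boxes each hold 3 red and 4 blue balls; how many balls?")
print(answer, calls)
```

An LLM-only version of the same workflow would spend a second model call on the arithmetic step; in the hybrid version that step is free of inference cost and fully deterministic.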

81. ❌ Dual Prompt-Driven Feature Encoding for Nighttime UAV Tracking

Authors: Yiheng Wang, Changhong Fu, Liangliang Yao, Haobo Zuo, Zijie Zhang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19628v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

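Scoring rationale: The paper focuses on the computer vision task of nighttime UAV tracking and proposes a dual prompt-driven feature encoding method (DPTracker), involving techniques such as a pyramid illumination prompter and a dynamic viewpoint prompter. All scoring keywords relate directly to large language models, deep learning principles, or AI-for-science applications, whereas this paper studies object tracking in conventional computer vision and does not involve large models, novel deep learning techniques, or AI-for-science applications, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes a dual prompt-driven feature encoding method (DPTracker) that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to address the performance degradation in nighttime UAV tracking caused by overlooking critical illumination and viewpoint cues. Experiments validate its effectiveness and robustness for nighttime UAV tracking.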

摘要翻译

鲁棒特征编码通过实现对目标外观与运动的精细感知，构成了无人机跟踪的基础，对确保可靠跟踪起着关键作用。然而，现有特征编码方法往往忽视关键的照明与视角线索，而这些线索对于在具有挑战性的夜间条件下实现鲁棒感知至关重要，其缺失会导致跟踪性能下降。为克服上述局限，本研究提出了一种双提示驱动的特征编码方法，该方法整合了提示条件特征自适应与上下文感知提示演化，以促进域不变特征编码。具体而言，研究提出了金字塔照明提示器，用于提取多尺度频率感知的照明提示。动态视角提示器通过调制可变形卷积偏移以适应视角变化，使跟踪器能够学习视角不变特征。大量实验验证了所提出的双提示驱动跟踪器（DPTracker）在应对夜间无人机跟踪任务中的有效性。消融研究凸显了DPTracker中各组件的贡献。在多样化夜间无人机跟踪场景下的实际测试进一步证明了该方法的鲁棒性与实用价值。代码与演示视频可在 https://github.com/yiheng-wang-duke/DPTracker 获取。

摘要 (Abstract)

Robust feature encoding constitutes the foundation of UAV tracking by enabling the nuanced perception of target appearance and motion, thereby playing a pivotal role in ensuring reliable tracking. However, existing feature encoding methods often overlook critical illumination and viewpoint cues, which are essential for robust perception under challenging nighttime conditions, leading to degraded tracking performance. To overcome the above limitation, this work proposes a dual prompt-driven feature encoding method that integrates prompt-conditioned feature adaptation and context-aware prompt evolution to promote domain-invariant feature encoding. Specifically, the pyramid illumination prompter is proposed to extract multi-scale frequency-aware illumination prompts. The dynamic viewpoint prompter modulates deformable convolution offsets to accommodate viewpoint variations, enabling the tracker to learn view-invariant features. Extensive experiments validate the effectiveness of the proposed dual prompt-driven tracker (DPTracker) in tackling nighttime UAV tracking. Ablation studies highlight the contribution of each component in DPTracker. Real-world tests under diverse nighttime UAV tracking scenarios further demonstrate the robustness and practical utility. The code and demo videos are available at https://github.com/yiheng-wang-duke/DPTracker.

Keywords: UAV tracking, nighttime tracking, feature encoding, prompt-driven, illumination prompter, viewpoint prompter, domain-invariant features, DPTracker

82. ❌ DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

Authors: Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19621v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

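Scoring rationale: The paper focuses on applying deep reinforcement learning (DRL) to inventory management, improving DRL methods by introducing policy regularizations grounded in classical inventory concepts such as "Base Stock". Although the paper involves a business application of AI, all keywords relate directly to large models, deep learning principles, or AI for Science (bioinformatics/cheminformatics), whereas the core of this paper is reinforcement learning. It does not involve any large-model techniques, training methods, inference optimization, alignment techniques, or AI-for-science applications, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper studies how introducing policy regularizations grounded in classical inventory concepts (such as "Base Stock") improves deep reinforcement learning (DRL) methods for inventory management, accelerating hyperparameter tuning and improving final performance, and reports a 100% deployment on Alibaba's Tmall platform.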

摘要翻译

深度强化学习（Deep Reinforcement Learning，DRL）为训练能够利用大数据和计算能力的库存策略提供了一种通用方法。然而，现成的DRL实现效果参差不齐，常常受困于对训练所用超参数的高度敏感性。本文表明，通过施加基于经典库存概念（如“基准库存水平”）的策略正则化，我们能够显著加速超参数调优，并提升多种DRL方法的最终性能。我们报告了在阿里巴巴电商平台天猫上，采用策略正则化的DRL系统100%部署的详细情况。同时，我们进行了广泛的模拟实验，结果表明策略正则化重塑了关于何为库存管理最佳DRL方法的讨论。

摘要 (Abstract)

Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as “Base Stock”, we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba’s e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.

Keywords: Deep Reinforcement Learning, Inventory Management, Policy Regularizations, Base Stock, Hyperparameter Tuning, Alibaba Tmall, Synthetic Experiments, DRL Methods
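
For readers unfamiliar with the "Base Stock" concept the abstract refers to, the classical rule orders just enough to bring the inventory position up to a target level. The sketch below shows that rule and one possible way to use it as a regularizer. The quadratic penalty and the `strength` parameter are assumptions made for illustration; the abstract does not specify the form of the regularization used in DeepStock.

```python
def base_stock_order(inventory_position, base_stock_level):
    """Classical base-stock rule: order up to the target level, never a negative amount."""
    return max(0.0, base_stock_level - inventory_position)

def regularized_loss(cost, policy_order, inventory_position, base_stock_level, strength):
    """Hypothetical regularizer: penalize deviation of the learned order from the base-stock order."""
    target = base_stock_order(inventory_position, base_stock_level)
    return cost + strength * (policy_order - target) ** 2

print(base_stock_order(30.0, 100.0))   # below target: order the gap
print(base_stock_order(120.0, 100.0))  # above target: order nothing
print(regularized_loss(10.0, 60.0, 30.0, 100.0, 0.5))
```

Under this reading, the penalty pulls an otherwise unconstrained DRL policy toward a structure known to work well in inventory theory, which is one plausible reason such regularization could reduce sensitivity to hyperparameters.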

83. ❌ CAF-Score: Calibrating CLAP with LALMs for Reference-free Audio Captioning Evaluation

Authors: Insung Lee, Taeyoung Jeong, Haejun Yoo, Du-Seong Chang, Myoung-Wan Koo Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19615v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

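Scoring rationale: The paper proposes CAF-Score, a reference-free audio captioning evaluation method that combines CLAP and LALMs. Relevant keywords: 'Large Language Models' (8 points), because it uses LALMs; 'Chain of Thought' and 'System 2 Thinking' (8 points each), because it uses LALMs for fine-grained reasoning; 'Hallucination Mitigation' (10 points), the core topic, because it directly addresses hallucination detection; and 'AI for Science' (5 points), because the application to audio captioning evaluation is counted as an application of AI in a scientific domain. Other keywords such as MoE, SFT, and RAG are not covered and score 0.

!!! tip deepseek-chat TL;DR

To address the high cost of reference-based metrics and their difficulty in assessing acoustic fidelity in audio captioning evaluation, the paper proposes CAF-Score, a reference-free evaluation method that calibrates CLAP with LALMs. It effectively detects syntactic inconsistencies and subtle hallucinations and achieves the highest correlation with human judgments on the BRACE benchmark.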

摘要翻译

尽管大型音频语言模型（LALMs）在音频描述生成方面取得了进展，但其稳健评估仍面临困难。基于参考文本的评估方法成本高昂，且往往难以衡量声学保真度；而基于对比语言-音频预训练（CLAP）的方法则常常忽略句法错误和细粒度细节。我们提出CAF-Score这一免参考评估指标，它通过校准CLAP的粗粒度语义对齐与LALMs的细粒度理解及句法感知能力来实现优化。该方法融合对比音频-文本嵌入与LALM推理机制，能有效检测句法不一致性和细微的幻觉生成现象。在BRACE基准测试上的实验表明，我们的方法获得了与人类判断最高的相关性，甚至在复杂场景中超越了基于参考文本的基线模型。这些结果凸显了CAF-Score在免参考音频描述评估中的有效性。代码与结果详见https://github.com/inseong00/CAF-Score。

摘要 (Abstract)

While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP’s coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.

Keywords: CAF-Score, Large Audio-Language Models, reference-free evaluation, audio captioning, hallucination detection, CLAP calibration, syntactic awareness, BRACE benchmark

84. ❌ LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment

Authors: Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Maojun Zhang, Yu Liu, Shen Yan Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19609v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

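Scoring rationale: The paper focuses on UAV visual localization in computer vision, addressing localization in dense urban environments through instance segmentation and synthetic data generation. It does not involve large language models, deep learning principles, or applications of AI in science; none of the keywords is relevant, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes LoD-Loc v3, a new method that uses instance silhouette alignment and synthetic data generation to address the generalization and accuracy problems of UAV visual localization in dense urban environments, significantly outperforming existing methods in cross-scene and dense urban scenarios.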

摘要翻译

本文提出LoD-Loc v3，一种适用于密集城市场景的广义航空视觉定位新方法。先前工作LoD-Loc v2通过将语义建筑轮廓与低细节城市模型进行对齐来实现定位，但其存在两个关键局限：跨场景泛化能力差以及在密集建筑场景中频繁失效。我们的方法通过两项关键创新应对这些挑战。首先，我们开发了新的合成数据生成流程，构建了InsLoD-Loc——迄今为止最大的航空影像实例分割数据集，包含10万张带有精确建筑实例标注的图像。这使得训练模型展现出卓越的零样本泛化能力。其次，我们通过从语义轮廓对齐转向实例轮廓对齐，重构了定位范式，从而显著降低了密集场景中的位姿估计歧义性。大量实验表明，LoD-Loc v3在跨场景和密集城市场景中均以显著优势超越现有先进基准方法，实现了更优性能。项目发布于https://nudt-sawlab.github.io/LoD-Locv3/。

摘要 (Abstract)

We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. While prior work LoD-Loc v2 achieves localization through semantic building silhouette alignment with low-detail city models, it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc - the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise instance building annotations. This enables trained models to exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines, achieving superior performance in both cross-scene and dense urban scenarios with a large margin. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.

Keywords: aerial visual localization, instance silhouette alignment, synthetic data generation, dense urban environments, zero-shot generalization, pose estimation, instance segmentation dataset, cross-scene generalization

85. ❌ FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement

Authors: Ming Hu, Yongsheng Huo, Mingyu Dou, Jianfu Yin, Peng Zhao, Yao Wang, Cong Hu, Bingliang Hu, Quan Wang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19608v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

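Scoring rationale: The paper, FB-CLIP, focuses on fine-grained zero-shot anomaly detection in computer vision and proposes a framework combining multi-strategy textual representations with foreground-background separation. All keywords relate to large language models (LLMs) or deep learning principles, whereas the core of this paper is an improvement on the vision-language model CLIP; it is a computer vision application rather than research on large-model techniques themselves. Therefore, apart from 'AI for Science OR Bioinformatics OR Cheminformatics' (given 5 points for some relevance through its industrial/medical applications), all keywords are unrelated to large-model principles or innovations and score 0.

!!! tip deepseek-chat TL;DR

To address the zero-shot challenge of fine-grained anomaly detection in industrial and medical settings, the paper proposes the FB-CLIP framework, which improves anomaly localization accuracy through foreground-background disentanglement and multi-strategy textual representations.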

摘要翻译

细粒度异常检测在工业与医疗应用中至关重要，但标注异常样本往往稀缺，使得零样本检测面临挑战。尽管如CLIP等视觉-语言模型提供了有前景的解决方案，它们仍受限于前景-背景特征纠缠与粗粒度的文本语义。我们提出FB-CLIP框架，该框架通过多策略文本表征与前景-背景分离来增强异常定位能力。在文本模态中，它结合了End-of-Text特征、全局池化表征与注意力加权的词元特征，以获取更丰富的语义线索。在视觉模态中，沿身份、语义和空间维度的多视角软分离，结合背景抑制，减少了干扰并提升了判别能力。语义一致性正则化（Semantic Consistency Regularization, SCR）将图像特征与正常及异常的文本原型对齐，抑制不确定匹配并扩大语义间隙。实验表明，FB-CLIP能有效区分复杂背景中的异常，在零样本设定下实现了精确的细粒度异常检测与定位。

摘要 (Abstract)

Fine-grained anomaly detection is crucial in industrial and medical applications, but labeled anomalies are often scarce, making zero-shot detection challenging. While vision-language models like CLIP offer promising solutions, they struggle with foreground-background feature entanglement and coarse textual semantics. We propose FB-CLIP, a framework that enhances anomaly localization via multi-strategy textual representations and foreground-background separation. In the textual modality, it combines End-of-Text features, global-pooled representations, and attention-weighted token features for richer semantic cues. In the visual modality, multi-view soft separation along identity, semantic, and spatial dimensions, together with background suppression, reduces interference and improves discriminability. Semantic Consistency Regularization (SCR) aligns image features with normal and abnormal textual prototypes, suppressing uncertain matches and enlarging semantic gaps. Experiments show that FB-CLIP effectively distinguishes anomalies from complex backgrounds, achieving accurate fine-grained anomaly detection and localization under zero-shot settings.

Keywords: fine-grained anomaly detection, zero-shot detection, vision-language models, foreground-background disentanglement, CLIP, anomaly localization, semantic consistency regularization, multi-view soft separation

86. ❌ Physics-Informed Neural Network with Adaptive Clustering Learning Mechanism for Information Popularity Prediction

Authors: Guangyin Jin, Xiaohan Ni, Yanjie Song, Kun Wei, Jie Zhao, Leiming Jia, Witold Pedrycz Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19599v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

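Scoring rationale: The paper studies popularity prediction for information cascades and proposes PIACN, a model that combines a physics-informed neural network with an adaptive clustering learning mechanism. Although it applies deep learning to information diffusion, its content has no direct connection to any scoring keyword: (1) it does not involve large language models (LLMs), small language models (SLMs), or foundation models; (2) it does not mention MoE, scaling laws, pre-training, post-training, alignment, RLHF, PEFT, RAG, context window extension, attention optimization, reasoning methods (such as CoT), agent systems, model compression, inference acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or similar techniques; (3) it is an AI application, but not in a scientific field such as bioinformatics or cheminformatics. The core of the paper is the use of a physics-informed neural network and adaptive clustering for information diffusion prediction, which does not overlap with the large-model and deep-learning innovations covered by the keyword list.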

!!! tip deepseek-chat TL;DR

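The paper proposes PIACN, a model that combines a physics-informed neural network with an adaptive clustering learning mechanism to predict the popularity of information cascades. It reports significantly better results than existing methods on three real-world datasets.

Translated abstract

As society enters the Internet era, the volume and propagation speed of data and information keep growing. Predicting the popularity of information cascades helps Internet platforms deliver high-value information and monitor public opinion. Current state-of-the-art models for predicting information popularity mainly rely on deep learning methods such as graph convolutional networks (GCNs) and recurrent neural networks (RNNs), capturing early cascade features and temporal dynamics to predict the popularity increment. However, existing methods mostly focus on the micro-level features of information cascades and neglect the general regularities of their macroscopic propagation patterns; they also fail to consider how information heterogeneity affects spread popularity. To overcome these limitations, we propose PIACN, a physics-informed neural network with an adaptive clustering learning mechanism, for predicting the popularity of information cascades. The model is the first to capture the macroscopic patterns of information dissemination through a physics-informed approach, and it accounts for information heterogeneity through the adaptive clustering learning mechanism. Extensive experiments on three real-world datasets show that the model significantly outperforms other state-of-the-art methods on the information popularity prediction task.

Abstract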

With society entering the Internet era, the volume and speed of data and information have been increasing. Predicting the popularity of information cascades can help with high-value information delivery and public opinion monitoring on the internet platforms. The current state-of-the-art models for predicting information popularity utilize deep learning methods such as graph convolution networks (GCNs) and recurrent neural networks (RNNs) to capture early cascades and temporal features to predict their popularity increments. However, these previous methods mainly focus on the micro features of information cascades, neglecting their general macroscopic patterns. Furthermore, they also lack consideration of the impact of information heterogeneity on spread popularity. To overcome these limitations, we propose a physics-informed neural network with adaptive clustering learning mechanism, PIACN, for predicting the popularity of information cascades. Our proposed model not only models the macroscopic patterns of information dissemination through physics-informed approach for the first time but also considers the influence of information heterogeneity through an adaptive clustering learning mechanism. Extensive experimental results on three real-world datasets demonstrate that our model significantly outperforms other state-of-the-art methods in predicting information popularity.

Keywords: Physics-Informed Neural Network, Adaptive Clustering Learning, Information Popularity Prediction, Information Cascades, Macroscopic Patterns, Information Heterogeneity, Deep Learning, Social Network Analysis

87. ❌ ARMOR: Adaptive Resilience Against Model Poisoning Attacks in Continual Federated Learning for Mobile Indoor Localization

Authors: Danish Gufran, Akhil Singampalli, Sudeep Pasricha Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19594v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

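Scoring rationale: The paper focuses on federated learning (FL) and continual federated learning (CFL) for mobile indoor localization and proposes the ARMOR framework to defend against model poisoning attacks. All scoring keywords relate directly to large-model (LLM) technology, innovations in deep learning principles, or AI applications in science, whereas this paper studies a federated learning framework and a localization system. It does not involve large-model technology, deep learning innovation, or scientific AI applications such as biomedicine, so the relevance of every keyword is 0.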

!!! tip deepseek-chat TL;DR

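The paper proposes the ARMOR framework, which uses a state-space model to monitor and predict the evolution of global model weights and thereby defends against model poisoning attacks in continual federated learning. Experiments under real-world conditions report up to an 8.0x reduction in mean error and a 4.97x reduction in worst-case error compared with state-of-the-art indoor localization frameworks.

Translated abstract

Indoor localization is increasingly important for applications ranging from asset tracking to personalized service delivery. Federated learning (FL) offers a privacy-preserving approach by training a centralized global model (GM) on distributed data from mobile devices without sharing raw data. Real-world deployments, however, require a continual federated learning (CFL) setting in which the global model keeps receiving updates under device heterogeneity and changing indoor environments. Under such dynamic conditions, erroneous or biased updates can push the global model away from its expected learning trajectory, gradually degrading its internal representations and localization performance. This vulnerability is further aggravated by adversarial model poisoning attacks. To address the challenge, we propose ARMOR, a novel CFL-based framework that monitors and protects the global model during continual updates. ARMOR introduces a novel state-space model (SSM) that learns the historical evolution of the global model's weight tensors and predicts their expected next state. By comparing incoming local updates with this SSM prediction, ARMOR detects deviations and selectively mitigates corrupted updates before local updates are aggregated with the global model. This mechanism lets the system adapt robustly to time-varying environmental dynamics and reduces the impact of model poisoning attacks while preventing corruption of the global model. Experimental evaluation under real-world conditions shows that ARMOR achieves notable improvements over state-of-the-art indoor localization frameworks, with up to an 8.0x reduction in mean error and a 4.97x reduction in worst-case error, demonstrating strong resilience in model corruption scenarios tested with real-world data and mobile devices.

Abstract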

Indoor localization has become increasingly essential for applications ranging from asset tracking to delivering personalized services. Federated learning (FL) offers a privacy-preserving approach by training a centralized global model (GM) using distributed data from mobile devices without sharing raw data. However, real-world deployments require a continual federated learning (CFL) setting, where the GM receives continual updates under device heterogeneity and evolving indoor environments. In such dynamic conditions, erroneous or biased updates can cause the GM to deviate from its expected learning trajectory, gradually degrading internal GM representations and GM localization performance. This vulnerability is further exacerbated by adversarial model poisoning attacks. To address this challenge, we propose ARMOR, a novel CFL-based framework that monitors and safeguards the GM during continual updates. ARMOR introduces a novel state-space model (SSM) that learns the historical evolution of GM weight tensors and predicts the expected next state of weight tensors of the GM. By comparing incoming local updates with this SSM projection, ARMOR detects deviations and selectively mitigates corrupted updates before local updates are aggregated with the GM. This mechanism enables robust adaptation to temporal environmental dynamics and mitigate the effects of model poisoning attacks while preventing GM corruption. Experimental evaluations in real-world conditions indicate that ARMOR achieves notable improvements, with up to 8.0x reduction in mean error and 4.97x reduction in worst-case error compared to state-of-the-art indoor localization frameworks, demonstrating strong resilience against model corruption tested using real-world data and mobile devices.
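The deviation check described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the SSM is replaced by a simple linear extrapolation of the last two global weight vectors, the deviation measure is an L2 distance with a hypothetical `threshold`, and deviating updates are simply dropped, whereas the paper speaks of selectively mitigating them.

```python
import math

def predict_next_state(history):
    """Hypothetical stand-in for the SSM: extrapolate linearly from the last two global states."""
    prev, last = history[-2], history[-1]
    return [2 * b - a for a, b in zip(prev, last)]

def l2_distance(u, v):
    """Euclidean distance between two flat weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def filter_updates(history, local_updates, threshold):
    """Keep only the local updates that lie within `threshold` of the predicted next state."""
    expected = predict_next_state(history)
    return [u for u in local_updates if l2_distance(u, expected) <= threshold]

def aggregate(updates):
    """Unweighted average of the accepted updates (assumes at least one was accepted)."""
    n = len(updates)
    return [sum(vals) / n for vals in zip(*updates)]

# Two past global weight vectors and three incoming local updates;
# the last update mimics a poisoned contribution.
history = [[0.0, 0.0], [0.1, 0.2]]
local_updates = [[0.2, 0.4], [0.25, 0.35], [5.0, -3.0]]

accepted = filter_updates(history, local_updates, threshold=0.5)
new_global = aggregate(accepted)
print(accepted)
print(new_global)
```

In this toy run the outlier update is rejected before aggregation, so it never influences the averaged global vector. A real system would need a learned predictor and a principled way to choose the threshold.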

Keywords: Continual Federated Learning, Model Poisoning Attacks, Indoor Localization, State-Space Model, Mobile Devices, Adaptive Resilience, Weight Tensor Prediction, Real-world Evaluation

88. ❌ Data-driven ensemble prediction of the global ocean

Authors: Qiusheng Huang, Xiaohui Zhong, Anboyu Guo, Ziyi Peng, Lei Chen, Hao Li Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19591v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

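Scoring rationale: The paper focuses on a machine learning application in ocean science, specifically the FuXi-ONS system for probabilistic global ocean prediction. All keywords relate directly to large language models (LLMs), the principles of deep learning techniques, or specific AI techniques (such as MoE, RLHF, and RAG), whereas this paper applies machine learning to ocean forecasting and does not involve LLMs or any of the listed techniques. The only keyword with some relevance is "AI for Science", because the paper is an application of AI in a scientific field (oceanography); it is not a core match, so it receives a relevance of 5 (some connection). All other keywords are unrelated and receive 0.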

!!! tip deepseek-chat TL;DR

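The paper tackles the challenge of extending machine learning to probabilistic global ocean prediction. It develops FuXi-ONS, an ensemble system that provides 5-day forecasts out to 365 days, improves on deterministic and noise-perturbed baselines, and runs orders of magnitude faster than conventional ensemble systems.

Translated abstract

Data-driven models have advanced deterministic ocean forecasting, but extending machine learning to probabilistic global ocean prediction remains an open challenge. This paper introduces FuXi-ONS, the first machine-learning ensemble forecasting system for the global ocean. On a global 1° grid it provides 5-day forecasts, out to 365 days, of sea-surface temperature, sea-surface height, subsurface temperature, salinity, and ocean currents. Rather than relying on repeated integration of computationally expensive numerical models, FuXi-ONS learns physically structured perturbations and incorporates an atmospheric encoding module to stabilize long-range forecasts. Evaluation against the GLORYS12 reanalysis shows that FuXi-ONS improves both ensemble-mean skill and probabilistic forecast quality relative to deterministic and noise-perturbed baselines. For sea-surface temperature and Niño3.4 variability, it is competitive with established seasonal forecast references while running orders of magnitude faster than conventional ensemble systems. These results offer a strong example of machine learning advancing a core problem in ocean science and open a practical path toward efficient probabilistic ocean forecasting and climate risk assessment.

Abstract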

Data-driven models have advanced deterministic ocean forecasting, but extending machine learning to probabilistic global ocean prediction remains an open challenge. Here we introduce FuXi-ONS, the first machine-learning ensemble forecasting system for the global ocean, providing 5-day forecasts on a global 1° grid up to 365 days for sea-surface temperature, sea-surface height, subsurface temperature, salinity and ocean currents. Rather than relying on repeated integration of computationally expensive numerical models, FuXi-ONS learns physically structured perturbations and incorporates an atmospheric encoding module to stabilize long-range forecasts. Evaluated against GLORYS12 reanalysis, FuXi-ONS improves both ensemble-mean skill and probabilistic forecast quality relative to deterministic and noise-perturbed baselines, and shows competitive performance against established seasonal forecast references for SST and Niño3.4 variability, while running orders of magnitude faster than conventional ensemble systems. These results provide a strong example of machine learning advancing a core problem in ocean science, and establish a practical path toward efficient probabilistic ocean forecasting and climate risk assessment.

Keywords: machine learning, ensemble forecasting, global ocean prediction, probabilistic forecasting, FuXi-ONS, sea-surface temperature, ocean currents, climate risk assessment

89. ❌ Skilled AI Agents for Embedded and IoT Systems Development

Authors: Yiming Li, Yuhan Cheng, Mingchen Ma, Yihang Zou, Ningyuan Yang, Wei Cheng, Hai "Helen" Li, Yiran Chen, Tingjun Chen Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19583v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

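Scoring rationale: The paper explicitly uses LLMs and agentic systems for embedded systems development, so it is highly relevant (10) to "Large Language Models OR LLMs OR Foundation Models" and "LLM Agents OR Autonomous Agents OR Agentic Workflow". It concentrates on the specific application to embedded and IoT systems and does not address the technical principles or innovations behind the other keywords, which therefore receive 0.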

!!! tip deepseek-chat TL;DR

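The paper targets deployment failures that LLM agents encounter in hardware-in-the-loop embedded and IoT development because of tight software-hardware coupling. It proposes a skills-based agentic framework and the IoT-SkillsBench benchmark, and its hardware-validated experiments show that human-expert skills achieve near-perfect success rates.

Translated abstract

Large language models (LLMs) and agentic systems have shown promise for automated software development, but applying them to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) systems remains challenging, mainly because software logic is tightly coupled with physical hardware behavior. Code that compiles successfully may still fail when deployed to a real device because of timing constraints, peripheral initialization requirements, or hardware-specific behavior. To address this challenge, we propose a skills-based agentic framework for HIL embedded development together with IoT-SkillsBench, a benchmark designed to systematically evaluate AI agents in real embedded programming environments. IoT-SkillsBench covers three representative embedded platforms, 23 peripherals, and 42 tasks across three difficulty levels. Each task is evaluated under three agent configurations (no skills, LLM-generated skills, and human-expert skills) and validated through execution on real hardware. Across a total of 378 hardware-validated experiments, we find that concise human-expert skills with structured expert knowledge enable near-perfect success rates across platforms.

Abstract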

Large language models (LLMs) and agentic systems have shown promise for automated software development, but applying them to hardware-in-the-loop (HIL) embedded and Internet-of-Things (IoT) systems remains challenging due to the tight coupling between software logic and physical hardware behavior. Code that compiles successfully may still fail when deployed on real devices because of timing constraints, peripheral initialization requirements, or hardware-specific behaviors. To address this challenge, we introduce a skills-based agentic framework for HIL embedded development together with IoT-SkillsBench, a benchmark designed to systematically evaluate AI agents in real embedded programming environments. IoT-SkillsBench spans three representative embedded platforms, 23 peripherals, and 42 tasks across three difficulty levels, where each task is evaluated under three agent configurations (no-skills, LLM-generated skills, and human-expert skills) and validated through real hardware execution. Across 378 hardware validated experiments, we show that concise human-expert skills with structured expert knowledge enable near-perfect success rates across platforms.

Keywords: Large Language Models, LLMs, agentic systems, embedded systems, IoT systems, hardware-in-the-loop, skills-based framework, IoT-SkillsBench

90. ❌ PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning

Authors: Tianmeng Hu, Biao Luo Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19579v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

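Scoring rationale: The paper focuses on multi-objective reinforcement learning (MORL). It proposes PA2D-MORL, an algorithm based on Pareto ascent directional decomposition for multi-objective decision problems, and validates it on robot control tasks. The content centers entirely on classic reinforcement learning topics such as algorithm design, multi-objective optimization, and policy gradient methods, and does not touch on large language models, principles of deep learning techniques, large-model applications, or AI for Science. All keywords relate to large models, deep learning techniques, or applications in specific scientific fields, whereas this paper is purely a reinforcement learning methods study, so the relevance of every keyword is 0.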

!!! tip deepseek-chat TL;DR

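The paper proposes PA2D-MORL, a multi-objective reinforcement learning method based on Pareto ascent directional decomposition. It addresses the low quality of Pareto policy set approximations in multi-objective decision problems and outperforms the current state-of-the-art algorithm on robot control tasks.

Translated abstract

Multi-objective reinforcement learning (MORL) provides an effective solution for decision problems with conflicting objectives. However, achieving a high-quality approximation of the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action spaces. This paper proposes the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which builds an efficient scheme for multi-objective problem decomposition and policy improvement and thereby yields a better approximation of the Pareto policy set. The method uses the Pareto ascent direction to select scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. In addition, a Pareto adaptive fine-tuning approach is applied to increase the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in both the quality and the stability of the results.

Abstract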

Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.
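For readers unfamiliar with the term, the notion of a Pareto set used in the abstract can be illustrated with a minimal non-dominated filter. This sketch only shows Pareto dominance over candidate return vectors (assuming every objective is maximized); it is not the paper's Pareto ascent direction or policy-gradient procedure, and the `returns` values are made up for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the points that no other point dominates (maximization on all objectives)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical two-objective returns of five candidate policies.
returns = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(returns)
print(front)
```

Here `(1.5, 3.0)` and `(0.5, 0.5)` are dominated by `(2.0, 4.0)`, so three candidates remain on the front. A MORL method aims to find policies whose returns approximate this front densely and broadly.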

Keywords: Multi-objective reinforcement learning, Pareto policy set, Pareto ascent direction, Policy gradient, Evolutionary framework, Robot control, Multi-objective optimization, Scalarization weights

Authors: Soorya Ram Shimgekar, Vipin Gunda, Jiwon Kim, Violeta J. Rodriguez, Hari Sundaram, Koustuv Saha Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19574v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	5.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

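Scoring rationale: The paper studies how conversational AI (specifically large language models such as GPT, LLaMA, and Qwen) interacts with users' delusion-related language, which makes it an application study of large models in mental health and safety. The core keyword is "Large Language Models" (10), because the paper directly tests several LLM families. "Self-Correction" and "Hallucination Mitigation" each receive 5, because the paper deals with AI safety mechanisms (state-aware safety mechanisms) and risk reduction, which are loosely related to these concepts. The other keywords (such as MoE, Scaling Laws, and RLHF) are not covered by the paper and receive 0.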

!!! tip deepseek-chat TL;DR

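The paper examines whether prolonged interaction with conversational AI amplifies users' delusion-related language. It finds that simulated users derived from users with a history of delusion-related discourse show progressively increasing delusion-related language over a conversation, and that conditioning the AI's responses on the current DelusionScore substantially reduces this amplification.

Translated abstract

Conversational AI systems are increasingly used for personal reflection and emotional disclosure, which raises concerns about their effects on vulnerable users. Recent anecdotal reports suggest that prolonged interaction with AI may reinforce delusional thinking, a phenomenon sometimes called "AI Psychosis". Empirical evidence on this phenomenon, however, remains limited. This study examines how delusion-related language evolves over multi-turn interactions with conversational AI. We build simulated users (SimUsers) from the longitudinal posting histories of Reddit users and generate extended conversations with three model families (GPT, LLaMA, and Qwen). We develop DelusionScore, a linguistic measure that quantifies the intensity of delusion-related language across conversational turns. We find that SimUsers derived from users with prior delusion-related discourse (Treatment) show progressively rising DelusionScore trajectories, whereas those derived from users without such discourse (Control) remain stable or decline. We further find that the amplification varies across themes, with reality skepticism and compulsive reasoning showing the strongest increases. Finally, conditioning the AI's responses on the current DelusionScore substantially reduces these trajectories. These findings provide empirical evidence that extended use of conversational AI can amplify delusion-related language, and they highlight the importance of state-aware safety mechanisms for mitigating such risks.

Abstract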

Conversational AI systems are increasingly used for personal reflection and emotional disclosure, raising concerns about their effects on vulnerable users. Recent anecdotal reports suggest that prolonged interactions with AI may reinforce delusional thinking – a phenomenon sometimes described as AI Psychosis. However, empirical evidence on this phenomenon remains limited. In this work, we examine how delusion-related language evolves during multi-turn interactions with conversational AI. We construct simulated users (SimUsers) from Reddit users’ longitudinal posting histories and generate extended conversations with three model families (GPT, LLaMA, and Qwen). We develop DelusionScore, a linguistic measure that quantifies the intensity of delusion-related language across conversational turns. We find that SimUsers derived from users with prior delusion-related discourse (Treatment) exhibit progressively increasing DelusionScore trajectories, whereas those derived from users without such discourse (Control) remain stable or decline. We further find that this amplification varies across themes, with reality skepticism and compulsive reasoning showing the strongest increases. Finally, conditioning AI responses on current DelusionScore substantially reduces these trajectories. These findings provide empirical evidence that conversational AI interactions can amplify delusion-related language over extended use and highlight the importance of state-aware safety mechanisms for mitigating such risks.

Keywords: Conversational AI, Delusion-related language, AI Psychosis, Large Language Models, Safety mechanisms, Multi-turn interactions, Simulated users, DelusionScore

92. ❌ PFM-VEPAR: Prompting Foundation Models for RGB-Event Camera based Pedestrian Attribute Recognition

Authors: Minghe Xu, Rouying Wu, ChiaWei Chu, Xiao Wang, Yu Li Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19565v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

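Scoring rationale: The paper focuses on multimodal pedestrian attribute recognition in computer vision. It proposes a framework that combines RGB and event cameras, using DCT/IDCT to extract event features, a Hopfield network for associative memory augmentation, and cross-attention fusion. Although the title contains "Prompting Foundation Models", the abstract does not involve large language models, innovations in deep learning principles, or applications in science, so none of the keywords relate to the paper's core content.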

!!! tip deepseek-chat TL;DR

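The paper proposes a pedestrian attribute recognition framework based on RGB and event cameras. It extracts event features with lightweight DCT/IDCT operations, strengthens associative memory with Hopfield networks, and fuses the modalities with cross-attention, targeting low-light and motion-blur scenarios; experiments on multiple benchmark datasets validate its effectiveness.

Translated abstract

Event-based pedestrian attribute recognition (PAR) uses motion cues to strengthen RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes such as age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and ignore the valuable guidance available from contextual samples. To address these limitations, this paper proposes an Event Prompter. The module discards the computationally expensive auxiliary backbone and applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations directly to the event data. This design extracts frequency-domain event features at very low computational cost and thereby effectively augments the RGB branch. In addition, an external memory bank that provides rich prior knowledge is combined with modern Hopfield networks to enable associative memory-augmented representation learning. This mechanism effectively mines and exploits global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, and feed-forward networks perform attribute prediction. Extensive experiments on multiple benchmark datasets validate the effectiveness of the proposed RGB-Event PAR framework. The source code will be released at https://github.com/Event-AHU/OpenPAR

Abstract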

Event-based pedestrian attribute recognition (PAR) leverages motion cues to enhance RGB cameras in low-light and motion-blur scenarios, enabling more accurate inference of attributes like age and emotion. However, existing two-stream multimodal fusion methods introduce significant computational overhead and neglect the valuable guidance from contextual samples. To address these limitations, this paper proposes an Event Prompter. Discarding the computationally expensive auxiliary backbone, this module directly applies extremely lightweight and efficient Discrete Cosine Transform (DCT) and Inverse DCT (IDCT) operations to the event data. This design extracts frequency-domain event features at a minimal computational cost, thereby effectively augmenting the RGB branch. Furthermore, an external memory bank designed to provide rich prior knowledge, combined with modern Hopfield networks, enables associative memory-augmented representation learning. This mechanism effectively mines and leverages global relational knowledge across different samples. Finally, a cross-attention mechanism fuses the RGB and event modalities, followed by feed-forward networks for attribute prediction. Extensive experiments on multiple benchmark datasets fully validate the effectiveness of the proposed RGB-Event PAR framework. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
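As background for the DCT/IDCT step mentioned in the abstract, the following sketch shows a plain orthonormal DCT-II and its inverse on a short 1-D signal, plus a low-pass reconstruction that keeps only the first few coefficients. It is a textbook transform written in pure Python for illustration; how the paper's Event Prompter actually applies DCT/IDCT to event data is not specified here, and the `signal` values are made up.

```python
import math

def _scale(k, n):
    """Orthonormal scaling factor for coefficient index k of an n-point DCT-II."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct(c):
    """Inverse of the orthonormal DCT-II (an orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]

signal = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]
coeffs = dct(signal)

# Full round trip recovers the signal up to floating-point error.
restored = idct(coeffs)

# Keeping only the four lowest-frequency coefficients gives a smoothed version.
low_pass = coeffs[:4] + [0.0] * 4
smoothed = idct(low_pass)
print([round(v, 3) for v in restored])
print([round(v, 3) for v in smoothed])
```

The transform costs no learned parameters, which is consistent with the abstract's emphasis on a lightweight alternative to an auxiliary backbone. In practice a library routine would replace this quadratic-time loop.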

Keywords: pedestrian attribute recognition, event camera, RGB-Event fusion, Discrete Cosine Transform, Hopfield networks, cross-attention, multimodal learning, computer vision

93. ❌ Dual-Domain Representation Alignment: Bridging 2D and 3D Vision via Geometry-Aware Architecture Search

Authors: Haoyu Zhang, Zhihao Yu, Rui Wang, Yaochu Jin, Qiqi Liu, Ran Cheng Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19563v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

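Scoring rationale: The paper focuses on neural architecture search (NAS) and knowledge distillation in computer vision, aiming to make large vision models (LVMs) more efficient to deploy on edge devices. Its core contributions are: 1) the EvoNAS framework, which uses evolutionary algorithms for multi-objective architecture search; 2) a VSS-ViT hybrid supernet; 3) the CA-DDKD knowledge distillation strategy, which improves representational capacity and ranking consistency; 4) the DMMPE distributed evaluation framework, which speeds up validation. All scoring keywords relate directly to large language models (LLMs) or to applications of large models in science, whereas this paper studies vision models (ViT, VSS, etc.), a different technical area. The paper does not involve any LLM-related technique (pre-training, alignment, reasoning, agents, etc.), nor does it mention scientific AI applications such as bioinformatics or cheminformatics. All keywords therefore receive a relevance of 0.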

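!!! tip deepseek-chat TL;DR

To address the high inference cost of deploying large vision models on edge devices, the paper proposes EvoNAS, an efficient distributed evolutionary neural architecture search framework. Using a hybrid supernet design and a cross-architecture dual-domain knowledge distillation strategy, it finds vision architectures (EvoNets) that achieve Pareto-optimal trade-offs between accuracy and efficiency.

Abstract (Chinese translation)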

现代计算机视觉需要在预测精度与实时效率之间取得平衡，然而大型视觉模型的高推理成本限制了其在资源受限的边缘设备上的部署。尽管进化神经架构搜索天然适用于多目标优化，但其实际应用受到两大问题的阻碍：昂贵的候选模型评估以及子网络间的排序不一致性。为解决这些问题，我们提出了EvoNAS，一个面向多目标进化架构搜索的高效分布式框架。我们构建了一个融合视觉状态空间与视觉Transformer模块的混合超网络，并通过跨架构双领域知识蒸馏策略对其进行优化。通过结合VSS模块的计算效率与ViT模块的语义表达能力，CA-DDKD提升了共享超网络的表征能力并增强了排序一致性，从而在进化过程中无需额外微调即可实现可靠的适应度评估。为降低大规模验证成本，我们进一步提出了基于GPU资源池化与异步调度的分布式多模型并行评估框架。相比传统数据并行评估方法，DMMPE通过多GPU多模型并发执行将效率提升超过70%。在COCO、ADE20K、KITTI和NYU-Depth v2数据集上的实验表明，搜索得到的架构（命名为EvoNets）能够持续实现精度与效率间的帕累托最优权衡。与代表性的基于CNN、ViT及Mamba的模型相比，EvoNets在严格计算预算下实现了更低的推理延迟与更高的吞吐量，同时在新视角合成等下游任务中保持强大的泛化能力。代码已开源：https://github.com/EMI-Group/evonas

Abstract

Modern computer vision requires balancing predictive accuracy with real-time efficiency, yet the high inference cost of large vision models (LVMs) limits deployment on resource-constrained edge devices. Although Evolutionary Neural Architecture Search (ENAS) is well suited for multi-objective optimization, its practical use is hindered by two issues: expensive candidate evaluation and ranking inconsistency among subnetworks. To address them, we propose EvoNAS, an efficient distributed framework for multi-objective evolutionary architecture search. We build a hybrid supernet that integrates Vision State Space and Vision Transformer (VSS-ViT) modules, and optimize it with a Cross-Architecture Dual-Domain Knowledge Distillation (CA-DDKD) strategy. By coupling the computational efficiency of VSS blocks with the semantic expressiveness of ViT modules, CA-DDKD improves the representational capacity of the shared supernet and enhances ranking consistency, enabling reliable fitness estimation during evolution without extra fine-tuning. To reduce the cost of large-scale validation, we further introduce a Distributed Multi-Model Parallel Evaluation (DMMPE) framework based on GPU resource pooling and asynchronous scheduling. Compared with conventional data-parallel evaluation, DMMPE improves efficiency by over 70% through concurrent multi-GPU, multi-model execution. Experiments on COCO, ADE20K, KITTI, and NYU-Depth v2 show that the searched architectures, termed EvoNets, consistently achieve Pareto-optimal trade-offs between accuracy and efficiency. Compared with representative CNN-, ViT-, and Mamba-based models, EvoNets deliver lower inference latency and higher throughput under strict computational budgets while maintaining strong generalization on downstream tasks such as novel view synthesis. Code is available at https://github.com/EMI-Group/evonas

Keywords: Evolutionary Neural Architecture Search, Multi-objective Optimization, Vision Transformer, Knowledge Distillation, Edge Computing, Model Efficiency, Distributed Evaluation, VSS-ViT Hybrid
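
To make "Pareto-optimal trade-offs between accuracy and efficiency" concrete, here is a minimal sketch of non-dominated filtering over two objectives. It is not the EvoNAS implementation: the candidate names, accuracy values, and latencies are invented for illustration, and the helper names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Tuple

# (name, accuracy, latency_ms); all values are made up for illustration.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one.

    Accuracy is maximized, latency is minimized.
    """
    no_worse = a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


candidates = [
    ("net_a", 0.80, 12.0),
    ("net_b", 0.78, 9.0),
    ("net_c", 0.75, 11.0),  # dominated by net_b: less accurate and slower
    ("net_d", 0.82, 20.0),
]
front = pareto_front(candidates)
print([c[0] for c in front])
```

A multi-objective evolutionary search would typically apply a filter of this kind to each generation; the quadratic scan here is only meant to show the dominance relation, not to be efficient.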

94. ❌ Optimal Scalar Quantization for Matrix Multiplication: Closed-Form Density and Phase Transition

Authors: Calvin Ang, Sungyoon Kim, Mert Pilanci Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19559v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

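Scoring rationale: The paper studies scalar quantization for matrix multiplication, derives the optimal quantization density, and identifies a correlation-driven phase transition. The only highly relevant keyword is 'Quantization OR Model Compression OR Low-bit Weights' (10), since the paper is specifically about quantization. 'Large Language Models OR LLMs OR Foundation Models' (5) and 'Speculative Decoding OR Inference Acceleration' (5) are partially relevant because the end of the abstract mentions applying the method to quantizing LLM activations, which falls under model compression and inference acceleration. The remaining keywords have no direct connection to the paper and receive 0.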

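!!! tip deepseek-chat TL;DR

The paper studies scalar quantization applied before matrix multiplication. It derives the optimal quantization density that minimizes mean-squared error, identifies a phase transition for correlated Gaussian multiplicative pairs, and demonstrates the method's applicability to tasks such as quantizing LLM activations.

Abstract (Chinese translation)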

我们研究矩阵乘法前对两个矩阵的逐元素标量量化问题。给定矩阵 $A\in \mathbb{R}^{m\times k}$ 和 $B\in \mathbb{R}^{k\times n}$，我们分别使用每元素 $K_X$ 级和 $K_Y$ 级的标量量化器对 $A$ 和 $B$ 的条目进行独立量化，并计算 $\widehat C=\widehat A\,\widehat B$。目标是在配对独立同分布内积模型下，最小化矩阵乘法的均方误差 $\mathbb{E}[\|AB-\widehat A\widehat B\|_F^2]$。在高分辨率条件 $K_X,K_Y\to\infty$ 下，我们推导出 $\mathcal{E}$ 的精确 $K^{-2}$ 阶渐近展开式，确定了最优的首项常数，并通过条件二阶矩刻画了渐近最优的量化中心密度。随后，我们聚焦于相关高斯乘法配对情形，得到闭式最优点密度 $$\lambda^\star(u) \propto \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2u^2\bigr)^{1/3}, \qquad u=\frac{x}{\sigma_X},$$ 其中 $y/\sigma_Y$ 具有相同形式，并证明了一个相关性驱动的相变：当 $|\rho|\leq 1/\sqrt{3}$ 时密度在原点处呈单峰分布，而当 $|\rho|>1/\sqrt{3}$ 时则转变为双峰分布，峰值位于 $u_{\mathrm{peak}}=\pm\sqrt{3-1/\rho^2}$。我们通过合成实验（如矩阵乘法量化和最小二乘优化）以及大语言模型中键与查询激活值的量化，展示了所提方法的适用性。

Abstract

We study entrywise scalar quantization of two matrices prior to multiplication. Given $A\in \mathbb{R}^{m\times k}$ and $B\in \mathbb{R}^{k\times n}$, we quantize entries of $A$ and $B$ independently using scalar quantizers with $K_X$ and $K_Y$ levels per entry, and form $\widehat C=\widehat A\,\widehat B$. The objective is to minimize the matrix multiplication mean-squared error (MSE) $\mathbb{E}[\|AB-\widehat A\widehat B\|_F^2]$ under a pair-i.i.d. inner-product model. In the high-resolution regime $K_X,K_Y\to\infty$, we derive a sharp $K^{-2}$ asymptotic expansion for $\mathcal{E}$, identify the exact optimal leading constants, and characterize asymptotically optimal quantization center densities in terms of conditional second moments. We then specialize to correlated Gaussian multiplicative pairs, obtaining a closed-form optimal point density $$\lambda^\star(u) \propto \exp\!\left(-\frac{u^2}{6}\right)\bigl((1-\rho^2)+\rho^2u^2\bigr)^{1/3}, \qquad u=\frac{x}{\sigma_X},$$ with the same form for $y/\sigma_Y$, and prove a correlation-driven phase transition: the density is unimodal at the origin for $|\rho|\leq 1/\sqrt{3}$ and becomes bimodal for $|\rho|>1/\sqrt{3}$ with peaks at $u_{\mathrm{peak}}=\pm\sqrt{3-1/\rho^2}$. We show our method’s applicability in synthetic experiments such as matrix multiplication quantization and least squares optimization, as well as quantization of large language model key and query activations.

Keywords: scalar quantization, matrix multiplication, mean-squared error, optimal quantization density, correlation-driven phase transition, large language model quantization, inference acceleration, model compression
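
The phase transition stated in the abstract can be checked numerically from the closed-form density alone. The sketch below is not the authors' code: it evaluates the unnormalized log-density on a grid and compares the maximizer with the closed-form peak $\sqrt{3-1/\rho^2}$. The function names, grid range, and step count are arbitrary illustrative choices, and it assumes $|\rho|<1$ so the logarithm is defined at $u=0$.

```python
import math


def log_density(u: float, rho: float) -> float:
    """Unnormalized log of exp(-u^2/6) * ((1 - rho^2) + rho^2 u^2)^(1/3)."""
    return -u * u / 6.0 + math.log((1.0 - rho**2) + rho**2 * u * u) / 3.0


def predicted_peak(rho: float) -> float:
    """Positive mode from the abstract's formula; 0.0 in the unimodal regime."""
    if abs(rho) <= 1.0 / math.sqrt(3.0):
        return 0.0
    return math.sqrt(3.0 - 1.0 / rho**2)


def grid_peak(rho: float, u_max: float = 4.0, steps: int = 4000) -> float:
    """Locate the maximizer on u >= 0 with a plain grid search (step u_max/steps)."""
    grid = [u_max * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda u: log_density(u, rho))


for rho in (0.3, 0.9):
    print(rho, grid_peak(rho), predicted_peak(rho))
```

Setting the derivative of the log-density to zero gives either $u=0$ or $(1-\rho^2)+\rho^2u^2=2\rho^2$, which rearranges to $u^2=3-1/\rho^2$; that has a real solution only when $|\rho|>1/\sqrt{3}$, matching the stated threshold.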

95. ❌ Plagiarism or Productivity? Students Moral Disengagement and Behavioral Intentions to Use ChatGPT in Academic Writing

Authors: John Paul P. Miranda, Rhiziel P. Manalese, Mark Anthony A. Castro, Renen Paul M. Viado, Vernon Grace M. Maniago, Rudante M. Galapon, Jovita G. Rivera, Amado B. Martinez Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19549v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

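Scoring rationale: The paper studies students' intention to use ChatGPT in academic writing, which is application-level research on large models (LLMs), so it is partially relevant to 'Large Language Models OR LLMs OR Foundation Models' (5). However, it focuses on user behaviour, moral disengagement, and psychological factors rather than on large-model technology itself (architecture, training, inference optimization, etc.) or novel applications in science, so it is entirely unrelated to all the other technical keywords such as MoE, Scaling Laws, PEFT, and RAG (0).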

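!!! tip deepseek-chat TL;DR

The study examines how moral disengagement mechanisms influence Filipino college students' intention to use ChatGPT in academic writing. Mechanisms such as attribution of blame significantly affect attitudes and perceived control, while attitude has the largest effect on behavioural intention. Students often cite institutional gaps and peer behaviour to justify AI use, which points to a need for clear academic integrity policies.

Abstract (Chinese translation)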

本研究探讨了道德推脱如何影响菲律宾大学生在学术写作中使用ChatGPT的意向。该模型检验了五种机制：道德辩护、委婉标签、责任转移、后果最小化和责任归因。这些机制被分析为对态度、主观规范和感知行为控制力的预测因子，进而预测行为意向。共有418名有ChatGPT使用经验的学生参与研究。结果显示，多种道德推脱机制影响了学生的态度和控制感。在所有预测因子中，责任归因的影响最强，而态度对行为意向的影响最大。该模型解释了超过一半的行为意向变异。这些结果表明，学生常依赖制度漏洞和同伴行为来为其使用人工智能的行为辩护。许多人认为将ChatGPT用于学习目的或在规则不明确时使用是可接受的。这表明需要制定明确的学术诚信政策、伦理指导和课堂支持。研究同时指出，基于意向的模型可能无法完全解释学生行为。情感因素、同伴影响和便利性也会影响决策。研究结果为旨在支持高等教育中负责任、有意识使用人工智能的院校提供了有益见解。

Abstract

This study examined how moral disengagement influences Filipino college students’ intention to use ChatGPT in academic writing. The model tested five mechanisms: moral justification, euphemistic labeling, displacement of responsibility, minimizing consequences, and attribution of blame. These mechanisms were analyzed as predictors of attitudes, subjective norms, and perceived behavioral control, which then predicted behavioral intention. A total of 418 students with ChatGPT experience participated. The results showed that several moral disengagement mechanisms influenced students’ attitudes and sense of control. Among the predictors, attribution of blame had the strongest influence, while attitudes had the highest impact on behavioral intention. The model explained more than half of the variation in intention. These results suggest that students often rely on institutional gaps and peer behavior to justify AI use. Many believe it is acceptable to use ChatGPT for learning or when rules are unclear. This shows a need for clear academic integrity policies, ethical guidance, and classroom support. The study also recognizes that intention-based models may not fully explain student behavior. Emotional factors, peer influence, and convenience can also affect decisions. The results provide useful insights for schools that aim to support responsible and informed AI use in higher education.

Keywords: ChatGPT, academic writing, moral disengagement, behavioral intention, Filipino college students, academic integrity, ethical guidance, higher education

96. ❌ Subspace Kernel Learning on Tensor Sequences

Authors: Lei Wang, Xi Ding, Yongsheng Gao, Piotr Koniusz Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19546v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

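Scoring rationale: The paper "Subspace Kernel Learning on Tensor Sequences" proposes a kernel learning method (UKTL) for higher-order tensor sequences, centred on tensor unfolding, subspace comparison, scalable kernel linearization, and uncertainty-aware weighting. Its core is a tensor kernel learning framework applied to action recognition (NTU-60, NTU-120, Kinetics-Skeleton). All keywords relate directly to large models, deep learning techniques, or scientific AI applications, but the paper does not involve any large model (LLM/SLM), training method (pre-training/fine-tuning/alignment), inference technique (CoT/attention optimization), agent system, model compression, or specific scientific domain (e.g., bioinformatics). It has only a weak link to 'Mechanistic Interpretability OR Explainable AI' (it mentions 'interpretability' and 'mode-wise insights'), which is not central, hence a score of 5; all other keywords are unrelated and receive 0.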

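!!! tip deepseek-chat TL;DR

The paper proposes Uncertainty-driven Kernel Tensor Learning (UKTL), a framework for higher-order tensor sequences. It compares mode-wise subspaces obtained from tensor unfoldings and adds scalable Nyström kernel linearization with uncertainty weighting, achieving state-of-the-art performance, strong generalization, and interpretable mode-wise insights on action recognition benchmarks.

Abstract (Chinese translation)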

从以高阶张量表示的结构化多路数据中学习，需要在保持计算效率的同时捕捉张量模态间的复杂交互。我们提出不确定性驱动的核张量学习（UKTL），这是一种针对$M$模态张量的新型核框架，通过比较从张量展开导出的模态子空间，实现表达力强且鲁棒的相似性度量。为处理大规模张量数据，我们提出一种可扩展的Nyström核线性化方法，其中动态学习的枢轴张量通过软$k$-均值聚类获得。UKTL的核心创新在于其不确定性感知的子空间加权机制，该机制基于估计的置信度自适应降低不可靠模态分量的权重，从而提升输入张量与枢轴张量比较的鲁棒性和可解释性。我们的框架完全支持端到端训练，并通过结构化核组合自然地融合多路与多模态交互。在动作识别基准数据集（NTU-60、NTU-120、Kinetics-Skeleton）上的大量实验表明，UKTL实现了最先进的性能、优异的泛化能力以及有意义的模态维度分析。本研究为结构化多路与多模态张量序列建立了一种原则化、可扩展且可解释的核学习范式。

Abstract

Learning from structured multi-way data, represented as higher-order tensors, requires capturing complex interactions across tensor modes while remaining computationally efficient. We introduce Uncertainty-driven Kernel Tensor Learning (UKTL), a novel kernel framework for $M$-mode tensors that compares mode-wise subspaces derived from tensor unfoldings, enabling expressive and robust similarity measure. To handle large-scale tensor data, we propose a scalable Nyström kernel linearization with dynamically learned pivot tensors obtained via soft $k$-means clustering. A key innovation of UKTL is its uncertainty-aware subspace weighting, which adaptively down-weights unreliable mode components based on estimated confidence, improving robustness and interpretability in comparisons between input and pivot tensors. Our framework is fully end-to-end trainable and naturally incorporates both multi-way and multi-mode interactions through structured kernel compositions. Extensive evaluations on action recognition benchmarks (NTU-60, NTU-120, Kinetics-Skeleton) show that UKTL achieves state-of-the-art performance, superior generalization, and meaningful mode-wise insights. This work establishes a principled, scalable, and interpretable kernel learning paradigm for structured multi-way and multi-modal tensor sequences.

Keywords: tensor sequences, kernel learning, subspace comparison, Nyström kernel linearization, uncertainty-aware weighting, action recognition, multi-way interactions, interpretability

97. ❌ FDARxBench: Benchmarking Regulatory and Clinical Reasoning on FDA Generic Drug Assessment

Authors: Betty Xiong, Jillian Fisher, Benjamin Newman, Meng Hu, Shivangi Gupta, Yejin Choi, Lanyan Fang, Russ B Altman Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19539v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	7.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	8.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

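Scoring rationale: FDARxBench evaluates LLM question answering over FDA drug label documents, which is applied AI research in biomedical science. The core relevant keywords are 'Large Language Models' (the paper explicitly evaluates LLM behaviour), 'AI for Science' (biomedical application), 'Retrieval-Augmented Generation' (document retrieval and generation), 'Hallucination Mitigation' (it evaluates factuality and safe refusal behaviour), and 'Context Window Extension' (it evaluates long-context retrieval). Reasoning-related keywords (CoT, System 2) are indirectly relevant but not central; the remaining technical keywords (MoE, quantization, training methods, etc.) are not covered.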

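!!! tip deepseek-chat TL;DR

The study builds FDARxBench, a benchmark based on FDA drug labels, to evaluate language models' question answering over clinical and regulatory documents. Experiments show that current models fall substantially short in factual grounding, long-context retrieval, and safe refusal.

Abstract (Chinese translation)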

我们基于美国食品药品监督管理局（FDA）药品说明书文件，以仿制药审评为背景，引入了一项由专家精心构建的真实世界基准，用于评估基于文档的问答系统。药品说明书包含丰富但异构的临床与监管信息，这使得当前语言模型难以实现精准问答。通过与FDA监管审评专家合作，我们提出了FDARxBench，并构建了一个多阶段流水线，用于生成涵盖事实性问答、多跳推理及拒绝应答任务的高质量、经专家审定的问答示例，同时设计了评估方案以检验开卷与闭卷推理能力。在专有模型和开源权重模型上的实验表明，现有模型在事实依据、长上下文检索和安全拒绝行为方面仍存在显著不足。尽管本基准的提出源于FDA仿制药审评需求，但它也为开展具有挑战性的、符合监管标准的标签理解评估奠定了重要基础。该基准旨在支持对大型语言模型在药品说明书问答任务中行为表现的评估。

Abstract

We introduce an expert curated, real-world benchmark for evaluating document-grounded question-answering (QA) motivated by generic drug assessment, using the U.S. Food and Drug Administration (FDA) drug label documents. Drug labels contain rich but heterogeneous clinical and regulatory information, making accurate question answering difficult for current language models. In collaboration with FDA regulatory assessors, we introduce FDARxBench, and construct a multi-stage pipeline for generating high-quality, expert curated, QA examples spanning factual, multi-hop, and refusal tasks, and design evaluation protocols to assess both open-book and closed-book reasoning. Experiments across proprietary and open-weight models reveal substantial gaps in factual grounding, long-context retrieval, and safe refusal behavior. While motivated by FDA generic drug assessment needs, this benchmark also provides a substantial foundation for challenging regulatory-grade evaluation of label comprehension. The benchmark is designed to support evaluation of LLM behavior on drug-label questions.

Keywords: FDA drug labels, document-grounded QA, regulatory reasoning, clinical reasoning, benchmark evaluation, long-context retrieval, factual grounding, safe refusal

98. ❌ Reasoning Gets Harder for LLMs Inside A Dialogue

Authors: Ivan Kartáč, Mateusz Lango, Ondřej Dušek Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20133v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

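Scoring rationale: The paper studies LLM reasoning in task-oriented dialogue (TOD), so it is highly relevant to 'Large Language Models' (10). It covers arithmetic, spatial, and temporal reasoning, which is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10). It mentions tool-use requirements and dialogue interaction, giving partial relevance to 'LLM Agents' and 'Tool Use' (5). Other keywords such as MoE, quantization, and RAG are not covered and receive 0.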

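!!! tip deepseek-chat TL;DR

The study finds that LLM reasoning performance in task-oriented dialogue is markedly lower than in isolated task settings, revealing a gap between benchmark results and real interactive use.

Abstract (Chinese translation)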

大型语言模型（LLM）在许多推理基准测试中展现出强大性能，然而这些评估通常侧重于孤立任务，与面向任务对话（TOD）中的实际应用场景存在差异。在此类场景中，LLM必须在生成文本的同时进行内在推理，并遵循关于角色、格式和风格的指令。这种不匹配引发了人们对基准测试性能是否能准确反映LLM在TOD环境中推理鲁棒性的担忧。我们通过引入BOULDER——一个涵盖八项旅行相关任务的动态新基准——来研究在TOD框架内构建推理任务如何影响LLM性能。该基准要求模型进行兼具常识性与规范性的算术、空间及时间推理。每个问题均提供孤立版本和基于对话的变体，从而在控制数据污染的前提下实现受控比较。对八个LLM的实验表明，在孤立设置与对话设置之间存在显著且一致的性能差距。通过消融实验和定性分析，我们发现这种差距主要由对话的多轮交互特性驱动，同时角色条件设定和工具使用要求也会产生附加影响。我们的研究结果凸显了在现实交互场景中评估LLM推理能力的必要性。

Abstract

Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must perform reasoning inherently while generating text and adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models’ reasoning robustness in TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.

Keywords: Large Language Models, Reasoning, Task-oriented Dialogue, Benchmark Evaluation, Performance Gap, Multi-turn Dialogue, Tool Use, BOULDER Benchmark

99. ❌ Current LLMs still cannot ’talk much’ about grammar modules: Evidence from syntax

Authors: Mohammed Q. Shormani Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20114v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

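Scoring rationale: The paper directly studies how LLMs perform when translating grammar-module terminology, so it is highly relevant to 'Large Language Models' (10). It concerns LLM accuracy and factuality, giving partial relevance to 'Hallucination Mitigation' (5). Analysing how LLMs work falls under interpretability, giving partial relevance to 'Mechanistic Interpretability' (5). Other keywords such as MoE, SFT, and RAG are not covered and receive 0.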

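!!! tip deepseek-chat TL;DR

The study evaluates LLMs on translating grammatical terminology and finds that only 25% of ChatGPT-5's translations were accurate, revealing LLMs' limits in handling grammar modules. It recommends that AI specialists and linguists collaborate to improve the models' working mechanisms.

Abstract (Chinese translation)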

本研究旨在探究大型语言模型（LLM）能在多大程度上对语法模块进行“深入阐释”，通过分析ChatGPT将句法核心属性术语翻译为阿拉伯语的案例提供证据。我们从生成句法学领域的既往著作（包括书籍与期刊论文）及自身研究经验中收集了44个术语，先由人工翻译，再经ChatGPT-5进行翻译，随后对两种译文进行对比分析。研究采用分析与比较相结合的方法。结果显示，大型语言模型仍难以对涉及多重句法与语义挑战的核心句法属性术语进行有效阐释：在ChatGPT的翻译中，仅25%完全准确，38.6%存在错误，36.4%部分正确（此类译文我们认为可视为恰当）。基于上述发现，我们提出一系列可行策略，其中最值得注意的是建议人工智能专家与语言学家紧密协作，以优化大型语言模型的工作机制，从而实现准确或至少恰当的翻译。

Abstract

We aim to examine the extent to which Large Language Models (LLMs) can ’talk much’ about grammar modules, providing evidence from syntax core properties translated by ChatGPT into Arabic. We collected 44 terms from generative syntax previous works, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations. We used an analytical and comparative approach in our analysis. Findings unveil that LLMs still cannot ’talk much’ about the core syntax properties embedded in the terms under study involving several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4% were partially correct, which we consider appropriate. Based on these findings, a set of actionable strategies were proposed, the most notable of which is a close collaboration between AI specialists and linguists to better LLMs’ working mechanism for accurate or at least appropriate translation.

Keywords: Large Language Models, LLMs, ChatGPT, syntax translation, grammar modules, translation accuracy, AI-linguistics collaboration, language model evaluation
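
The reported percentages can be sanity-checked against the 44-term sample. The abstract gives only percentages, so the per-category counts below are inferred by rounding, not reported by the paper; the snippet is a small arithmetic check under that assumption.

```python
total_terms = 44
# Percentages as stated in the abstract.
reported = {"accurate": 25.0, "inaccurate": 38.6, "partially correct": 36.4}

# Inferred counts (an assumption): nearest whole number of terms per category.
counts = {label: round(total_terms * pct / 100) for label, pct in reported.items()}

# Recompute percentages from the inferred counts, to one decimal place.
recomputed = {label: round(100 * n / total_terms, 1) for label, n in counts.items()}

print(counts)
print(recomputed)
```

Under this reading the categories correspond to 11, 17, and 16 terms, which sum to 44 and reproduce the stated percentages to one decimal place.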

Authors: Yu Wang, Olcay Türk, Angela Grimminger, Hendrik Buschmeier Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20079v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

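Scoring rationale: The paper studies how linguistic cues related to cognitive load in dialogue (information value, syntactic complexity, gaze behaviour) relate to the listener's state of understanding, using statistical analysis and a BERT classifier for prediction. All keywords concern the technical principles of large models and deep learning or their scientific applications, whereas this paper belongs to cognitive science and human-computer interaction and does not involve any large-model technique, training method, inference optimization, agent system, or scientific AI application. All keywords therefore receive a relevance of 0.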

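!!! tip deepseek-chat TL;DR

The study predicts the listener's state of understanding in explanatory dialogue (understanding, partial understanding, non-understanding, misunderstanding) from the information value and syntactic complexity of the speaker's utterances and from changes in the listener's gaze behaviour, and shows that adding these linguistic cues improves state classification.

Abstract (Chinese translation)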

本研究探讨对话中说话者与听者所呈现的语言及非语言特征，如何逐时刻预测解释性互动中听者的理解状态。具体而言，我们考察了三种与认知负荷相关且被假设与听者理解程度相关的语言线索：说话者话语的信息价值（以惊异度（surprisal）量化）、句法复杂性，以及听者互动性注视行为的变化。基于对面对面对话式桌游解释语料库MUNDEX的统计分析，我们发现不同线索随听者理解水平的变化而呈现差异。听者的理解状态（“完全理解”、“部分理解”、“未理解”和“误解”）由听者通过回溯性视频回忆法进行自我标注。随后的分类实验结果表明，利用两种现成分类器和一个经过微调的基于德语BERT的多模态分类器，对这四种理解状态进行预测总体上是可行的，且当三种语言线索与文本特征结合时，预测性能得到提升。

摘要 (Abstract)

We investigate how verbal and nonverbal linguistic features, exhibited by speakers and listeners in dialogue, can contribute to predicting the listener’s state of understanding in explanatory interactions on a moment-by-moment basis. Specifically, we examine three linguistic cues related to cognitive load and hypothesised to correlate with listener understanding: the information value (operationalised with surprisal) and syntactic complexity of the speaker’s utterances, and the variation in the listener’s interactive gaze behaviour. Based on statistical analyses of the MUNDEX corpus of face-to-face dialogic board game explanations, we find that individual cues vary with the listener’s level of understanding. Listener states (‘Understanding’, ‘Partial Understanding’, ‘Non-Understanding’ and ‘Misunderstanding’) were self-annotated by the listeners using a retrospective video-recall method. The results of a subsequent classification experiment, involving two off-the-shelf classifiers and a fine-tuned German BERT-based multimodal classifier, demonstrate that prediction of these four states of understanding is generally possible and improves when the three linguistic cues are considered alongside textual features.

Keywords: cognitive load, linguistic cues, listener understanding, dialogue, explanatory interactions, multimodal classification, BERT, MUNDEX corpus
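
The abstract operationalises information value as surprisal. As a minimal sketch of that quantity, the surprisal of a token is the negative log-probability of the token given its context. How the authors estimate token probabilities is not stated in the abstract; the function names `surprisal_bits` and `mean_surprisal` and the per-utterance averaging are illustrative assumptions.

```python
import math


def surprisal_bits(probability: float) -> float:
    """Return the surprisal, in bits, of an event with the given probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def mean_surprisal(token_probabilities: list[float]) -> float:
    """Average per-token surprisal of an utterance (hypothetical aggregation)."""
    total = sum(surprisal_bits(p) for p in token_probabilities)
    return total / len(token_probabilities)
```

A token predicted with probability 0.5 carries 1 bit of surprisal, and one predicted with probability 0.25 carries 2 bits, so less predictable utterances score higher.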

101. ❌ RouterKGQA: Specialized–General Model Routing for Constraint-Aware Knowledge Graph Question Answering

Authors: Bo Yuan, Hexuan Deng, Xuebo Liu, Min Zhang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20017v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	8.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

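Scoring rationale: RouterKGQA proposes a specialized–general model collaboration framework for knowledge graph question answering (KGQA) that mitigates LLM hallucination. The core content involves: 1) using a large general model (LLM) as an agent for KG-guided repair (highly relevant to LLMs and LLM Agents); 2) using a small specialized model to generate reasoning paths (relevant to Small Language Models); 3) grounding answers in structured knowledge through retrieval-augmented generation (RAG) and chain-of-thought (CoT) reasoning; 4) a primary goal of reducing hallucination (highly relevant to Hallucination Mitigation). Other keywords such as MoE, Scaling Laws, training methods, system optimisation, and AI for Science are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes RouterKGQA, a framework in which a specialized model and a general large model collaborate on knowledge graph question answering. It improves answer accuracy while reducing the number of large-model calls, effectively mitigating hallucination.

Abstract (Chinese translation)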

知识图谱问答（KGQA）是一种通过将推理建立在结构化且可验证的知识图谱上来缓解大语言模型幻觉的有效方法。现有方法可分为两种范式：基于检索的方法利用小型专用模型，其效率较高但常产生不可达路径并遗漏隐式约束；而基于智能体的方法利用大型通用模型，能以更强的结构 grounding 能力实现更优效果，但计算成本显著更高。我们提出了 RouterKGQA，一个专用模型与通用模型协同工作的框架。在该框架中，专用模型生成推理路径，而通用模型仅在必要时执行知识图谱引导的修正，从而以最小成本提升性能。我们进一步为专用模型配备了约束感知的答案过滤机制，以减少冗余答案。此外，我们设计了一种更高效的通用智能体工作流程，进一步降低了推理成本。实验结果表明，RouterKGQA 在多个基准测试中平均 F1 分数比先前最佳方法高出 3.57 分，Hits@1 高出 0.49 分，而每个问题平均仅需 1.15 次大语言模型调用。代码与模型已发布于 https://github.com/Oldcircle/RouterKGQA。

摘要 (Abstract)

Knowledge graph question answering (KGQA) is a promising approach for mitigating LLM hallucination by grounding reasoning in structured and verifiable knowledge graphs. Existing approaches fall into two paradigms: retrieval-based methods utilize small specialized models, which are efficient but often produce unreachable paths and miss implicit constraints, while agent-based methods utilize large general models, which achieve stronger structural grounding at substantially higher cost. We propose RouterKGQA, a framework for specialized–general model collaboration, in which a specialized model generates reasoning paths and a general model performs KG-guided repair only when needed, improving performance at minimal cost. We further equip the specialized model with constraint-aware answer filtering, which reduces redundant answers. In addition, we design a more efficient general agent workflow, further lowering inference cost. Experimental results show that RouterKGQA outperforms the previous best by 3.57 points in F1 and 0.49 points in Hits@1 on average across benchmarks, while requiring an average of only 1.15 LLM calls per question. Code and models are available at https://github.com/Oldcircle/RouterKGQA.

Keywords: Knowledge Graph Question Answering, LLM Hallucination Mitigation, Specialized-General Model Collaboration, Retrieval-based Methods, Agent-based Methods, Constraint-aware Answer Filtering, Inference Cost Reduction, KG-guided Repair
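
The routing idea in the abstract (a cheap specialized model first, a general model only when needed) can be illustrated with a toy sketch. This is not the released RouterKGQA code: the knowledge graph, the stub models, and the routing condition (fall back when the proposed path is unreachable in the graph) are all assumptions chosen for illustration.

```python
from typing import Callable

# Toy knowledge graph: (head entity, relation) -> tail entities. Hypothetical data.
KG = {
    ("Paris", "capital_of"): ["France"],
    ("France", "currency"): ["Euro"],
}


def follow_path(kg: dict, start: str, relations: list[str]) -> list[str]:
    """Walk a relation path from `start`; an empty list means the path is unreachable."""
    frontier = [start]
    for relation in relations:
        frontier = [t for e in frontier for t in kg.get((e, relation), [])]
        if not frontier:
            return []
    return frontier


def answer(kg: dict, start: str, specialized: Callable, general: Callable):
    """Try the specialized model first; call the general model only on failure."""
    path = specialized(start)
    answers = follow_path(kg, start, path)
    llm_calls = 0
    if not answers:  # assumed routing condition: the path does not ground in the KG
        path = general(start, path)
        llm_calls += 1
        answers = follow_path(kg, start, path)
    return answers, llm_calls


# Stub models standing in for real ones (hypothetical).
specialized_ok = lambda start: ["capital_of", "currency"]
specialized_bad = lambda start: ["capital_of", "language"]
general_repair = lambda start, bad_path: ["capital_of", "currency"]
```

Counting `llm_calls` per question mirrors the cost metric quoted in the abstract: questions whose first path grounds in the graph cost no general-model call at all.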

102. ❌ ReViSQL: Achieving Human-Level Text-to-SQL

Authors: Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20004v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	10.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

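Scoring rationale: The paper's core topic is the application of large language models to the Text-to-SQL task, improving model performance through better data quality, so it is highly relevant to 'Large Language Models' (10 points). The paper stresses the decisive role of data quality in model performance, making it highly relevant to 'Scaling Laws AND Data Quality' (10 points). Other keywords such as MoE, SLMs, PEFT, and RAG are not mentioned in the abstract or are unrelated to the paper, and each receives 0 points.

!!! tip deepseek-chat TL;DR

Addressing the accuracy gap between AI models and humans on Text-to-SQL, the paper proposes the ReViSQL framework, which improves training data quality and applies inference-time scaling to reach human-level accuracy on the BIRD benchmark for the first time.

Abstract (Chinese translation)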

将自然语言转换为SQL（Text-to-SQL）是数据库研究和数据分析应用中的一个关键挑战。近期研究致力于通过开发大型语言模型和AI智能体来增强SQL推理能力，这些方法将Text-to-SQL任务分解为人工设计的逐步处理流程。然而，尽管进行了大量架构工程方面的努力，仍存在显著差距：即使是最先进的AI智能体，也尚未在BIRD基准测试上达到人类水平的准确率。本文表明，缩小这一差距并不需要更复杂的架构，而是需要干净的训练数据来提升基础模型的SQL推理能力。我们提出了ReViSQL，这是一个简化的框架，首次在BIRD基准上达到了人类水平的准确率。ReViSQL未采用复杂的AI智能体，而是基于我们构建的数据集BIRD-Verified，通过可验证奖励的强化学习（RLVR）进行优化。BIRD-Verified基于BIRD训练集构建，包含2500个经过验证的Text-to-SQL实例。为构建该数据集，我们设计了由SQL专家参与的数据校正与验证流程，在BIRD训练集某一子集61.1%的样本中识别并修正了数据错误。实验表明，在相同的RLVR算法下，仅通过使用BIRD-Verified提升数据质量，单次生成准确率即可提高8.2%至13.9%。为进一步提升性能，ReViSQL通过基于执行的协调机制和多数投票进行推理时扩展。实证中，我们展示了两种模型规模的框架优势：ReViSQL-235B-A22B和ReViSQL-30B-A3B。在专家验证的BIRD Mini-Dev集上，ReViSQL-235B-A22B实现了93.2%的执行准确率，超过了代理人类水平准确率（92.96%），并较先前开源SOTA方法提升9.8%。轻量级的ReViSQL-30B-A3B在每查询成本降低7.5倍的条件下，达到了与先前SOTA相当的性能。

摘要 (Abstract)

Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data to improve SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts the single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework with two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.

Keywords: Text-to-SQL, Large Language Models, Data Quality, Reinforcement Learning, BIRD Benchmark, Human-Level Accuracy, Execution Accuracy, Model Scaling
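
The abstract mentions inference-time scaling through majority voting over executed queries. The sketch below shows only that general idea, assuming candidates are grouped by their execution result on an in-memory SQLite database; it does not reproduce the paper's reconciliation step, and `demo_connection` and `majority_vote_sql` are hypothetical helper names.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database for the example (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    return conn


def majority_vote_sql(conn: sqlite3.Connection, candidates: list[str]):
    """Return one query from the largest group of candidates with equal results."""
    groups: dict[tuple, list[str]] = {}
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # candidates that fail to execute cast no vote
        # Sort row representations so that row order does not split a group.
        key = tuple(sorted(repr(row) for row in rows))
        groups.setdefault(key, []).append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]
```

With candidates `COUNT(*)`, `COUNT(x)`, `SUM(x)`, and one malformed query, the two counting queries agree on the result and outvote the rest.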

103. ❌ An Agentic Approach to Generating XAI-Narratives

Authors: Yifan He, David Martens Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20003v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	8.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

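Scoring rationale: The paper's core topic is a multi-agent framework in which LLMs generate explainable AI (XAI) narratives. It is highly relevant to 'Large Language Models', 'LLM Agents', 'Multi-agent Systems', and 'Mechanistic Interpretability' (10 points each), touches on 'Self-Correction' (8 points), and does not involve the other keywords (0 points).

!!! tip deepseek-chat TL;DR

The paper proposes a multi-agent framework for generating and refining explainable AI (XAI) narratives. Iterative feedback markedly improves narrative faithfulness, with Claude-4.5-Sonnet under the Basic Design reducing unfaithful narratives by 90% after three rounds of iteration.

Abstract (Chinese translation)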

近年来，可解释人工智能（XAI）研究取得了显著进展。然而，现有的XAI方法常被批评为过于技术化且面向专家，这推动了更具可解释性和易于理解的解释方法的发展。为此，基于大语言模型（LLM）生成的XAI叙事被提出作为一种有前景的途径，能够将事后解释转化为更易于理解的自然语言解释。本研究提出了一种用于XAI叙事生成与优化的多智能体框架。该框架包含叙事者（Narrator），其根据多个批评智能体（Critic Agents）在忠实度与连贯性指标上的反馈来生成并修订叙事，从而通过迭代实现叙事改进。我们设计了五种智能体系统（基础设计、批评设计、批评规则设计、连贯设计及连贯规则设计），并在五个表格数据集上系统评估了它们跨五种大语言模型的有效性。结果证实，基础设计、批评设计及批评规则设计能有效提升所有大语言模型生成叙事的忠实度。其中，采用基础设计的Claude-4.5-Sonnet表现最佳，经过三轮迭代后将不忠实叙事的数量减少了90%。针对反复出现的问题，我们进一步引入了基于多数投票的集成策略。该方法持续提升了四种大语言模型的性能，但DeepSeek-V3.2-Exp除外。这些发现凸显了智能体系统在生成忠实且连贯的XAI叙事方面的潜力。

摘要 (Abstract)

Explainable AI (XAI) research has experienced substantial growth in recent years. Existing XAI methods, however, have been criticized for being technical and expert-oriented, motivating the development of more interpretable and accessible explanations. In response, large language model (LLM)-generated XAI narratives have been proposed as a promising approach for translating post-hoc explanations into more accessible, natural-language explanations. In this work, we propose a multi-agent framework for XAI narrative generation and refinement. The framework comprises the Narrator, which generates and revises narratives based on feedback from multiple Critic Agents on faithfulness and coherence metrics, thereby enabling narrative improvement through iteration. We design five agentic systems (Basic Design, Critic Design, Critic-Rule Design, Coherent Design, and Coherent-Rule Design) and systematically evaluate their effectiveness across five LLMs on five tabular datasets. Results validate that the Basic Design, the Critic Design, and the Critic-Rule Design are effective in improving the faithfulness of narratives across all LLMs. Claude-4.5-Sonnet on Basic Design performs best, reducing the number of unfaithful narratives by 90% after three rounds of iteration. To address recurrent issues, we further introduce an ensemble strategy based on majority voting. This approach consistently enhances performance for four LLMs, except for DeepSeek-V3.2-Exp. These findings highlight the potential of agentic systems to produce faithful and coherent XAI narratives.

Keywords: Explainable AI, XAI narratives, large language models, multi-agent framework, faithfulness, coherence, agentic systems, iteration
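
The Narrator and Critic interplay described in the abstract can be pictured as a simple feedback loop. The sketch below is a schematic under stated assumptions, not the paper's implementation: the agents are plain Python callables, the stopping rule (stop when no critic reports an issue) is assumed, and the toy narrator and critic are hypothetical stand-ins for LLM calls.

```python
from typing import Callable

Critic = Callable[[str], list[str]]


def refine_narrative(
    narrate: Callable[[list[str]], str],
    critics: list[Critic],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Revise a narrative until no critic objects or the round budget is spent."""
    feedback: list[str] = []
    narrative = narrate(feedback)
    rounds_used = 0
    for _ in range(max_rounds):
        feedback = [issue for critic in critics for issue in critic(narrative)]
        if not feedback:
            break
        narrative = narrate(feedback)
        rounds_used += 1
    return narrative, rounds_used


def toy_narrator(feedback: list[str]) -> str:
    """Hypothetical narrator: fixes the sign of the age effect when told to."""
    if "wrong sign for age" in feedback:
        return "Higher age raises the predicted risk."
    return "Higher age lowers the predicted risk."


def toy_faithfulness_critic(narrative: str) -> list[str]:
    """Hypothetical critic: flags a narrative that contradicts the explanation."""
    return ["wrong sign for age"] if "lowers" in narrative else []
```

The default of three rounds echoes the three iterations reported in the abstract; in this toy run a single revision is enough.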

104. ❌ When Contextual Inference Fails: Cancelability in Interactive Instruction Following

Authors: Natalia Bila, Kata Naszádi, Alexandra Mayn, Christof Monz Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19997v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

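Scoring rationale: The paper's core topic is how LLMs perform in interactive instruction following, which bears directly on the keywords 'Large Language Models' and 'LLM Agents'. It evaluates several state-of-the-art LLMs on the BWIM benchmark and studies how models resolve ambiguity through contextual inference or clarification requests, which is a study of LLM agent behaviour in an interactive environment. Other keywords such as MoE, SFT, and RAG are not addressed in the paper and are scored 0.

!!! tip deepseek-chat TL;DR

The paper studies how large language models resolve ambiguous instructions through contextual inference in a collaborative block-building task. It finds that although models can detect speaker unreliability, they fail to use that information to guide clarification behaviour and instead adopt suboptimal strategies.

Abstract (Chinese translation)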

本研究探究了在协作积木搭建任务中，字面解读与语境推理的分离问题。在该任务中，搭建者必须借助语境推理来解析不明确的指令。基于现有的双说话者心理语言学范式——该范式对比了语用上合作的说话者与仅字面可靠的说话者——我们引入了“搭建我所指”（Build What I Mean, BWIM），这是一个用于语境意义建构的交互式基准测试。在BWIM中，模型必须通过执行语境推理或以较小的沟通成本请求澄清来消除歧义。通过对多个前沿大语言模型（LLMs）的评估，我们发现其判断与行动之间存在分离：尽管模型在明确的置信度评分中能检测到说话者的不可靠性，却未能利用这一信息来指导高效的澄清行为。相反，我们观察到次优的策略，例如无视伙伴特性的过度澄清，以及在不确定性下回避提问的猜测行为。

摘要 (Abstract)

We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm – which contrasts a pragmatically cooperative speaker with one who is only literally reliable – we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty.

Keywords: Large Language Models, LLM Agents, Interactive Instruction Following, Contextual Inference, Ambiguity Resolution, Clarification Behavior, Benchmark Evaluation, Collaborative Task
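
The trade-off at the heart of BWIM (infer from context, or pay a small cost to ask) can be framed as an expected-loss comparison. The rule below is one plausible way to formalise it, not the benchmark's scoring scheme, which the abstract does not specify; `should_ask` and its parameters are hypothetical.

```python
def should_ask(p_inference_correct: float, error_cost: float, ask_cost: float) -> bool:
    """Ask for clarification when guessing has a higher expected loss than asking."""
    expected_guess_loss = (1.0 - p_inference_correct) * error_cost
    return expected_guess_loss > ask_cost
```

Under this rule a partner-aware builder would ask less often with a pragmatically cooperative speaker (high `p_inference_correct`) and more often with a speaker who is only literally reliable. Partner-blind over-clarification corresponds to ignoring `p_inference_correct` altogether.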

105. ❌ Hybrid topic modelling for computational close reading: Mapping narrative themes in Pushkin’s Evgenij Onegin

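Authors: Angelo Maria Sabatini Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19940v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0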
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

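Scoring rationale: The paper studies a hybrid topic modelling method for computational literary analysis, using LDA and sPLS-DA to analyse narrative themes in poetry; it is an application of traditional machine learning to the humanities. It does not involve large models, deep learning, LLM technical principles, or AI for Science, so every keyword is unrelated to its content.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid topic modelling framework combining LDA and sPLS-DA to analyse the narrative themes and emotional structure of Pushkin's verse novel Evgenij Onegin, offering a reproducible method for computational literary analysis.

Abstract (Chinese translation)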

本研究提出一种用于计算文学分析的混合主题建模框架，该框架将潜在狄利克雷分配（Latent Dirichlet Allocation，简称LDA）与稀疏偏最小二乘判别分析（sparse Partial Least Squares Discriminant Analysis，简称sPLS-DA）相结合，以建模叙事诗歌中的主题结构与历时动态。作为案例研究，我们使用意大利语译本分析了《叶甫盖尼·奥涅金》——亚历山大·S·普希金的诗体小说——旨在检验在小规模语料库设置中无监督与有监督的词汇结构是否能够趋同。该诗歌文本被分割为35个经过词形还原的实词文档，从中提取出五个稳定且可解释的主题。为应对小规模语料库的不稳定性，研究采用了多种子共识协议。利用sPLS-DA作为有监督探针，通过识别能精炼各主题的词汇标记，增强了模型的可解释性。叙事枢纽——即标记关键情节的连续诗节群——将词袋方法延伸至叙事层面，揭示了主题混合如何与诗歌的情感及结构脉络相呼应。本框架并非旨在取代传统文学阐释，而是提供一种计算化的细读形式，展示了即便在格律、音韵或原生形态等文体特征被抽象化的情况下，轻量级概率模型仍能生成复杂诗歌叙事可复现的主题图谱。尽管本研究依赖于单一词形还原译本，但该方法提供了一个透明的方法学模板，可应用于比较研究中其他高密度文学文本的分析。

摘要 (Abstract)

This study presents a hybrid topic modelling framework for computational literary analysis that integrates Latent Dirichlet Allocation (LDA) with sparse Partial Least Squares Discriminant Analysis (sPLS-DA) to model thematic structure and longitudinal dynamics in narrative poetry. As a case study, we analyse Evgenij Onegin, Aleksandr S. Pushkin’s novel in verse, using an Italian translation, testing whether unsupervised and supervised lexical structures converge in a small-corpus setting. The poetic text is segmented into thirty-five documents of lemmatised content words, from which five stable and interpretable topics emerge. To address small-corpus instability, a multi-seed consensus protocol is adopted. Using sPLS-DA as a supervised probe enhances interpretability by identifying lexical markers that refine each theme. Narrative hubs (groups of contiguous stanzas marking key episodes) extend the bag-of-words approach to the narrative level, revealing how thematic mixtures align with the poem’s emotional and structural arc. Rather than replacing traditional literary interpretation, the proposed framework offers a computational form of close reading, illustrating how lightweight probabilistic models can yield reproducible thematic maps of complex poetic narratives, even when stylistic features such as metre, phonology, or native morphology are abstracted away. Despite relying on a single lemmatised translation, the approach provides a transparent methodological template applicable to other high-density literary texts in comparative studies.

Keywords: topic modelling, computational literary analysis, Latent Dirichlet Allocation, sparse Partial Least Squares Discriminant Analysis, narrative poetry, Evgenij Onegin, close reading, thematic structure

106. ❌ Translation from the Information Bottleneck Perspective: an Efficiency Analysis of Spatial Prepositions in Bitexts

Authors: Antoine Taroni, Ludovic Moncla, Frederique Laforest Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19924v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

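Scoring rationale: The paper applies Information Bottleneck theory to translation, analysing the efficiency of spatial prepositions in bitexts; it sits at the intersection of cognitive linguistics, computational linguistics, and information theory. It does not involve large models, deep learning technical principles, or concrete AI-for-Science applications, so every keyword is unrelated to its content.

!!! tip deepseek-chat TL;DR

The paper studies translation from the Information Bottleneck perspective. Analysing how spatial prepositions in a French novel are translated into English, German, and Serbian, it finds that attested translations lie closer to the IB-optimal frontier than counterfactual alternatives, suggesting that human translation is subject to communicative efficiency pressure in the spatial domain.

Abstract (Chinese translation)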

高效沟通需要在编码意义时平衡信息量与简洁性。信息瓶颈（Information Bottleneck，简称IB）框架正式捕捉了这种权衡关系，预测自然语言系统会聚集在最优的准确度-复杂度边界附近。虽然该理论在颜色、运动等视觉领域得到了验证，但针对句子语境中的词汇等语言刺激的研究仍属空白。我们通过将翻译构建为一个IB优化问题来填补这一空白，将源语言句子视为刺激信号，目标语言句子视为压缩后的意义表达。这使得IB分析能够直接在双语平行语料上进行，而无需依赖受控的命名实验。我们将此方法应用于一部法语小说的英语、德语和塞尔维亚语译本中的空间介词研究。为估算信息量，我们开展了一项卡片分类试点研究（N=35），获取了介词对之间的相似度判断。我们训练了一个低秩投影模型（维度D=5）来预测这些判断（斯皮尔曼相关系数：0.78）。实证的介词翻译分布比反事实替代方案更接近IB最优边界，这为人类译者在空间领域受到交际效率压力提供了初步证据。更广泛而言，这项工作表明翻译可作为窥探塑造跨语言语义系统的认知效率压力的窗口。

摘要 (Abstract)

Efficient communication requires balancing informativity and simplicity when encoding meanings. The Information Bottleneck (IB) framework captures this trade-off formally, predicting that natural language systems cluster near an optimal accuracy-complexity frontier. While supported in visual domains such as colour and motion, linguistic stimuli such as words in sentential context remain unexplored. We address this gap by framing translation as an IB optimisation problem, treating source sentences as stimuli and target sentences as compressed meanings. This allows IB analyses to be performed directly on bitexts rather than controlled naming experiments. We applied this to spatial prepositions across English, German and Serbian translations of a French novel. To estimate informativity, we conducted a pile-sorting pilot-study (N=35) and obtained similarity judgements of pairs of prepositions. We trained a low-rank projection model (D=5) that predicts these judgements (Spearman correlation: 0.78). Attested translations of prepositions lie closer to the IB optimal frontier than counterfactual alternatives, offering preliminary evidence that human translators exhibit communicative efficiency pressure in the spatial domain. More broadly, this work suggests that translation can serve as a window into the cognitive efficiency pressures shaping cross-linguistic semantic systems.

Keywords: Information Bottleneck, translation, spatial prepositions, bitexts, communicative efficiency, cross-linguistic semantics, low-rank projection model, cognitive efficiency
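
For readers unfamiliar with the framework, the accuracy-complexity trade-off referred to in the abstract is commonly written as the objective below. This is the generic form of the Information Bottleneck objective, not necessarily the exact formulation used in the paper; the symbols $M$ (source-side stimuli or meanings), $W$ (target-side expressions), $U$ (the content a listener must recover), and the encoder $q(w \mid m)$ are notational assumptions.

$$
\min_{q(w \mid m)} \; \mathcal{F}_\beta[q] \;=\; I_q(M; W) \;-\; \beta \, I_q(W; U)
$$

Here $I_q(M; W)$ measures complexity (how much of the input the encoding retains), $I_q(W; U)$ measures accuracy or informativity, and $\beta$ sets the exchange rate between the two. Sweeping $\beta$ traces out the optimal frontier against which attested and counterfactual translations are compared.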

107. ❌ Overreliance on AI in Information-seeking from Video Content

Authors: Anders Giovanni Møller, Elisa Bassignana, Francesco Pierri, Luca Maria Aiello Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19843v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

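Scoring rationale: The paper explicitly studies the use of LLMs in video information retrieval and its effects (accuracy, efficiency, overreliance), so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It covers the risks of AI inaccuracy and false answers, which relates somewhat to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points), although mitigating hallucination is not its core aim. Other keywords covering MoE, SLMs, training techniques, reasoning methods, compression, and agents are neither mentioned in the abstract nor relevant, and are scored 0.

!!! tip deepseek-chat TL;DR

The study examines how using large language models (LLMs) as AI assistants in video information-seeking tasks affects accuracy, efficiency, and user confidence. AI assistance improves accuracy and efficiency but leads users to over-rely on it: accuracy drops sharply when the AI gives false answers while user confidence stays unchanged, exposing safety risks in AI-mediated video retrieval.

Abstract (Chinese translation)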

多媒体内容的泛在性正在重塑在线信息空间，尤其在社交媒体环境中。与此同时，生成式人工智能正迅速改变搜索模式，大型语言模型（LLMs）作为用户与多媒体内容之间的中介被常规部署，用于检索和总结信息。尽管其影响力日益增强，但LLMs的不准确性和潜在脆弱性对多媒体信息搜索任务的影响仍很大程度上未被探索。本研究探究了生成式人工智能如何影响从视频中检索信息的准确性、效率和用户信心。我们开展了一项涉及约900名参与者的实验，基于8000多项视频信息搜索任务，比较了三种情境下的行为：（1）仅能访问视频；（2）能访问视频并辅以基于LLM的人工智能助手；（3）能访问视频但配备一个旨在提供错误答案的欺骗性AI助手。研究发现，当参与者观看了相关视频片段时，AI辅助可将准确率提高3-7%；若未观看视频，准确率提升幅度达27-35%。在效率方面，短视频处理效率提升10%，长视频提升25%。然而，参与者倾向于过度依赖AI输出，在与欺骗性AI交互时，准确率下降幅度最高达32%。令人警惕的是，参与者在所有三种情境中自我报告的回答信心水平保持稳定。我们的研究结果揭示了AI中介视频信息检索中存在的根本性安全风险。

摘要 (Abstract)

The ubiquity of multimedia content is reshaping online information spaces, particularly in social media environments. At the same time, search is being rapidly transformed by generative AI, with large language models (LLMs) routinely deployed as intermediaries between users and multimedia content to retrieve and summarize information. Despite their growing influence, the impact of LLM inaccuracies and potential vulnerabilities on multimedia information-seeking tasks remains largely unexplored. We investigate how generative AI affects accuracy, efficiency, and confidence in information retrieval from videos. We conduct an experiment with around 900 participants on 8,000+ video-based information-seeking tasks, comparing behavior across three conditions: (1) access to videos only, (2) access to videos with LLM-based AI assistance, and (3) access to videos with a deceiving AI assistant designed to provide false answers. We find that AI assistance increases accuracy by 3-7% when participants viewed the relevant video segment, and by 27-35% when they did not. Efficiency increases by 10% for short videos and 25% for longer ones. However, participants tend to over-rely on AI outputs, resulting in accuracy drops of up to 32% when interacting with the deceiving AI. Alarmingly, self-reported confidence in answers remains stable across all three conditions. Our findings expose fundamental safety risks in AI-mediated video information retrieval.

Keywords: Large Language Models, AI assistance, video information retrieval, overreliance, accuracy, efficiency, confidence, safety risks

108. ❌ Borderless Long Speech Synthesis

Authors: Xingchen Song, Di Wu, Dinghao Zhou, Pengyu Cheng, Hongwu Ding, Yunchao He, Jie Wang, Shengfan Shen, Sixiang Lv, Lichun Fan, Hang Su, Yifeng Wang, Shuai Wang, Meng Meng, Jian Luan Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19798v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

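Scoring rationale: The paper proposes an LLM-agent-based Borderless Long Speech Synthesis framework that explicitly uses an LLM as the front-end controller and integrates Chain-of-Thought reasoning to improve instruction following. It is therefore highly relevant to 'Large Language Models' and 'Chain of Thought' (10 points each) and to 'LLM Agents' (10 points), since the system is designed as a natively agentic architecture. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract or are unrelated to the core content, so they score 0.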
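The score table above shows a relevance of 10.0/10 next to a score of 0.0 because the listed weight is 0.0. The report header states the passing threshold (5.0 + 0.8 × total weight) and a per-keyword cap of 15, but not the per-keyword formula. The following is a minimal sketch, assuming each keyword contributes weight × (relevance / 10) × cap; that product form and the function names are assumptions made for illustration.

```python
MAX_PER_KEYWORD = 15.0  # per-keyword cap stated in the report header


def passing_score(total_weight: float) -> float:
    """Passing threshold from the report header: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum per-keyword scores for (weight, relevance_out_of_10) rows.

    ASSUMPTION: score = weight * (relevance / 10) * cap. The report does not
    state this formula; it is only one form consistent with the cap of 15.
    """
    return sum(w * (r / 10.0) * MAX_PER_KEYWORD for w, r in rows)


# With 27 keywords of weight 1.0 the threshold is 26.6, as in the header.
threshold = passing_score(27.0)

# Rows with weight 0.0 contribute nothing, whatever their relevance.
zero_weight_score = paper_score([(0.0, 10.0), (0.0, 10.0), (0.0, 10.0)])
```

Under this assumption, any row with weight 0.0 scores 0.0, which matches every row in the table regardless of relevance.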

!!! tip deepseek-chat TL;DR

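This work addresses the lack of global context understanding and paralinguistic cues in conventional text-to-speech systems. It introduces an LLM-agent-based Borderless Long Speech Synthesis framework that combines Chain-of-Thought reasoning with a hierarchical annotation strategy, converting multimodal inputs into structured generation commands and markedly improving instruction following under complex conditions.

Abstract translation (Chinese)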

现有的大多数文本转语音系统要么逐句合成语音后拼接结果，要么仅基于纯文本对话驱动合成。这两种方法都使模型难以理解全局语境或副语言线索，从而无法有效捕捉现实世界中的多说话人交互（如打断、重叠语音）、动态情感演进以及多样化声学环境等现象。我们提出了以智能体为中心的无边界长语音合成框架——Borderless Long Speech Synthesis。该系统并非针对单一狭窄任务，而是设计为一套统一的能力集合，涵盖语音设计器、多说话人合成、指令型文本转语音以及长文本合成。在数据层面，我们提出“标注优先于过滤/清洗”策略，并设计了一种自上而下的多层次标注体系，称为“全局-句子-词元”三级标注架构。在模型层面，我们采用配备连续分词器的骨干网络，并引入思维链推理机制与维度随机丢弃技术，两者均显著提升了复杂条件下的指令跟随能力。我们进一步证明该系统具有原生智能体特性：分层标注体系同时充当大语言模型智能体与合成引擎间的结构化语义接口，构建出从场景语义延伸至语音细节的分层控制协议栈。文本由此成为信息完备的宽带控制通道，使前端大语言模型能够将任意模态的输入转化为结构化生成指令，从而将范式从文本转语音拓展至无边界长语音合成。

Abstract

Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a “Labeling over filtering/cleaning” strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.

Keywords: Borderless Long Speech Synthesis, LLM Agent, Chain-of-Thought reasoning, multi-speaker synthesis, Instruct TTS, hierarchical annotation, structured semantic interface, text-to-speech

109. ❌ Neither Here Nor There: Cross-Lingual Representation Dynamics of Code-Mixed Text in Multilingual Encoders

Authors: Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Jasabanta Patro Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19771v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	8.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

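Scoring rationale: The paper studies how multilingual encoders internally represent code-mixed text, mainly through analysis of pretrained models and fine-tuning methods. It is highly relevant to: 1) 'Post-training OR Supervised Fine-tuning OR SFT' (10 points): the paper proposes and implements a trilingual post-training alignment objective; 2) 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8 points): it studies the effect of continued pre-training on code-mixed data; 3) 'Instruction Tuning OR Alignment OR Value Alignment' (8 points): it proposes an alignment objective to improve representation alignment; 4) 'Mechanistic Interpretability OR Explainable AI' (8 points): it uses CKA, saliency, and entropy for interpretability analysis. Other keywords such as LLMs, MoE, and RAG are unrelated to the paper and score 0.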

!!! tip deepseek-chat TL;DR

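This paper studies how multilingual encoders internally represent Hindi-English code-mixed text. It finds that standard models' representations of code-mixed inputs are only loosely connected to the constituent languages, and proposes a trilingual post-training alignment objective that yields more balanced cross-lingual alignment and improves downstream sentiment analysis and hate speech detection.

Abstract translation (Chinese)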

基于多语言编码器的语言模型已被广泛用于语码混合分析任务，然而我们对其内部如何表征语码混合输入——或这些表征是否与混合的组成语言存在有意义的关联——却知之甚少。以印地语-英语为案例，我们构建了一个统一的平行三语语料库，包含英语、印地语（天城文）及罗马化语码混合句子，并通过中心核对齐分析（CKA）、词元级显著性分析和基于熵的不确定性分析，探究了标准多语言编码器及其语码混合适应变体之间的跨语言表征对齐情况。研究发现，虽然标准模型能较好地对齐英语和印地语，但语码混合输入与任一种语言的关联仍较为松散；而在语码混合数据上继续预训练虽能改善英语与语码混合之间的对齐，却会以牺牲英语-印地语对齐为代价。可解释性分析进一步揭示了一种明显的不对称性：模型通过一个以英语为主导的语义子空间处理语码混合文本，而原生文字的印地语则提供互补信号以降低表征不确定性。基于这些发现，我们提出了一种三语后训练对齐目标，使语码混合表征同时更接近两种组成语言，从而实现了更均衡的跨语言对齐，并在情感分析和仇恨言论检测下游任务中取得性能提升——这表明将语码混合表征锚定于其组成语言能有效促进跨语言理解。

Abstract

Multilingual encoder-based language models are widely adopted for code-mixed analysis tasks, yet we know surprisingly little about how they represent code-mixed inputs internally - or whether those representations meaningfully connect to the constituent languages being mixed. Using Hindi-English as a case study, we construct a unified trilingual corpus of parallel English, Hindi (Devanagari), and Romanized code-mixed sentences, and probe cross-lingual representation alignment across standard multilingual encoders and their code-mixed adapted variants via CKA, token-level saliency, and entropy-based uncertainty analysis. We find that while standard models align English and Hindi well, code-mixed inputs remain loosely connected to either language - and that continued pre-training on code-mixed data improves English-code-mixed alignment at the cost of English-Hindi alignment. Interpretability analyses further reveal a clear asymmetry: models process code-mixed text through an English-dominant semantic subspace, while native-script Hindi provides complementary signals that reduce representational uncertainty. Motivated by these findings, we introduce a trilingual post-training alignment objective that brings code-mixed representations closer to both constituent languages simultaneously, yielding more balanced cross-lingual alignment and downstream gains on sentiment analysis and hate speech detection - showing that grounding code-mixed representations in their constituent languages meaningfully helps cross-lingual understanding.
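The abstract names CKA as one of its probes for representation alignment. As a rough illustration, the sketch below computes linear CKA between two sets of sentence representations in pure Python. The abstract does not say which CKA variant the authors use, so the linear form, the function names, and the toy inputs are assumptions.

```python
import math


def center(rows: list[list[float]]) -> list[list[float]]:
    """Subtract the per-feature mean so each column has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def cross_frobenius_sq(a: list[list[float]], b: list[list[float]]) -> float:
    """Return ||A^T B||_F^2 for two matrices with the same number of rows."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total


def linear_cka(x: list[list[float]], y: list[list[float]]) -> float:
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    Rows are examples (e.g. parallel sentences), columns are features.
    Undefined (division by zero) if either input is constant across rows.
    """
    xc, yc = center(x), center(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Toy representations of three parallel sentences (hypothetical values).
english = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
rotated = [[-2.0, 1.0], [-4.0, 3.0], [-7.0, 5.0]]  # `english` rotated by 90 degrees
similarity = linear_cka(english, rotated)
```

Linear CKA is invariant to rotation and isotropic scaling of either representation, which is why the rotated copy still scores approximately 1.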

Keywords: multilingual encoders, code-mixed text, cross-lingual representation alignment, post-training alignment, Hindi-English, interpretability analysis, sentiment analysis, hate speech detection

110. ❌ Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking

Authors: Tomas Ruiz, Tanalp Agustoslu, Carsten Schwemmer Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19744v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

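Scoring rationale: The paper studies human label variation in benchmarking multimodal large language models (MLLMs), with a core focus on LLM evaluation methodology, so it is highly relevant only to 'Large Language Models OR LLMs OR Foundation Models' (10 points). It does not cover the specific techniques behind the other keywords (e.g., MoE, quantization, inference acceleration) or application domains (e.g., bioinformatics), so all other keywords score 0.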

!!! tip deepseek-chat TL;DR

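This paper studies how human label variation affects benchmarking of multimodal large language models. It finds that evaluation based solely on consensus labels can overstate model capabilities, whereas accounting for label disagreement gives a more realistic and robust assessment.

Abstract translation (Chinese)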

人类标注差异（Human Label Variation, HLV），即标注者判断之间的系统性差异，尽管大语言模型（LLM）发展迅速，但在基准测试中仍未得到充分探索。为填补这一空白，我们提出了一种用于多模态大语言模型（MLLM）基准测试的评估方案，该方案明确考虑了两种条件：（1）人类标注一致性；（2）标注分歧。我们基于一个社交媒体内容分类数据集的非聚合人工标注数据，将此方案应用于两个先进的多模态大语言模型系列（Gemma 3、Qwen 2.5 VL）。跨任务分析发现，在标注一致性高的数据子集上，较大模型往往表现最佳；但在人类标注分歧较高时，其表现常逊于中等规模模型，这表明仅靠参数量并不能决定模型对模糊性和主观性的敏感度。这些结果表明，仅基于共识标签的基准测试可能会高估模型在此类领域的能力，而纳入人类标注差异能为内容审核流程中的多模态大语言模型提供更现实、更稳健的评估。

Abstract

Human Label Variation (HLV), i.e. systematic differences among annotators’ judgments, remains underexplored in benchmarks despite rapid progress in large language model (LLM) development. We address this gap by introducing an evaluation protocol for multimodal large language model (MLLM) benchmarking that explicitly accounts for two conditions: (1) human label agreement and (2) disagreement. We apply this protocol to two state-of-the-art MLLM families (Gemma 3, Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset. Across tasks, we find that larger models tend to perform best on high-agreement subsets, yet often underperform medium-sized models when human disagreement is high, indicating that parameter count alone does not determine sensitivity to ambiguity and subjectivity. These results show that benchmarks based solely on consensus labels can overstate model capabilities in such domains and that incorporating human label variation yields more realistic and robust assessments of MLLMs in content moderation pipelines.
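The protocol evaluates models separately on items where annotators agree and items where they disagree. A minimal sketch of such a split over non-aggregated annotations follows. The agreement measure (share of annotators choosing the majority label), the threshold, and the item names are assumptions; the abstract does not specify how the subsets are defined.

```python
from collections import Counter


def agreement_ratio(labels: list[str]) -> float:
    """Share of annotators who chose the most common label for one item."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def split_by_agreement(
    annotations: dict[str, list[str]], threshold: float = 1.0
) -> tuple[dict[str, list[str]], dict[str, list[str]]]:
    """Split items into (high-agreement, disagreement) subsets.

    ASSUMPTION: an item counts as high-agreement when its agreement ratio
    reaches `threshold`; 1.0 means all annotators gave the same label.
    """
    agree, disagree = {}, {}
    for item_id, labels in annotations.items():
        target = agree if agreement_ratio(labels) >= threshold else disagree
        target[item_id] = labels
    return agree, disagree


# Hypothetical per-annotator labels for three social media posts.
raw = {
    "post_1": ["hate", "hate", "hate"],
    "post_2": ["hate", "neutral", "hate"],
    "post_3": ["neutral", "neutral", "neutral"],
}
high_agreement, high_disagreement = split_by_agreement(raw)
```

Model accuracy can then be reported per subset instead of once against a single consensus label.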

Keywords: Human Label Variation, Multimodal Large Language Models, Benchmarking, Evaluation Protocol, Content Moderation, Model Performance, Ambiguity Sensitivity, Social Media Classification

111. ❌ Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation

Authors: Lasse Marten Jantsch, Dong-Jae Koh, Seonghyeon Lee, Young-Kyoon Suh Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19742v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

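Scoring rationale: The paper focuses on interpretability of large language models (LLMs) and proposes the Dual Path Attribution (DPA) framework to efficiently trace information flow inside SwiGLU Transformers, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Mechanistic Interpretability OR Explainable AI' (10 points each). Its core is understanding internal model mechanisms and attribution methods; it does not involve other keywords such as MoE, training methods, inference acceleration, agent systems, or AI-for-science applications, so those score 0.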

!!! tip deepseek-chat TL;DR

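To address the difficulty of understanding the internal mechanisms of Transformer-based LLMs, this paper proposes Dual Path Attribution (DPA), an attribution framework that traces information flow efficiently and faithfully in one forward and one backward pass, reporting state-of-the-art faithfulness and unprecedented efficiency on standard interpretability benchmarks.

Abstract translation (Chinese)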

理解基于Transformer架构的大语言模型（LLM）的内部机制，对于其可靠部署与高效运行至关重要。尽管近期研究已催生出大量试图在忠实性与计算效率之间取得平衡的归因方法，但密集组件归因的计算成本仍然过高。本文提出了一种新颖的框架——双路径归因（Dual Path Attribution, DPA），该框架能够在无需反事实示例的情况下，通过一次前向传播和一次反向传播，忠实追踪冻结Transformer模型中的信息流动。DPA将SwiGLU Transformer的计算结构解析分解并线性化为不同的路径，并沿着这些路径传播一个目标解嵌入向量，以获取每个残差位置的有效表征。这种以目标为中心的传播方式实现了与模型组件数量无关的O(1)时间复杂度，可扩展至长输入序列和密集组件归因任务。在标准可解释性基准测试上的大量实验表明，与现有基线方法相比，DPA在忠实性方面达到了最先进的水平，并实现了前所未有的效率。

Abstract

Understanding the internal mechanisms of transformer-based large language models (LLMs) is crucial for their reliable deployment and effective operation. While recent efforts have yielded a plethora of attribution methods attempting to balance faithfulness and computational efficiency, dense component attribution remains prohibitively expensive. In this work, we introduce Dual Path Attribution (DPA), a novel framework that faithfully traces information flow on the frozen transformer in one forward and one backward pass without requiring counterfactual examples. DPA analytically decomposes and linearizes the computational structure of the SwiGLU Transformers into distinct pathways along which it propagates a targeted unembedding vector to receive the effective representation at each residual position. This target-centric propagation achieves O(1) time complexity with respect to the number of model components, scaling to long input sequences and dense component attribution. Extensive experiments on standard interpretability benchmarks demonstrate that DPA achieves state-of-the-art faithfulness and unprecedented efficiency compared to existing baselines.
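For readers unfamiliar with the SwiGLU block that DPA linearizes, the sketch below shows only the standard SwiGLU hidden activation: a SiLU-gated branch multiplied elementwise by a linear branch. It is not the DPA method itself. The function names and weights are illustrative, and the down-projection that usually follows is omitted.

```python
import math


def silu(x: float) -> float:
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def dot(u: list[float], v: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def swiglu_hidden(
    x: list[float], w_gate: list[list[float]], w_up: list[list[float]]
) -> list[float]:
    """Hidden activation of a SwiGLU feed-forward block.

    Each hidden unit multiplies a SiLU-gated projection of x by a plain
    linear projection of x. Row i of w_gate / w_up holds unit i's weights.
    """
    return [silu(dot(x, g)) * dot(x, u) for g, u in zip(w_gate, w_up)]


# One hidden unit with hypothetical weights: silu(2.0) * 6.0.
hidden = swiglu_hidden([2.0], w_gate=[[1.0]], w_up=[[3.0]])
```

The product of the two branches makes the block nonlinear in its input, which is why an attribution method has to linearize it before propagating a target vector through it.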

Keywords: Large Language Models, Transformer, Interpretability, Attribution Methods, SwiGLU, Information Flow, Faithfulness, Computational Efficiency

112. ❌ FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment

Authors: Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu, Qinghua Hu Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19741v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	10.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	10.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

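Scoring rationale: The paper's core topic is LLM alignment. It proposes the FedPDPO framework, which directly involves LLMs, alignment, DPO, and LoRA; these are the paper's core techniques and methods, so each scores 10. Other keywords such as MoE, SLMs, and RAG are not mentioned in the paper and score 0.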

!!! tip deepseek-chat TL;DR

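To address the challenge of aligning LLMs under non-IID data in federated learning, this paper proposes the FedPDPO framework, which uses personalized DPO training and LoRA adapters and reports state-of-the-art performance on multiple preference datasets, with average accuracy improvements of up to 4.80%.

Abstract translation (Chinese)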

在联邦学习（FL）环境中，由于偏好数据具有去中心化、隐私敏感且高度非独立同分布（non-IID）的特性，将大语言模型（LLMs）与人类偏好对齐面临挑战。直接偏好优化（Direct Preference Optimization, DPO）为基于人类反馈的强化学习（RLHF）提供了一种高效替代方案，但其在联邦学习中直接应用时，会在非IID数据下出现严重的性能下降，且其隐式奖励的泛化能力有限。为弥补这一差距，我们提出了FedPDPO（联邦个性化直接偏好优化），一个用于大语言模型偏好对齐的个性化联邦学习框架。该框架采用参数高效微调架构，其中每个客户端维护一个冻结的预训练大语言模型主干，并辅以低秩适配（Low-Rank Adaptation, LoRA）适配器，从而实现通信高效的聚合。为应对非IID异质性，我们设计了（1）全局共享的LoRA适配器与个性化的客户端特定大语言模型头部相结合的结构。此外，我们引入了（2）一种个性化的DPO训练策略，该策略配备客户端特定的显式奖励头部，以补充隐式奖励并进一步缓解非IID异质性；以及（3）一个瓶颈适配器，以平衡全局与局部特征。我们提供了理论分析，确立了其概率基础与合理性。在多个偏好数据集上的大量实验证明了其最先进的性能，在联邦域内和跨域设置中实现了高达4.80%的平均准确率提升。

Abstract

Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning with human feedback (RLHF), but its direct application in FL suffers from severe performance degradation under non-IID data and limited generalization of implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture where each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) the globally shared LoRA adapter with the personalized client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide theoretical analysis establishing the probabilistic foundation and soundness. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, achieving up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
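The "implicit rewards" mentioned above come from the standard DPO objective, which scores a preference pair by how much more the policy favors the chosen response than a frozen reference model does. The sketch below implements that standard per-pair loss only. It does not include FedPDPO's client-specific explicit reward head or federated aggregation, and the argument names and default `beta` are illustrative.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))),
    where each term is the summed log-probability of a full response.
    """
    chosen_reward = policy_chosen_logp - ref_chosen_logp        # implicit reward, chosen
    rejected_reward = policy_rejected_logp - ref_rejected_logp  # implicit reward, rejected
    margin = beta * (chosen_reward - rejected_reward)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a margin of 0 and a loss of log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the chosen response's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

In a federated setting each client would compute this loss on its own preference pairs, which is where non-IID preference data becomes a problem.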

Keywords: Federated Learning, Large Language Models, Direct Preference Optimization, Personalization, LoRA, Non-IID Data, Alignment, Parameter-efficient Fine-tuning

113. ❌ LoopRPT: Reinforcement Pre-Training for Looped Language Models

Authors: Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19714v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

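Scoring rationale: The paper proposes LoopRPT, a reinforcement pre-training framework for looped language models (LoopLMs). It centrally involves large models (LLMs), pre-training, reinforcement learning (RLHF/RL-related), chain-of-thought reasoning (CoT), and in-depth reasoning (System 2 Thinking), so these keywords receive high scores (10 points). Other keywords such as MoE, quantization, and RAG are not covered and score 0.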

!!! tip deepseek-chat TL;DR

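To address the mismatch between reinforcement learning and the implicit reasoning structure of looped language models (LoopLMs), the paper proposes the LoopRPT reinforcement pre-training framework. By reframing next-token prediction as a next-token reasoning task, it assigns reinforcement signals directly to latent steps, improving representation quality and achieving Pareto dominance in the accuracy-computation trade-off.

Abstract translation (Chinese)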

循环语言模型通过迭代潜在计算来优化内部表征，为显式思维链推理提供了一种有前景的替代方案。然而，现有的强化学习范式主要针对输出词元，与循环架构中推理隐式展开的特性存在结构错配。本研究提出LoopRPT——一个专为循环语言模型设计的强化预训练框架。通过将下一词元预测重构为下一词元推理任务，LoopRPT采用指数移动平均教师参考与带噪潜在展开轨迹，直接将强化信号分配至潜在计算步骤。该框架使强化学习能够直接塑造中间表征，将有效推理压缩至更少的迭代次数。我们在不同规模Ouro架构上实例化了LoopRPT。实验结果表明，LoopRPT持续提升每步表征质量，在准确率与计算成本的权衡中实现帕累托最优。值得注意的是，模型在困难词元上的显著提升表明，LoopRPT增强了早期阶段的推理能力，而非仅仅促使模型提前退出。我们的研究结果凸显了强化预训练作为学习循环语言模型中高效潜在推理的一种原则性范式。

Abstract

Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
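The abstract mentions an EMA teacher reference without further detail. The sketch below shows only the generic exponential-moving-average update that such a teacher typically follows. How LoopRPT uses the teacher to score latent steps is not described in the abstract, so the function name, the flat parameter list, and the decay value are assumptions.

```python
def ema_update(
    teacher: list[float], student: list[float], decay: float = 0.99
) -> list[float]:
    """Return new teacher parameters: decay * teacher + (1 - decay) * student.

    A high decay makes the teacher a slowly moving average of the student,
    giving a more stable reference than the student's latest weights.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Hypothetical two-parameter model: the teacher drifts toward the student.
teacher_params = [1.0, 0.0]
student_params = [0.0, 1.0]
for _ in range(3):
    teacher_params = ema_update(teacher_params, student_params, decay=0.9)
```

After three updates with a decay of 0.9, the teacher has moved only about 27% of the way toward the student.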

Keywords: LoopRPT, reinforcement pre-training, looped language models, latent reasoning, Chain-of-Thought, representation quality, Ouro architecture, accuracy-computation trade-offs

114. ❌ TAB-AUDIT: Detecting AI-Fabricated Scientific Tables via Multi-View Likelihood Mismatch

Authors: Shuo Huang, Yan Pen, Lizhen Qu Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19712v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

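Scoring rationale: The paper studies methods for detecting AI-generated scientific tables, an application of large models in the scientific domain (AI for Science), so it is related to 'AI for Science OR Bioinformatics OR Cheminformatics' (5 points). Detecting AI-generated content has some connection to 'Hallucination Mitigation OR Factuality OR Truthfulness' (5 points). The paper mentions AI-generated fabricated scientific manuscripts, implying content generated by large models, so it is indirectly related to 'Large Language Models OR LLMs OR Foundation Models' (5 points). The other keywords mainly concern model internals, training methods, and inference optimization, none of which the paper covers, so they score 0.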

!!! tip deepseek-chat TL;DR

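This paper proposes the TAB-AUDIT framework, which detects AI-fabricated scientific tables via multi-view likelihood mismatch, achieving 0.987 AUROC in-domain on the FabTab benchmark and offering a new approach to identifying AI-generated academic fraud.

Abstract translation (Chinese)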

人工智能生成的伪造科学手稿因大规模破坏学术诚信而引发日益增长的担忧。本研究首次系统性地探讨了在实证自然语言处理（NLP）论文中检测人工智能生成的伪造科学表格的方法，因为表格中的信息是支撑学术主张的关键证据。我们构建了首个包含表格的伪造手稿基准数据集FabTab，该数据集包含实证NLP领域的1,173篇人工智能生成论文和1,215篇人类撰写论文。通过全面分析，我们识别了伪造表格与真实表格之间的系统性差异，并将这些差异在TAB-AUDIT框架中转化为一系列可区分的特征。其中关键特征——表内不一致性，捕捉了表格框架与其数值内容之间的困惑度差距。实验结果表明，基于这些特征构建的随机森林模型显著优于现有最先进方法，在领域内达到0.987 AUROC，在领域外达到0.883 AUROC。我们的研究结果表明，实验表格可作为检测人工智能生成科学欺诈的关键取证信号，并为未来研究提供了新的基准。

Abstract

AI-generated fabricated scientific manuscripts raise growing concerns with large-scale breaches of academic integrity. In this work, we present the first systematic study on detecting AI-generated fabricated scientific tables in empirical NLP papers, as information in tables serves as critical evidence for claims. We construct FabTab, the first benchmark dataset of fabricated manuscripts with tables, comprising 1,173 AI-generated papers and 1,215 human-authored ones in empirical NLP. Through a comprehensive analysis, we identify systematic differences between fabricated and real tables and operationalize them into a set of discriminative features within the TAB-AUDIT framework. The key feature, within-table mismatch, captures the perplexity gap between a table’s skeleton and its numerical content. Experimental results show that RandomForest built on these features significantly outperforms prior state-of-the-art methods, achieving 0.987 AUROC in-domain and 0.883 AUROC out-of-domain. Our findings highlight experimental tables as a critical forensic signal for detecting AI-generated scientific fraud and provide a new benchmark for future research.
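The key feature described above is a perplexity gap between a table's skeleton and its numerical content. A minimal sketch of that idea follows, assuming per-token log-probabilities from some language model are already available. Taking the gap as a plain difference (numeric minus skeleton), and the function names, are assumptions; the abstract does not give the exact definition.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def within_table_mismatch(
    skeleton_logprobs: list[float], numeric_logprobs: list[float]
) -> float:
    """Perplexity gap between a table's numeric cells and its skeleton.

    ASSUMPTION: the gap is a simple difference; the paper may instead use a
    ratio or a normalized variant.
    """
    return perplexity(numeric_logprobs) - perplexity(skeleton_logprobs)


# Hypothetical log-probs: headers predicted at p = 0.5, numbers at p = 0.25.
skeleton = [math.log(0.5)] * 4
numbers = [math.log(0.25)] * 6
gap = within_table_mismatch(skeleton, numbers)
```

A feature like this would be one input column to the random forest classifier, alongside the other discriminative features the paper derives.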

Keywords: AI-generated tables, scientific fraud detection, multi-view likelihood mismatch, forensic analysis, NLP papers, benchmark dataset, RandomForest classifier, academic integrity

Authors: Yiyang Li, Tianyi Ma, Yanfang Ye Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19711v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

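Scoring rationale: The paper proposes the EvoTaxo framework, which explicitly uses an LLM to convert social media posts into structured drafts. This is an application-level innovation of large models in text processing, so it is highly relevant to 'Large Language Models' (8 points). Building taxonomies from social media can be viewed as an application of AI in social or information science, so it is somewhat related to 'AI for Science' (5 points). The paper does not address the technical principles or applications behind the other keywords, such as MoE, training methods, inference optimization, or agent systems, so those keywords score 0.

!!! tip deepseek-chat TL;DR

To address the challenge of building and evolving taxonomies from dynamic social media streams, the paper proposes EvoTaxo, an LLM-based framework that converts posts into structured drafts, applies dual-view clustering, and arbitrates edits. Experiments show that it yields more balanced taxonomies and captures meaningful temporal shifts in discourse.

Abstract translation (Chinese)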

从社交媒体语料库中构建分类体系具有挑战性，因为帖子内容短小、噪声多、语义纠缠且具有时间动态性。现有的分类体系归纳方法大多为静态语料库设计，往往难以在鲁棒性、可扩展性以及对动态演变的讨论内容的敏感性之间取得平衡。我们提出了EvoTaxo，这是一个基于大语言模型（LLM）的框架，用于从按时间顺序排列的社交媒体流中构建并演化分类体系。EvoTaxo不直接对原始帖子进行聚类，而是将每个帖子转化为针对当前分类体系的结构化草稿操作，在时间窗口内累积结构证据，并通过结合语义相似性与时间局部性的双视图聚类来整合候选编辑操作。随后，一个优化与仲裁程序在执行前筛选出可靠的编辑，同时每个节点维护一个概念记忆库，以长期保持语义边界。在两个Reddit语料库上的实验表明，与基线方法相比，EvoTaxo能生成更平衡的分类体系，具有更清晰的帖子到叶子节点的分配关系，在可比分类体系规模下获得更好的语料覆盖率，以及更强的结构质量。在Reddit社区/r/ICE_Raids上的案例研究进一步表明，EvoTaxo能够捕捉到讨论中有意义的时间性演变。我们的代码库已公开。

Abstract

Constructing taxonomies from social media corpora is challenging because posts are short, noisy, semantically entangled, and temporally dynamic. Existing taxonomy induction methods are largely designed for static corpora and often struggle to balance robustness, scalability, and sensitivity to evolving discourse. We propose EvoTaxo, an LLM-based framework for building and evolving taxonomies from temporally ordered social media streams. Rather than clustering raw posts directly, EvoTaxo converts each post into a structured draft action over the current taxonomy, accumulates structural evidence over time windows, and consolidates candidate edits through dual-view clustering that combines semantic similarity with temporal locality. A refinement-and-arbitration procedure then selects reliable edits before execution, while each node maintains a concept memory bank to preserve semantic boundaries over time. Experiments on two Reddit corpora show that EvoTaxo produces more balanced taxonomies than baselines, with clearer post-to-leaf assignment, better corpus coverage at comparable taxonomy size, and stronger structural quality. A case study on the Reddit community /r/ICE_Raids further shows that EvoTaxo captures meaningful temporal shifts in discourse. Our codebase is available here.

Keywords: Taxonomy Induction, Social Media Streams, LLM-based Framework, Temporal Dynamics, Dual-view Clustering, Concept Memory Bank, Reddit Corpora, Discourse Evolution
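
The abstract describes dual-view clustering as combining semantic similarity with temporal locality, without giving a formula. One way to picture it is a pairwise affinity that blends the two views. In the sketch below, the linear blend, the exponential time decay, and the parameters `alpha` and `tau_days` are assumptions for illustration, not EvoTaxo's actual method.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dual_view_affinity(edit_a, edit_b, alpha=0.5, tau_days=7.0):
    """Blend a semantic view with a temporal view (hypothetical weighting)."""
    semantic = cosine(edit_a["embedding"], edit_b["embedding"])
    temporal = math.exp(-abs(edit_a["day"] - edit_b["day"]) / tau_days)
    return alpha * semantic + (1 - alpha) * temporal

# Toy candidate edits: an embedding plus the day the supporting post appeared.
same_topic_same_week = {"embedding": [1.0, 0.0], "day": 0}
duplicate = {"embedding": [1.0, 0.0], "day": 0}
other_topic_later = {"embedding": [0.0, 1.0], "day": 70}

near = dual_view_affinity(same_topic_same_week, duplicate)
far = dual_view_affinity(same_topic_same_week, other_topic_later)
print(near, far)
```

Any standard clustering routine that accepts a precomputed affinity matrix could then group candidate edits that are close in both meaning and time.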

116. ❌ Structured Prompting for Arabic Essay Proficiency: A Trait-Centric Evaluation Approach

Authors: Salim Al Mandhari, Hieu Pham Dinh, Mo El-Haj, Paul Rayson Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19668v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	5.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	5.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	5.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

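Scoring rationale: The paper's core is using large language models (LLMs) for Arabic automatic essay scoring (AES), so it is highly relevant to 'Large Language Models' (10 points). It uses zero-shot and few-shot configurations, which relates to 'In-context Learning' (5 points). Its hybrid approach simulates multi-agent evaluation, which relates to 'LLM Agents' and 'Multi-agent Systems' (5 points each). The paper mentions 'enhance model alignment', giving it some connection to 'Instruction Tuning OR Alignment' (5 points). The other keywords, such as MoE, SLMs, Scaling Laws, Pre-training, RLHF, RAG, and Quantization, are not addressed in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes an Arabic automatic essay scoring framework based on structured prompt engineering and evaluates eight large language models under zero-shot and few-shot configurations. It finds that structured prompting, rather than model scale alone, effectively improves Arabic essay scoring, with Fanar-1-9B-Instruct performing best.

Abstract translation (Chinese)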

本文提出了一种新颖的提示工程框架，用于阿拉伯语特质导向的自动作文评分（Automatic Essay Scoring, AES），该框架在零样本和少样本配置下利用大语言模型（Large Language Models, LLMs）。针对阿拉伯语领域缺乏可扩展且具备语言学洞察的AES工具的问题，我们引入了一种三层提示策略（标准、混合和评分标准引导），以指导LLMs评估不同的语言能力特质，如组织结构、词汇、内容展开和文体风格。混合方法通过模拟具有特质专家评分员的多智能体评估来实现，而评分标准引导方法则通过引入已评分的范例来增强模型的对齐能力。在零样本和少样本设置下，我们在QAES数据集上评估了八个LLMs，该数据集是首个公开可用的、具有特质层面标注的阿拉伯语AES资源。使用二次加权卡帕系数（Quadratic Weighted Kappa, QWK）和置信区间（Confidence Intervals）的实验结果表明，Fanar-1-9B-Instruct模型在零样本和少样本提示下均实现了最高的特质层面一致性（QWK = 0.28，CI = 0.41），且评分标准引导提示在所有特质和模型中都带来了稳定的性能提升。语篇层面的特质，如内容展开和文体风格，显示出最大的改进。这些发现证实，结构化的提示策略（而非仅依赖模型规模）能够实现有效的阿拉伯语AES。我们的研究提出了首个面向能力培养的阿拉伯语AES综合框架，并为低资源教育环境下的可扩展评估奠定了基础。

Abstract

This paper presents a novel prompt engineering framework for trait-specific Automatic Essay Scoring (AES) in Arabic, leveraging large language models (LLMs) under zero-shot and few-shot configurations. Addressing the scarcity of scalable, linguistically informed AES tools for Arabic, we introduce a three-tier prompting strategy (standard, hybrid, and rubric-guided) that guides LLMs in evaluating distinct language proficiency traits such as organization, vocabulary, development, and style. The hybrid approach simulates multi-agent evaluation with trait-specialist raters, while the rubric-guided method incorporates scored exemplars to enhance model alignment. In zero- and few-shot settings, we evaluate eight LLMs on the QAES dataset, the first publicly available Arabic AES resource with trait-level annotations. Experimental results using Quadratic Weighted Kappa (QWK) and Confidence Intervals show that Fanar-1-9B-Instruct achieves the highest trait-level agreement in both zero- and few-shot prompting (QWK = 0.28 and CI = 0.41), with rubric-guided prompting yielding consistent gains across all traits and models. Discourse-level traits such as Development and Style showed the greatest improvements. These findings confirm that structured prompting, not model scale alone, enables effective AES in Arabic. Our study presents the first comprehensive framework for proficiency-oriented Arabic AES and sets the foundation for scalable assessment in low-resource educational contexts.

Keywords: Automatic Essay Scoring, Arabic language, Large Language Models, Prompt Engineering, Zero-shot Learning, Few-shot Learning, Trait-specific Evaluation, QAES Dataset
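
Quadratic Weighted Kappa, the agreement metric reported in the abstract, has a standard definition: one minus the ratio of observed to chance-expected disagreement, where each disagreement is weighted by the squared distance between the two scores. The sketch below implements that textbook formula in plain Python. The function name and the sample scores are illustrative and are not taken from the paper or the QAES dataset.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two integer ratings, penalizing large disagreements more."""
    n = max_score - min_score + 1
    if n < 2 or not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("need two equal-length, non-empty ratings and >= 2 score levels")
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / total

    # Both raters used one identical score level: treat as perfect agreement.
    if weighted_expected == 0:
        return 1.0
    return 1.0 - weighted_observed / weighted_expected

human = [1, 2, 3]
model = [1, 2, 2]
qwk = quadratic_weighted_kappa(human, model, min_score=1, max_score=3)
print(round(qwk, 3))
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean agreement worse than chance.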

117. ❌ BEAVER: A Training-Free Hierarchical Prompt Compression Method via Structure-Aware Page Selection

Authors: Zhengpei Hu, Kai Li, Dapeng Fu, Chang Zeng, Yue Li, Yuanhao Tang, Jianqiang Huang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19635v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

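Scoring rationale: BEAVER focuses on the inference efficiency of long-context LLMs and proposes a training-free hierarchical prompt compression method. It is highly relevant to "Large Language Models" and "Context Window Extension" (10 points each), because it directly targets the bottlenecks introduced by expanding LLM context windows. It is somewhat related to "Speculative Decoding OR Inference Acceleration" (5 points), because the method aims to reduce inference latency but does not directly accelerate decoding. Other keywords such as MoE, SFT, and RAG are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

To address the inference-latency and information-utilization bottlenecks caused by expanding LLM context windows, the paper proposes BEAVER, a training-free hierarchical prompt compression method. It reduces latency by 26.4x on 128k contexts while achieving performance comparable to state-of-the-art methods.

Abstract translation (Chinese)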

大型语言模型（LLM）上下文窗口的指数级扩展释放了长文档理解的能力，但也带来了推理延迟和信息利用效率的严重瓶颈。现有的压缩方法常因激进的令牌剪枝而面临高昂的训练成本或语义碎片化问题。本文提出BEAVER，一种无需训练的新型框架，它将压缩从线性的令牌删除转向结构感知的层次化选择。BEAVER通过双路池化将可变长度上下文映射为密集的页面级张量，从而最大化硬件并行性；同时，通过一个结合语义与词汇双分支选择及句子平滑的混合规划器，保持语篇完整性。在四个长上下文基准测试上的广泛评估表明，BEAVER达到了与LongLLMLingua等最先进（SOTA）方法相当的性能。值得注意的是，在RULER基准测试中，BEAVER在多针检索任务中保持了高保真度，而基线方法则表现衰退。在效率方面，BEAVER在128k上下文长度上将延迟降低了26.4倍，为高吞吐量应用提供了一个可扩展的解决方案。我们的代码发布于https://cslikai.cn/BEAVER/。

Abstract

The exponential expansion of context windows in LLMs has unlocked capabilities for long-document understanding but introduced severe bottlenecks in inference latency and information utilization. Existing compression methods often suffer from high training costs or semantic fragmentation due to aggressive token pruning. In this paper, we propose BEAVER, a novel training-free framework that shifts compression from linear token removal to structure-aware hierarchical selection. BEAVER maximizes hardware parallelism by mapping variable-length contexts into dense page-level tensors via dual-path pooling, and preserves discourse integrity through a hybrid planner combining semantic and lexical dual-branch selection with sentence smoothing. Extensive evaluations on four long-context benchmarks demonstrate that BEAVER achieves comparable performance to state-of-the-art (SOTA) methods like LongLLMLingua. Notably, on the RULER benchmark, BEAVER maintains high fidelity in multi-needle retrieval where baselines deteriorate. Regarding efficiency, BEAVER reduces latency by 26.4x on 128k contexts, offering a scalable solution for high-throughput applications. Our code is available at https://cslikai.cn/BEAVER/.

Keywords: LLMs, context window, prompt compression, training-free, inference latency, long-document understanding, hierarchical selection, structure-aware
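
The abstract says BEAVER maps variable-length contexts into dense page-level tensors via dual-path pooling, but it does not spell out the two paths. The sketch below assumes they are mean pooling and max pooling over fixed-size pages of token vectors. That choice, the page size, and all function names are assumptions for illustration, not the released BEAVER code.

```python
def paginate(token_vectors, page_size):
    """Split a sequence of token vectors into consecutive pages (last page may be short)."""
    return [token_vectors[i:i + page_size] for i in range(0, len(token_vectors), page_size)]

def dual_path_pool(page):
    """Concatenate mean-pooled and max-pooled features of one page (assumed paths)."""
    dims = range(len(page[0]))
    mean_path = [sum(vec[d] for vec in page) / len(page) for d in dims]
    max_path = [max(vec[d] for vec in page) for d in dims]
    return mean_path + max_path

def page_tensor(token_vectors, page_size):
    """One fixed-width row per page, regardless of how many tokens the page holds."""
    return [dual_path_pool(page) for page in paginate(token_vectors, page_size)]

# Three toy 2-dimensional token embeddings, grouped into pages of two tokens.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
pages = page_tensor(tokens, page_size=2)
print(pages)  # [[2.0, 1.0, 3.0, 2.0], [5.0, 4.0, 5.0, 4.0]]
```

Because every page collapses to a row of the same width, pages can be scored in one batched operation, which is the kind of hardware parallelism the abstract refers to.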

118. ❌ MME-CoF-Pro: Evaluating Reasoning Coherence in Video Generative Models with Text and Visual Hints

Authors: Yu Qi, Xinyi Xu, Ziyu Guo, Siyuan Ma, Renrui Zhang, Xinyan Chen, Ruichuan An, Ruofan Xing, Jiayi Zhang, Haojie Huang, Pheng-Ann Heng, Jonathan Tremblay, Lawson L. S. Wong Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20194v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

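Scoring rationale: The paper focuses on evaluating reasoning coherence in video generative models and proposes the MME-CoF-Pro benchmark. It is unrelated to most keywords because it does not address the technical principles of large models, training methods, optimization techniques, or domain-specific applications. It is only indirectly related to a few keywords: 'Chain of Thought' and 'System 2 Thinking' (evaluation of the reasoning process, 5 points each), 'Hallucination Mitigation' (hallucination detection, 5 points), and 'Explainable AI' (analysis of model behavior, 5 points). No other keywords are addressed.

!!! tip deepseek-chat TL;DR

The study proposes the MME-CoF-Pro benchmark to evaluate reasoning coherence in video generative models. It finds that current models exhibit weak reasoning coherence, that text hints often cause inconsistency and hallucinated reasoning, and that visual hints benefit structured perceptual tasks but struggle with fine-grained perception.

Abstract translation (Chinese)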

视频生成模型展现出新兴的推理能力。为确保生成事件在帧间保持因果一致性以实现可靠部署——这一特性我们定义为推理连贯性——至关重要。为填补当前文献中推理连贯性评估的空白，我们提出了MME-CoF-Pro，这是一个用于评估视频模型推理连贯性的综合性视频推理基准。具体而言，MME-CoF-Pro包含16个类别的303个样本，涵盖从视觉逻辑到科学推理的广泛范畴。它引入了推理分数作为评估指标，用于衡量过程层面必要的中间推理步骤，并包含三种评估设置：（a）无提示（b）文本提示和（c）视觉提示，从而能够对推理提示引导的内在机制进行受控研究。对7个开源和闭源视频模型的评估结果揭示了以下发现：（1）视频生成模型表现出较弱的推理连贯性，且与生成质量脱钩。（2）文本提示虽能提升表面正确性，但常导致不一致和幻觉推理。（3）视觉提示有助于结构化感知任务，但在细粒度感知方面存在困难。项目网站：https://video-reasoning-coherence.github.io/

Abstract

Video generative models show emerging reasoning behaviors. It is essential to ensure that generated events remain causally consistent across frames for reliable deployment, a property we define as reasoning coherence. To bridge the gap left by the lack of reasoning coherence evaluation in the literature, we propose MME-CoF-Pro, a comprehensive video reasoning benchmark to assess reasoning coherence in video models. Specifically, MME-CoF-Pro contains 303 samples across 16 categories, ranging from visual logical to scientific reasoning. It introduces Reasoning Score as an evaluation metric for assessing the necessary process-level intermediate reasoning steps, and includes three evaluation settings: (a) no hint, (b) text hint, and (c) visual hint, enabling a controlled investigation into the underlying mechanisms of reasoning hint guidance. Evaluation results on 7 open- and closed-source video models reveal insights including: (1) Video generative models exhibit weak reasoning coherence, decoupled from generation quality. (2) Text hints boost apparent correctness but often cause inconsistency and hallucinated reasoning. (3) Visual hints benefit structured perceptual tasks but struggle with fine-grained perception. Website: https://video-reasoning-coherence.github.io/

Keywords: video generative models, reasoning coherence, evaluation benchmark, hallucination, visual reasoning, text hints, visual hints, MME-CoF-Pro

119. ❌ Deterministic Mode Proposals: An Efficient Alternative to Generative Sampling for Ambiguous Segmentation

Authors: Sebastian Gerard, Josephine Sullivan Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20191v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

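Scoring rationale: The paper focuses on handling uncertainty in ambiguous segmentation tasks such as medical image segmentation and proposes a deterministic mode proposal framework as an alternative to generative sampling. Its content is entirely unrelated to most keywords (such as LLM, MoE, RLHF, and RAG), which mainly concern large language models and related techniques. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves medical image segmentation (an application of AI in science). Since this is not its core focus, it receives 5 points (some relevance).

!!! tip deepseek-chat TL;DR

To address the inherent ambiguity of tasks such as medical image segmentation, the paper proposes deterministic mode proposal models that generate a fixed-size set of proposal masks in a single forward pass, significantly reducing inference time while achieving higher ground-truth coverage.

Abstract translation (Chinese)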

许多分割任务，例如医学图像分割或未来状态预测，本质上具有模糊性，这意味着存在多个同样正确的预测结果。当前方法通常依赖生成模型来捕捉这种不确定性。然而，通过这些方法识别分布的内在模式计算成本高昂，需要大量样本及事后聚类。本文中，我们将研究重点从随机采样转向直接生成可能的结果。我们提出了模式提议模型——一种确定性框架，能够在单次前向传播中高效生成固定数量的提议掩码集合。为处理冗余提议，我们将传统用于目标检测的置信度机制适配到分割掩码的高维空间中。我们的方法在显著减少推理时间的同时，比现有生成模型实现了更高的真实标注覆盖率。此外，我们证明了该模型无需知晓结果的完整分布即可进行训练，使其适用于现实世界数据集。最后，我们展示通过对预训练流模型的速度场进行分解，能够高效估算所提模式的先验概率。

Abstract

Many segmentation tasks, such as medical image segmentation or future state prediction, are inherently ambiguous, meaning that multiple predictions are equally correct. Current methods typically rely on generative models to capture this uncertainty. However, identifying the underlying modes of the distribution with these methods is computationally expensive, requiring large numbers of samples and post-hoc clustering. In this paper, we shift the focus from stochastic sampling to the direct generation of likely outcomes. We introduce mode proposal models, a deterministic framework that efficiently produces a fixed-size set of proposal masks in a single forward pass. To handle superfluous proposals, we adapt a confidence mechanism, traditionally used in object detection, to the high-dimensional space of segmentation masks. Our approach significantly reduces inference time while achieving higher ground-truth coverage than existing generative models. Furthermore, we demonstrate that our model can be trained without knowing the full distribution of outcomes, making it applicable to real-world datasets. Finally, we show that by decomposing the velocity field of a pre-trained flow model, we can efficiently estimate prior mode probabilities for our proposals.

Keywords: segmentation, ambiguous tasks, mode proposal models, deterministic framework, medical image segmentation, inference efficiency, generative models, confidence mechanism
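
The abstract mentions a confidence mechanism for discarding superfluous proposals and a ground-truth coverage metric, without defining either precisely. The sketch below shows one plausible reading: drop proposals below a confidence threshold, then count a ground-truth mask as covered if any remaining proposal overlaps it with sufficient IoU. The thresholds, the coverage definition, and the function names are assumptions for illustration, not the paper's evaluation code.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two flat binary masks of equal length."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0

def ground_truth_coverage(proposals, confidences, truths, min_conf=0.5, min_iou=0.6):
    """Fraction of ground-truth masks matched by at least one confident proposal."""
    kept = [p for p, c in zip(proposals, confidences) if c >= min_conf]
    covered = sum(1 for t in truths if any(iou(p, t) >= min_iou for p in kept))
    return covered / len(truths)

# A fixed-size set of three proposals from one (hypothetical) forward pass.
proposals = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
confidences = [0.9, 0.8, 0.1]  # the third proposal is treated as superfluous
truths = [[1, 1, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1]]

coverage = ground_truth_coverage(proposals, confidences, truths)
print(round(coverage, 3))
```

Lowering `min_conf` keeps more proposals and can only raise coverage, at the cost of returning more masks per input.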

120. ❌ Wildfire Spread Scenarios: Increasing Sample Diversity of Segmentation Diffusion Models with Training-Free Methods

Authors: Sebastian Gerard, Josephine Sullivan Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20188v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

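Scoring rationale: The paper studies sample diversity for diffusion models in segmentation tasks, with applications to wildfire spread, medical imaging, and autonomous driving. The keyword list is centered on large language models (LLMs), whereas this paper focuses on diffusion models in computer vision and has no direct connection to LLMs. The only potentially relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the paper involves wildfire prediction and medical diagnosis, which are scientific applications. Since these are not its core focus, it receives 5 points. Other keywords such as MoE, Scaling Laws, and RLHF do not apply.

!!! tip deepseek-chat TL;DR

The paper addresses the low sampling efficiency of segmentation diffusion models when predicting uncertain environments such as wildfire spread. It improves sample diversity with training-free sampling methods (such as particle guidance and SPELL) and validates them on multiple datasets.

Abstract translation (Chinese)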

在不确定环境中预测未来状态（如野火蔓延、医疗诊断或自动驾驶）需要能够考虑多种可能结果的模型。尽管扩散模型能有效学习此类多模态分布，但直接从中采样在计算上效率低下，可能需要数百次采样才能发现概率较低但仍具操作相关性的模态。本研究通过评估几种无需重新训练即可促进多样化预测的采样方法，应对高效样本模糊分割的挑战。我们将粒子引导和SPELL这两种原本为生成多样化自然图像设计的技术，适配至离散分割任务，并额外提出一种基于聚类的简易技术。我们在LIDC医学数据集、改进版Cityscapes数据集以及本文新提出的基于模拟的野火蔓延数据集MMFire上验证了这些方法。与直接采样相比，这些方法在MMFire上将HM IoU*指标最高提升了7.5%，在Cityscapes上提升了16.4%，表明无需训练的方法能够以极低的图像质量和运行时成本，有效提升分割扩散模型的样本多样性。代码与数据集：https://github.com/SebastianGer/wildfire-spread-scenarios

Abstract

Predicting future states in uncertain environments, such as wildfire spread, medical diagnosis, or autonomous driving, requires models that can consider multiple plausible outcomes. While diffusion models can effectively learn such multi-modal distributions, naively sampling from these models is computationally inefficient, potentially requiring hundreds of samples to find low-probability modes that may still be operationally relevant. In this work, we address the challenge of sample-efficient ambiguous segmentation by evaluating several training-free sampling methods that encourage diverse predictions. We adapt two techniques, particle guidance and SPELL, originally designed for the generation of diverse natural images, to discrete segmentation tasks, and additionally propose a simple clustering-based technique. We validate these approaches on the LIDC medical dataset, a modified version of the Cityscapes dataset, and MMFire, a new simulation-based wildfire spread dataset introduced in this paper. Compared to naive sampling, these approaches increase the HM IoU* metric by up to 7.5% on MMFire and 16.4% on Cityscapes, demonstrating that training-free methods can be used to efficiently increase the sample diversity of segmentation diffusion models with little cost to image quality and runtime. Code and dataset: https://github.com/SebastianGer/wildfire-spread-scenarios

Keywords: diffusion models, segmentation, sample diversity, wildfire spread, training-free sampling, particle guidance, SPELL, MMFire dataset
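
The abstract mentions a simple clustering-based technique for obtaining diverse predictions but gives no details. As a stand-in, the sketch below applies greedy "leader" clustering to a batch of sampled masks: a sample becomes a new representative only if it differs enough from every representative kept so far. This is not the paper's technique; the IoU threshold, the set-based mask encoding, and the function names are assumptions for illustration.

```python
def iou(mask_a, mask_b):
    """IoU of two masks given as sets of foreground pixel indices."""
    union = mask_a | mask_b
    return len(mask_a & mask_b) / len(union) if union else 1.0

def diverse_representatives(samples, merge_iou=0.6):
    """Keep one representative per group of near-duplicate samples (leader clustering)."""
    representatives = []
    for sample in samples:
        if all(iou(sample, rep) < merge_iou for rep in representatives):
            representatives.append(sample)
    return representatives

# Four toy samples, as if drawn from a segmentation diffusion model; three are near-duplicates.
samples = [{0, 1}, {0, 1}, {0, 1, 2}, {2, 3}]
modes = diverse_representatives(samples)
print(modes)  # [{0, 1}, {2, 3}]
```

The result depends on sample order and on `merge_iou`, so a stricter clustering method may be preferable when low-probability modes matter.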

121. ❌ CoVR-R: Reason-Aware Composed Video Retrieval

Authors: Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20190v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	5.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

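Scoring rationale: The paper proposes a zero-shot video retrieval method built on large multimodal models. Its core innovation is a reasoning mechanism that handles the implicit consequences of the edit text. It is highly relevant to 'Chain of Thought' and 'System 2 Thinking' (10 points each), because multi-step and in-depth reasoning are central to the method. It is relevant to 'Large Language Models' (8 points) because it uses large multimodal models, and to 'Explainable AI' (8 points) because it emphasizes interpretability. It is somewhat related to 'Factuality' (5 points) because it verifies effect factuality. Other keywords such as MoE, quantization, and RAG are unrelated to the paper (0 points).

!!! tip deepseek-chat TL;DR

For composed video retrieval, the paper proposes a zero-shot reasoning approach based on large multimodal models that improves retrieval by explicitly reasoning about the causal and temporal consequences of an edit, and validates it on a new benchmark, CoVR-Reason.

Abstract translation (Chinese)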

组合视频检索（Composed Video Retrieval, CoVR）旨在根据参考视频和文本修改指令找到目标视频。先前的研究假设修改文本已完全指定视觉变化，忽略了编辑行为可能引发的后续效应与隐含后果（例如运动、状态转换、视角或时长线索）。我们认为成功的CoVR需要对这类后续效应进行推理。本文提出一种推理优先的零样本方法，利用大型多模态模型实现：（i）推断编辑行为所隐含的因果与时序后果，（ii）将推理生成的查询与候选视频进行对齐，而无需任务特定的微调。为评估CoVR中的推理能力，我们同时提出了CoVR-Reason基准数据集，该数据集为每个（参考视频，修改文本，目标视频）三元组配备了结构化的内部推理轨迹，并设置了需要预测后续效应而非关键词匹配的挑战性干扰项。实验表明，我们的零样本方法在召回率@K指标上优于强检索基线，尤其在隐含效应子集上表现突出。自动评估与人工分析均证实，我们的检索结果具有更高的步骤一致性与效应事实性。研究结果表明，通过将因果与时序后续效应显式纳入考量，将推理能力融入通用多模态模型可实现有效的CoVR。这降低了对任务特定监督的依赖，提升了对挑战性隐含效应案例的泛化能力，并增强了检索结果的可解释性。这些成果为可解释的视频搜索提供了一个可扩展且具有原则性的框架。模型、代码与基准数据集已发布于https://github.com/mbzuai-oryx/CoVR-R。

Abstract

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analysis confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.

Keywords: Composed Video Retrieval, Large Multimodal Models, Zero-shot Reasoning, Causal Reasoning, Temporal Reasoning, Implicit Effects, Explainable Video Search, Benchmark Evaluation
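
Recall at K, the retrieval metric cited in the abstract, is the fraction of queries whose target appears among the top K ranked candidates. The sketch below computes it for the single-target case that composed video retrieval implies (one target video per query). The video IDs and rankings are invented for illustration and are not from CoVR-Reason.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries whose single target is among the top-k ranked candidates."""
    if len(rankings) != len(targets) or not targets:
        raise ValueError("need one ranking per target and at least one query")
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return hits / len(targets)

# One ranked candidate list per (reference video, edit text) query; IDs are made up.
rankings = [
    ["v3", "v1", "v7"],
    ["v2", "v9", "v4"],
    ["v5", "v6", "v8"],
]
targets = ["v1", "v4", "v0"]  # the third target was never retrieved

for k in (1, 2, 3):
    print(k, round(recall_at_k(rankings, targets, k), 3))
```

Recall at K never decreases as K grows, which is why results are usually reported at several cutoffs.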

122. ❌ MuSteerNet: Human Reaction Generation from Videos via Observation-Reaction Mutual Steering

Authors: Yuan Zhou, Yongzhi Li, Yanqi Dai, Xingyu Zhu, Yi Tan, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20187v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

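Scoring rationale: MuSteerNet targets video-driven 3D human reaction generation, a computer vision and graphics problem: synthesizing 3D human motion that matches the observed video sequence. It does not involve large language models (LLMs), innovations in core deep learning techniques, or AI for Science applications, so no keyword is relevant to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes MuSteerNet, a framework that uses observation-reaction mutual steering to address the relational distortion between visual observations and reaction types in video-driven human reaction generation, improving how well the generated 3D reaction motions match the video content.

Abstract Translation (Chinese)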

视频驱动的人类反应生成旨在合成直接对观测视频序列作出反应的3D人体动作，这对于构建类人交互式AI系统至关重要。然而，现有方法往往无法有效利用视频输入来引导人类反应合成，导致生成的反应动作与视频序列内容不匹配。我们发现这一局限源于视觉观测与反应类型之间存在严重的关系扭曲。鉴于此，我们提出MuSteerNet，这是一个简单而有效的框架，通过观测-反应相互引导从视频生成3D人类反应。具体而言，我们首先提出一种原型反馈引导机制，通过基于从人类反应中学习的原型向量指导的门控增量校正调制器和关系边界约束来优化视觉观测，从而缓解关系扭曲。随后，我们引入双耦合反应细化模块，充分利用校正后的视觉线索进一步引导生成反应动作的精细化，从而有效提升反应质量，并使MuSteerNet能够实现具有竞争力的性能。大量实验与消融研究验证了我们方法的有效性。代码即将发布：https://github.com/zhouyuan888888/MuSteerNet。

Abstract

Video-driven human reaction generation aims to synthesize 3D human motions that directly react to observed video sequences, which is crucial for building human-like interactive AI systems. However, existing methods often fail to effectively leverage video inputs to steer human reaction synthesis, resulting in reaction motions that are mismatched with the content of video sequences. We reveal that this limitation arises from a severe relational distortion between visual observations and reaction types. In light of this, we propose MuSteerNet, a simple yet effective framework that generates 3D human reactions from videos via observation-reaction mutual steering. Specifically, we first propose a Prototype Feedback Steering mechanism to mitigate relational distortion by refining visual observations with a gated delta-rectification modulator and a relational margin constraint, guided by prototypical vectors learned from human reactions. We then introduce Dual-Coupled Reaction Refinement that fully leverages rectified visual cues to further steer the refinement of generated reaction motions, thereby effectively improving reaction quality and enabling MuSteerNet to achieve competitive performance. Extensive experiments and ablation studies validate the effectiveness of our method. Code coming soon: https://github.com/zhouyuan888888/MuSteerNet.

Keywords: Human Reaction Generation, 3D Human Motions, Video-driven, Observation-Reaction Mutual Steering, Prototype Feedback Steering, Dual-Coupled Reaction Refinement, Relational Distortion, Interactive AI Systems

123. ❌ LagerNVS: Latent Geometry for Fully Neural Real-time Novel View Synthesis

Authors: Stanislaw Szymanowicz, Minghao Chen, Jianyuan Wang, Christian Rupprecht, Andrea Vedaldi Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20176v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

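Scoring rationale: The paper addresses 3D novel view synthesis in computer vision using neural networks, but it does not involve large language models, innovations in core deep learning techniques, or scientific applications. Every keyword concerns large language models, core deep learning techniques, or AI for Science, whereas this is purely computer vision and 3D reconstruction research, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes LagerNVS, a neural network that uses 3D-aware latent features for real-time, high-quality novel view synthesis, reaching a state-of-the-art 31.4 PSNR on the Re10k dataset.

Abstract Translation (Chinese)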

近期研究表明，神经网络无需显式三维重建即可完成如新视角合成（Novel View Synthesis，NVS）等三维任务。尽管如此，我们认为在此类网络设计中引入强烈的三维归纳偏置仍然具有重要价值。为证明这一点，我们提出了LagerNVS——一种基于“三维感知”潜在特征的新视角合成编码器-解码器神经网络。该编码器由经过显式三维监督预训练的三维重建网络初始化，并搭配轻量级解码器，通过光度损失进行端到端训练。LagerNVS在确定性前馈式新视角合成任务中（包括在Re10k数据集上达到31.4 PSNR）实现了最先进的性能，无论相机参数已知与否均可实时渲染，能够泛化至真实场景数据，并可与扩散解码器结合实现生成式外推。

Abstract

Recent work has shown that neural networks can perform 3D tasks such as Novel View Synthesis (NVS) without explicit 3D reconstruction. Even so, we argue that strong 3D inductive biases are still helpful in the design of such networks. We show this point by introducing LagerNVS, an encoder-decoder neural network for NVS that builds on '3D-aware' latent features. The encoder is initialized from a 3D reconstruction network pre-trained using explicit 3D supervision. This is paired with a lightweight decoder, and trained end-to-end with photometric losses. LagerNVS achieves state-of-the-art deterministic feed-forward Novel View Synthesis (including 31.4 PSNR on Re10k), with and without known cameras, renders in real time, generalizes to in-the-wild data, and can be paired with a diffusion decoder for generative extrapolation.
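
To put the reported 31.4 PSNR in context, the sketch below computes PSNR with its standard definition, 10 · log10(MAX² / MSE), and inverts it to get the root-mean-square pixel error that a given PSNR implies. The function names and the assumption that pixel values lie in [0, 1] are illustrative only; the abstract does not describe the evaluation code used for LagerNVS.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(reference) != len(rendered) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse_for_psnr(psnr_db, max_value=1.0):
    """Root-mean-square pixel error that corresponds to a given PSNR."""
    return max_value * 10.0 ** (-psnr_db / 20.0)
```

Under these assumptions, `rmse_for_psnr(31.4)` is roughly 0.027, so a 31.4 dB result corresponds to an RMS error of under 3% of the pixel range.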

Keywords: Novel View Synthesis, 3D reconstruction, neural networks, latent features, real-time rendering, photometric losses, encoder-decoder, state-of-the-art

124. ❌ Improving Image-to-Image Translation via a Rectified Flow Reformulation

Authors: Satoshi Iizuka, Shun Okamoto, Kazuhiro Fukui Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20186v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

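Scoring rationale: The paper focuses on image-to-image translation and video restoration and proposes a rectified-flow-based reformulation (I2I-RFR), which places it in computer vision and generative modeling. All scoring keywords concern large language models, core deep learning techniques, AI for Science, and similar topics, none of which the paper touches, so every keyword has a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes Image-to-Image Rectified Flow Reformulation (I2I-RFR), which recasts standard regression networks as continuous-time transport models. It keeps the training pipeline simple while enabling progressive refinement at inference time, improving performance on multiple image translation and video restoration tasks, particularly in perceptual quality and detail preservation.

Abstract Translation (Chinese)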

本研究提出图像到图像整流流重构方法（Image-to-Image Rectified Flow Reformulation, I2I-RFR），这是一种实用的插件式重构方案，可将标准I2I回归网络转化为连续时间传输模型。尽管像素级I2I回归方法具有简单、稳定且易于跨任务适配的优点，但其在处理不适定和多模态目标时往往会产生过度平滑的结果；而生成式替代方案通常需要额外组件、任务特定调优以及更复杂的训练与推理流程。我们的方法通过通道拼接方式将带噪声的真实目标图像与主干网络输入结合，并优化一种简单的t加权像素损失函数。该目标函数通过诱导的速度场可解释为整流流模型，从而在推理时支持基于常微分方程（ODE）的渐进优化，同时基本保留标准监督训练流程。在多数情况下，采用I2I-RFR仅需扩展输入通道，且推理过程可通过少量显式求解步骤（如3步）完成而无需蒸馏。在多种图像到图像转换和视频修复任务上的大量实验表明，I2I-RFR能普遍提升不同任务和主干网络的性能，在感知质量与细节保持方面提升尤为显著。总体而言，I2I-RFR为传统I2I模型提供了一种轻量化的连续时间优化路径，无需依赖复杂的生成式流程。

Abstract

In this work, we propose Image-to-Image Rectified Flow Reformulation (I2I-RFR), a practical plug-in reformulation that recasts standard I2I regression networks as continuous-time transport models. While pixel-wise I2I regression is simple, stable, and easy to adapt across tasks, it often over-smooths ill-posed and multimodal targets, whereas generative alternatives often require additional components, task-specific tuning, and more complex training and inference pipelines. Our method augments the backbone input by channel-wise concatenation with a noise-corrupted version of the ground-truth target and optimizes a simple t-reweighted pixel loss. This objective admits a rectified-flow interpretation via an induced velocity field, enabling ODE-based progressive refinement at inference time while largely preserving the standard supervised training pipeline. In most cases, adopting I2I-RFR requires only expanding the input channels, and inference can be performed with a few explicit solver steps (e.g., 3 steps) without distillation. Extensive experiments across multiple image-to-image translation and video restoration tasks show that I2I-RFR generally improves performance across a wide range of tasks and backbones, with particularly clear gains in perceptual quality and detail preservation. Overall, I2I-RFR provides a lightweight way to incorporate continuous-time refinement into conventional I2I models without requiring a heavy generative pipeline.
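
The few-step ODE refinement mentioned in the abstract can be illustrated with a minimal Euler integrator. This is a sketch under stated assumptions, not the authors' implementation: it assumes the common rectified-flow convention in which t = 0 is pure noise, t = 1 is the target, and the interpolation is linear, so a network that predicts the clean target induces the velocity (prediction − x_t) / (1 − t). The names `euler_refine`, `predict_target`, and `toy_predictor` are hypothetical, and images are flattened to plain lists of floats.

```python
def euler_refine(predict_target, source, noise, num_steps=3):
    """Integrate dx/dt = (x1_hat - x_t) / (1 - t) from t=0 to t=1 with Euler steps.

    predict_target(source, x_t, t) is a hypothetical stand-in for a backbone
    that sees the source image plus the current noisy estimate and returns a
    prediction of the clean target.
    """
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division below is safe
        x1_hat = predict_target(source, x, t)
        # Velocity induced by the target prediction (assumed convention).
        velocity = [(p - xi) / (1.0 - t) for p, xi in zip(x1_hat, x)]
        x = [xi + dt * v for xi, v in zip(x, velocity)]
    return x


def toy_predictor(source, x_t, t):
    """Toy stand-in for a trained network: always predicts 2 * source."""
    return [2.0 * s for s in source]
```

Under this convention the last Euler step lands (up to floating-point rounding) on the network's final prediction, while earlier steps move the estimate only part of the way, which is what makes the refinement progressive.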

Keywords: Image-to-Image Translation, Rectified Flow, Continuous-time Transport Models, ODE-based Progressive Refinement, Perceptual Quality, Video Restoration, Supervised Training, Inference Acceleration

125. ❌ TinyML Enhances CubeSat Mission Capabilities

Authors: Luigi Capogrosso, Michele Magno Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20174v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	10.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

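Scoring rationale: The paper focuses on applying TinyML and convolutional neural networks to CubeSat Earth observation missions; its core is model compression and hardware optimization. It is highly relevant only to 'Quantization OR Model Compression OR Low-bit Weights' (it uses INT8 quantization) and partly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (an application of AI in a scientific domain, though not bioinformatics or cheminformatics). All other keywords concern large language model (LLM) techniques and are unrelated to the paper's focus on CNNs and embedded AI. Both relevant keywords carry a weight of 0.0 in the table above, so they add nothing to the total score.

!!! tip deepseek-chat TL;DR

The paper presents a TinyML-based pipeline for optimizing and deploying CNN models for efficient image classification onboard CubeSats; pruning and quantization substantially reduce memory use and energy consumption while maintaining acceptable accuracy.

Abstract Translation (Chinese)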

传统地球观测（EO）任务通常依赖将卫星采集的原始或经最低限度处理的影像传输至地面站进行密集计算分析。然而，由于立方星（CubeSat）系统在星载嵌入式处理器性能、能源供给和通信带宽方面存在严格限制，这一范式难以适用。为克服这些限制，本文提出了一种基于微型机器学习（TinyML）的卷积神经网络（ConvNets）模型优化与部署流程，用于实现星载图像分类，从而在立方星级别的约束条件下实现精确、高能效且硬件感知的推理。我们的流程集成了结构化迭代剪枝、训练后INT8量化以及硬件感知算子映射技术，以压缩模型并使其与意法半导体STM32N6微控制器的异构计算架构对齐。该微控制器单元（MCU）集成了新型Arm Cortex-M55内核与Neural-ART神经处理单元（NPU），为立方星星载计算机提供了一个现实的参考平台。本文在三个地球观测基准数据集（即EuroSAT、RS_C11、MEDIC）和四种模型（即SqueezeNet、MobileNetV3、EfficientNet、MCUNetV1）上评估了所提出的方法。实验表明，优化后模型的RAM使用量平均降低89.55%，闪存占用平均降低70.09%，在保持任务可接受精度（与Float32基线相比精度下降范围在0.4至8.6个百分点之间）的同时，显著降低了下行链路带宽需求。单次推理能耗范围在0.68 mJ至6.45 mJ之间，延迟范围在3.22 ms至30.38 ms之间。这些结果完全满足了高效星载地球观测处理所要求的严格能量预算和实时性约束。

Abstract

Earth observation (EO) missions traditionally rely on transmitting raw or minimally processed imagery from satellites to ground stations for computationally intensive analysis. This paradigm is infeasible for CubeSat systems due to stringent constraints on the onboard embedded processors, energy availability, and communication bandwidth. To overcome these limitations, the paper presents a TinyML-based Convolutional Neural Networks (ConvNets) model optimization and deployment pipeline for onboard image classification, enabling accurate, energy-efficient, and hardware-aware inference under CubeSat-class constraints. Our pipeline integrates structured iterative pruning, post-training INT8 quantization, and hardware-aware operator mapping to compress models and align them with the heterogeneous compute architecture of the STM32N6 microcontroller from STMicroelectronics. This Microcontroller Unit (MCU) integrates a novel Arm Cortex-M55 core and a Neural-ART Neural Processing Unit (NPU), providing a realistic proxy for CubeSat onboard computers. The paper evaluates the proposed approach on three EO benchmark datasets (i.e., EuroSAT, RS_C11, MEDIC) and four models (i.e., SqueezeNet, MobileNetV3, EfficientNet, MCUNetV1). We demonstrate an average reduction in RAM usage of 89.55% and Flash memory of 70.09% for the optimized models, significantly decreasing downlink bandwidth requirements while maintaining task-acceptable accuracy (with a drop ranging from 0.4 to 8.6 percentage points compared to the Float32 baseline). The energy consumption per inference ranges from 0.68 mJ to 6.45 mJ, with latency spanning from 3.22 ms to 30.38 ms. These results fully satisfy the stringent energy budgets and real-time constraints required for efficient onboard EO processing.
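
As a rough illustration of the post-training INT8 quantization step named in the abstract, the sketch below applies symmetric per-tensor quantization to a list of float weights. It is a generic textbook scheme chosen for illustration, not the STM32N6 toolchain or the authors' pipeline; the function names and the choice of a single per-tensor scale are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]
```

For the toy weights `[0.5, -1.27, 0.0, 1.0]` the scale is about 0.01 and the codes are `[50, -127, 0, 100]`. Each weight is then stored in 1 byte instead of the 4 bytes of a Float32 value, at the cost of a rounding error of at most half the scale per weight.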

Keywords: TinyML, CubeSat, Convolutional Neural Networks, Model Compression, INT8 Quantization, Onboard Image Classification, Energy-efficient Inference, Earth Observation

126. ❌ Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Authors: Emiel Hoogeboom, David Ruhe, Jonathan Heek, Thomas Mensink, Tim Salimans Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20155v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

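Scoring rationale: The paper studies a distillation method for discrete diffusion models (D-MMD), which falls under generative modeling, whereas all keywords target large language models (LLMs) and related techniques (training, alignment, inference optimization, applications, and so on). The paper does not involve LLMs, generative models other than diffusion models, or AI for Science applications, so it has no direct connection to any keyword.

!!! tip deepseek-chat TL;DR

The paper proposes D-MMD, a distillation method for discrete diffusion models that addresses the quality degradation seen in existing methods during distillation. Its effectiveness is validated on text and image datasets, and the distilled models can even outperform the original models.

Abstract Translation (Chinese)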

目前离散扩散模型的蒸馏仍存在困难。相比之下，连续扩散模型领域已有多种蒸馏方法能将采样步骤大幅减少至数次。我们提出的方法——离散矩匹配蒸馏（Discrete Moment Matching Distillation, D-MMD），借鉴了连续域中已被验证极为成功的思路。在先前离散蒸馏方法失效的情况下，D-MMD仍能保持高质量与多样性（在给定充足采样步骤时）。这一优势在文本和图像数据集上均得到验证。此外，新蒸馏出的生成器性能可超越其教师模型。

Abstract

It is currently difficult to distill discrete diffusion models. In contrast, continuous diffusion literature has many distillation approaches methods that can reduce sampling steps to a handful. Our method, Discrete Moment Matching Distillation (D-MMD), leverages ideas that have been highly successful in the continuous domain. Whereas previous discrete distillation methods collapse, D-MMD maintains high quality and diversity (given sufficient sampling steps). This is demonstrated on both text and image datasets. Moreover, the newly distilled generators can outperform their teachers.

Keywords: discrete diffusion models, distillation, D-MMD, moment matching, text generation, image generation, model quality, sampling steps

127. ❌ EgoForge: Goal-Directed Egocentric World Simulator

Authors: Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingxuan Li, Houze Yang, Wenqi Jia, Yijiang Li, Tianjiao Yu, James Matthew Rehg, Xu Cao, Ismini Lourentzou Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20169v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

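Scoring rationale: EgoForge applies generative world models to egocentric video simulation; its core contribution is a goal-directed egocentric world simulator. It is highly relevant to 'World Models AND General World Models' (10/10) because the abstract explicitly mentions 'Generative world models' and the work builds a world simulator. The other keywords mainly concern large language model (LLM) internals, training methods, reasoning techniques, agent systems, and model optimization, whereas this paper studies video generation and world simulation and does not involve language models, agents, training techniques, reasoning methods, or AI for Science applications, so all other keywords score 0. The one relevant keyword carries a weight of 0.0 in the table above, so it adds nothing to the total score.

!!! tip deepseek-chat TL;DR

The paper studies how to generate coherent, goal-directed egocentric video from a single egocentric image and a high-level instruction. It proposes the EgoForge simulator and VideoDiffusionNFT, a trajectory-level reward-guided refinement method; experiments show gains over baselines in semantic alignment, geometric stability, and motion fidelity.

Abstract Translation (Chinese)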

生成式世界模型在动态环境模拟方面展现出潜力，但以自我为中心的视频仍因视角快速变化、频繁的手-物交互以及演化过程依赖潜在人类意图的目标导向性操作而充满挑战。现有方法要么局限于场景演化有限的手部中心指令合成，要么在不建模动作动态的情况下进行静态视角转换，或依赖密集监督（如相机轨迹、长视频前缀、同步多相机捕捉等）。本研究提出EgoForge——一种自我中心目标导向世界模拟器，它仅需极简静态输入即可生成连贯的第一人称视频序列：单张自我中心图像、一条高层级指令及一个可选的外部辅助视角。为提升意图对齐与时间一致性，我们提出VideoDiffusionNFT，这是一种轨迹级奖励引导优化方法，在扩散采样过程中同步优化目标完成度、时间因果性、场景一致性与感知保真度。大量实验表明，EgoForge在语义对齐、几何稳定性与运动保真度上均优于现有基线模型，并在真实世界智能眼镜实验中展现出鲁棒性能。

Abstract

Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision, such as camera trajectories, long video prefixes, synchronized multicamera capture, etc. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and robust performance in real-world smart-glasses experiments.

Keywords: egocentric world simulator, generative world models, video rollouts, goal-directed, VideoDiffusionNFT, temporal consistency, first-person video, smart-glasses experiments

128. ❌ Generalizable NGP-SR: Generalizable Neural Radiance Fields Super-Resolution via Neural Graph Primitives

Authors: Wanqi Yuan, Omkar Sharad Mayekar, Connor Pennington, Nianyi Li Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20128v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

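Scoring rationale: The paper studies super-resolution for Neural Radiance Fields (NeRF), a computer vision and graphics topic focused on 3D reconstruction and rendering. It does not involve large language models (LLMs), innovations in core deep learning techniques, or applications of AI in scientific domains, so no keyword is relevant to the paper.

!!! tip deepseek-chat TL;DR

The paper proposes Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs a high-resolution radiance field directly from low-resolution images. It enables generalizable high-resolution novel view synthesis without per-scene optimization, and experiments on multiple datasets show better reconstruction quality and runtime efficiency than existing methods.

Abstract Translation (Chinese)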

神经辐射场（NeRF）能够实现照片级真实感的新视角合成，但在需要高分辨率（HR）渲染时成本高昂，因为高分辨率输出需要密集采样和更高容量的模型。此外，直接在二维空间中对单视角渲染结果进行简单超分辨率处理往往会破坏多视角一致性。我们提出了一种可泛化的NGP-SR框架，这是一种三维感知的超分辨率方法，能够直接从低分辨率（LR）姿态图像中重建高分辨率辐射场。该方法基于神经图形基元（Neural Graphics Primitives, NGP）构建，通过将三维坐标和习得的局部纹理标记作为辐射预测的条件，实现了在辐射场内恢复高频细节，并能生成视角一致的高分辨率新视角，而无需依赖外部高分辨率参考或进行后处理的二维上采样。重要的是，我们的模型具有泛化能力：一旦训练完成，即可应用于未见过的场景，并可从新视角进行渲染，无需针对每个场景进行单独优化。在多个数据集上的实验表明，与先前基于NeRF的超分辨率方法相比，NGP-SR在重建质量和运行效率上均取得了一致性提升，为可扩展的高分辨率新视角合成提供了一个实用解决方案。

Abstract

Neural Radiance Fields (NeRF) achieve photorealistic novel view synthesis but become costly when high-resolution (HR) rendering is required, as HR outputs demand dense sampling and higher-capacity models. Moreover, naively super-resolving per-view renderings in 2D often breaks multi-view consistency. We propose Generalizable NGP-SR, a 3D-aware super-resolution framework that reconstructs an HR radiance field directly from low-resolution (LR) posed images. Built on Neural Graphics Primitives (NGP), NGP-SR conditions radiance prediction on 3D coordinates and learned local texture tokens, enabling recovery of high-frequency details within the radiance field and producing view-consistent HR novel views without external HR references or post-hoc 2D upsampling. Importantly, our model is generalizable: once trained, it can be applied to unseen scenes and rendered from novel viewpoints without per-scene optimization. Experiments on multiple datasets show that NGP-SR consistently improves both reconstruction quality and runtime efficiency over prior NeRF-based super-resolution methods, offering a practical solution for scalable high-resolution novel view synthesis.

Keywords: Neural Radiance Fields, Super-Resolution, Neural Graphics Primitives, 3D Reconstruction, Novel View Synthesis, Generalizable Model, High-Resolution Rendering, View Consistency

129. ❌ Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection

Authors: Hui Zhong, Yichun Gao, Luyan Liu, Xusen Guo, Zhaonian Kuang, Qiming Zhang, Xinhu Zheng Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20143v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

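Scoring rationale: The paper "Synergistic Perception and Generative Recomposition: A Multi-Agent Orchestration for Expert-Level Building Inspection" proposes FacadeFixer, a multi-agent framework for building facade defect inspection. The framework coordinates detection, segmentation, and generative agents to handle defect perception collaboratively, which makes it highly relevant to 'Multi-agent Systems OR Agent Coordination' (10/10): the core of the paper is a multi-agent system designed for collaborative work. The paper is also an application of computer vision to infrastructure inspection, which can be seen as AI applied to engineering science, so it is partly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (5/10). However, it does not involve large language models (LLMs), innovations in core deep learning techniques (such as MoE, scaling laws, or PEFT), reasoning methods (such as CoT or System 2 thinking), alignment techniques (such as RLHF or instruction tuning), or other LLM-specific keywords, so those keywords score 0. Both relevant keywords carry a weight of 0.0 in the table above, so they add nothing to the total score.

!!! tip deepseek-chat TL;DR

To address the challenges that variable geometry, complex backgrounds, and scarce annotated data pose for building facade defect inspection, the paper proposes FacadeFixer, a collaborative multi-agent framework that coordinates detection, segmentation, and generative agents for defect perception and semantic recomposition. It significantly improves defect detection performance and generates high-quality augmented data.

Abstract Translation (Chinese)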

建筑立面缺陷检测是结构健康监测与可持续城市维护的基础，但由于其几何形态极度多变、在复杂背景下对比度低，以及复合型缺陷（如裂缝与剥落并存）固有的复杂性，该任务仍面临严峻挑战。这些特性导致严重的像素不平衡与特征模糊问题，加之高质量像素级标注数据极度匮乏，共同阻碍了现有检测与分割模型的泛化能力。为弥补这些不足，我们提出FacadeFixer，一个统一的多智能体框架，将缺陷感知视为协同推理任务而非孤立识别。具体而言，FacadeFixer协调专用于检测与分割的智能体以处理多类型缺陷干扰，并与生成智能体协同工作，实现语义重组。该过程将复杂缺陷从嘈杂背景中解耦，并将其逼真合成到多样化的洁净纹理上，从而生成带有精确专家级掩码的高保真增强数据。为此，我们引入了一个涵盖六种主要立面类别的综合性多任务数据集，其中包含像素级标注。大量实验表明，FacadeFixer显著优于当前最先进的基线方法。具体而言，它在捕捉像素级结构异常方面表现优异，并突显了生成式合成作为基础设施检测中数据稀缺问题的强有力解决方案。我们的代码与数据集将公开发布。

Abstract

Building facade defect inspection is fundamental to structural health monitoring and sustainable urban maintenance, yet it remains a formidable challenge due to extreme geometric variability, low contrast against complex backgrounds, and the inherent complexity of composite defects (e.g., cracks co-occurring with spalling). Such characteristics lead to severe pixel imbalance and feature ambiguity, which, coupled with the critical scarcity of high-quality pixel-level annotations, hinder the generalization of existing detection and segmentation models. To address gaps, we propose FacadeFixer, a unified multi-agent framework that treats defect perception as a collaborative reasoning task rather than isolated recognition. Specifically, FacadeFixer orchestrates specialized agents for detection and segmentation to handle multi-type defect interference, working in tandem with a generative agent to enable semantic recomposition. This process decouples intricate defects from noisy backgrounds and realistically synthesizes them onto diverse clean textures, generating high-fidelity augmented data with precise expert-level masks. To support this, we introduce a comprehensive multi-task dataset covering six primary facade categories with pixel-level annotations. Extensive experiments demonstrate that FacadeFixer significantly outperforms state-of-the-art (SOTA) baselines. Specifically, it excels in capturing pixel-level structural anomalies and highlights generative synthesis as a robust solution to data scarcity in infrastructure inspection. Our code and dataset will be made publicly available.

Keywords: building facade defect inspection, multi-agent framework, collaborative reasoning, generative recomposition, data augmentation, pixel-level annotations, structural health monitoring, semantic synthesis

130. ❌ Preference-Guided Debiasing for No-Reference Enhancement Image Quality Assessment

Authors: Shiqi Gao, Kang Fu, Zitong Xu, Huiyu Duan, Xiongkuo Min, Jia Wang, Guangtao Zhai Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20086v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

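Scoring rationale: This paper focuses on image quality assessment (IQA) in computer vision, specifically no-reference quality assessment of enhanced images (NR-EIQA). It proposes a preference-guided debiasing framework that uses supervised contrastive learning to learn an enhancement-preference embedding space and removes the nuisance component introduced by enhancement algorithms. Every scoring keyword targets large language models (LLMs), innovations in core deep learning techniques, or applications of AI in science, whereas this paper addresses a conventional computer vision task and touches none of those areas, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This paper addresses the tendency of no-reference quality assessment models for enhanced images to overfit to the patterns of specific enhancement algorithms. It proposes a preference-guided debiasing framework that builds an enhancement-preference embedding space through supervised contrastive learning and removes algorithm-related nuisance components, improving the model's robustness and cross-algorithm generalization.

Abstract Translation

Current no-reference image quality assessment (NR-IQA) models for enhanced images often generalize poorly because they tend to overfit to the distinctive patterns of specific enhancement algorithms instead of assessing genuine perceptual quality. To solve this problem, we propose a preference-guided debiasing framework for no-reference enhanced image quality assessment (EIQA). Specifically, we first use supervised contrastive learning to build a continuous enhancement-preference embedding space in which images produced by similar enhancement styles are encouraged to have closer feature representations. On this basis, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to attend to algorithm-invariant perceptual quality cues rather than enhancement-specific visual fingerprints. To promote stable optimization, we adopt a two-stage training strategy: first learn the enhancement-preference space, then perform debiased quality prediction. Extensive experiments on public EIQA benchmarks show that the proposed method effectively mitigates algorithm-induced representation bias and achieves better robustness and cross-algorithm generalization than existing methods.

Abstract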

Current no-reference image quality assessment (NR-IQA) models for enhanced images often struggle to generalize, as they tend to overfit to the distinct patterns of specific enhancement algorithms rather than evaluating genuine perceptual quality. To address this issue, we propose a preference-guided debiasing framework for no-reference enhancement image quality assessment (EIQA). Specifically, we first learn a continuous enhancement-preference embedding space using supervised contrastive learning, where images generated by similar enhancement styles are encouraged to have closer representations. Based on this, we further estimate the enhancement-induced nuisance component contained in the raw quality representation and remove it before quality regression. In this way, the model is guided to focus on algorithm-invariant perceptual quality cues instead of enhancement-specific visual fingerprints. To facilitate stable optimization, we adopt a two-stage training strategy that first learns the enhancement-preference space and then performs debiased quality prediction. Extensive experiments on public EIQA benchmarks demonstrate that the proposed method effectively mitigates algorithm-induced representation bias and achieves superior robustness and cross-algorithm generalization compared with existing approaches.

Keywords: No-reference image quality assessment, Enhanced images, Preference-guided debiasing, Supervised contrastive learning, Enhancement-preference embedding, Cross-algorithm generalization, Algorithm-invariant perceptual quality
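
The debiasing step described in the abstract above, estimating an enhancement-induced nuisance component and removing it before quality regression, can be illustrated with a minimal Python sketch. The abstract does not say how the removal is performed; the orthogonal projection used here, and the names `remove_nuisance`, `quality`, and `nuisance`, are assumptions made for illustration and are not the authors' method.

```python
def remove_nuisance(quality, nuisance):
    """Subtract from `quality` its projection onto the `nuisance` direction.

    ASSUMPTION: removal is modeled as an orthogonal projection; the paper's
    abstract only says the nuisance component is estimated and removed.
    """
    dot = sum(q * n for q, n in zip(quality, nuisance))
    norm_sq = sum(n * n for n in nuisance)
    if norm_sq == 0.0:
        # A zero nuisance direction leaves the representation unchanged.
        return list(quality)
    scale = dot / norm_sq
    return [q - scale * n for q, n in zip(quality, nuisance)]


# Toy 2-D representation: the first axis plays the role of the nuisance direction.
debiased = remove_nuisance([3.0, 4.0], [1.0, 0.0])
```

After the projection, the result has no component left along the assumed nuisance direction, which is the property a downstream quality regressor would rely on.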

131. ❌ A Unified Platform and Quality Assurance Framework for 3D Ultrasound Reconstruction with Robotic, Optical, and Electromagnetic Tracking

Authors: Lewis Howell, Manisha Waterston, Tze Min Wah, James H. Chandler, James R. McLaughlan Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20077v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

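Scoring rationale: This paper focuses on a quality assurance framework and an experimental platform for 3D ultrasound reconstruction, involving robotic, optical, and electromagnetic tracking, and belongs to medical image processing. It does not involve large models, the principles of deep learning techniques, or applications of AI in science. Because every scoring keyword concerns large models, deep learning, or AI techniques, the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

This study develops a quality assurance framework and an open-source platform for 3D ultrasound reconstruction that supports robotic, optical, and electromagnetic tracking. Its robotic 3D ultrasound system reaches high reconstruction accuracy (DSC-3D = 0.94 ± 0.01), and the framework provides a reliable validation method for clinical diagnosis and image-guided therapy.

Abstract Translation

Three-dimensional ultrasound imaging can assist disease diagnosis, treatment planning, and image-guided therapy. However, existing studies rarely evaluate the volumetric accuracy and reproducibility of 3D ultrasound comprehensively, which underscores the need for a robust quality assurance system, especially for tracked 3D ultrasound reconstruction based on freehand or robotic scanning. This study proposes a quality assurance framework for 3D ultrasound reconstruction and develops a flexible open-source platform for tracked ultrasound research. A custom phantom containing geometric inclusions with various symmetry properties allows direct evaluation of optical, electromagnetic, and robotic kinematic tracking systems at different scanning speeds and insonation angles. A standardized pipeline performs real-time segmentation and 3D reconstruction of geometric targets without GPU acceleration (DSC = 0.97, FPS = 46), followed by automatic registration and validation against ground-truth geometric models. Applying the framework shows that our robotic 3D ultrasound system achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), close to the spatial resolution limit of the transducer itself. This study establishes a flexible experimental platform for 3D ultrasound reconstruction and a reproducible validation methodology. The proposed framework enables robust cross-platform comparison and improved reporting standards, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.

Abstract

Three-dimensional (3D) Ultrasound (US) can facilitate diagnosis, treatment planning, and image-guided therapy. However, current studies rarely provide a comprehensive evaluation of volumetric accuracy and reproducibility, highlighting the need for robust Quality Assurance (QA) frameworks, particularly for tracked 3D US reconstruction using freehand or robotic acquisition. This study presents a QA framework for 3D US reconstruction and a flexible open source platform for tracked US research. A custom phantom containing geometric inclusions with varying symmetry properties enables straightforward evaluation of optical, electromagnetic, and robotic kinematic tracking for 3D US at different scanning speeds and insonation angles. A standardised pipeline performs real-time segmentation and 3D reconstruction of geometric targets (DSC = 0.97, FPS = 46) without GPU acceleration, followed by automated registration and comparison with ground-truth geometries. Applying this framework showed that our robotic 3D US achieves state-of-the-art reconstruction performance (DSC-3D = 0.94 ± 0.01, HD95 = 1.17 ± 0.12), approaching the spatial resolution limit imposed by the transducer. This work establishes a flexible experimental platform and a reproducible validation methodology for 3D US reconstruction. The proposed framework enables robust cross-platform comparisons and improved reporting practices, supporting the safe and effective clinical translation of 3D ultrasound in diagnostic and image-guided therapy applications.

Keywords: 3D Ultrasound, Quality Assurance, Robotic Tracking, Optical Tracking, Electromagnetic Tracking, Reconstruction, Medical Imaging, Image-guided Therapy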
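
The DSC values reported above are Dice similarity coefficients, which measure the overlap between a predicted segmentation A and a ground-truth segmentation B as 2|A ∩ B| / (|A| + |B|). A minimal Python sketch on flattened binary masks follows; the function name and the toy masks are illustrative and are not taken from the paper's pipeline.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient of two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    # Count voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Sum of the foreground sizes of the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 1, 0, 0, 0]   # toy predicted mask
truth = [0, 1, 1, 1, 0, 0]  # toy ground-truth mask
score = dice_coefficient(pred, truth)
```

Here the masks share two foreground voxels out of three each, so the coefficient is 2·2 / (3 + 3) = 2/3; a value of 0.94 therefore indicates nearly complete overlap.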

132. ❌ MFil-Mamba: Multi-Filter Scanning for Spatial Redundancy-Aware Visual State Space Models

Authors: Puskal Khadka, KC Santosh Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20074v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

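Scoring rationale: This paper focuses on state space models (SSMs) and the Mamba architecture in computer vision, proposing a multi-filter scanning method to address spatial redundancy in visual data. Although it involves deep learning techniques, its core content is unrelated to all of the scoring keywords, which center on large language models, training methods, inference techniques, alignment, agents, and similar topics. The paper does not mention any language models, training techniques, inference methods, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

This paper proposes MFil-Mamba, a visual state space model based on multi-filter scanning. It addresses the spatial redundancy and distorted dependencies that arise when state space models are extended to computer vision, and it outperforms existing state-of-the-art models on benchmarks spanning image classification, object detection, instance segmentation, and semantic segmentation.

Abstract Translation

State Space Models (SSMs), especially the recently proposed Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging because of the non-sequential structure of visual data and its complex 2D spatial dependencies. Although early studies have explored applying selective SSMs to vision tasks, most methods rely mainly on applying multiple traversal strategies to the same input, which introduces redundancy and disrupts the complex spatial relationships within images. To solve these problems, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design lets each scan capture distinctive, context-relevant spatial information while minimizing redundancy. In addition to the architectural improvements, we introduce an adaptive weighting mechanism to effectively fuse the outputs of multiple scans. MFil-Mamba outperforms existing state-of-the-art models on a variety of benchmarks, including image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant reaches 83.2% top-1 accuracy on ImageNet-1K, 47.3% box average precision (box AP) and 42.7% mask average precision (mask AP) on MS COCO, and 48.5% mean intersection over union (mIoU) on the ADE20K dataset. Code and models are released at https://github.com/puskal-khadka/MFil-Mamba.

Abstract

State Space Models (SSMs), especially the recent Mamba architecture, have achieved remarkable success in sequence modeling tasks. However, extending SSMs to computer vision remains challenging due to the non-sequential structure of visual data and its complex 2D spatial dependencies. Although several early studies have explored adapting selective SSMs for vision applications, most approaches primarily depend on employing various traversal strategies over the same input. This introduces redundancy and distorts the intricate spatial relationships within images. To address these challenges, we propose MFil-Mamba, a novel visual state space architecture built on a multi-filter scanning backbone. Unlike fixed multi-directional traversal methods, our design enables each scan to capture unique and contextually relevant spatial information while minimizing redundancy. Furthermore, in addition to architectural enhancements, we incorporate an adaptive weighting mechanism to effectively fuse outputs from multiple scans. MFil-Mamba achieves superior performance over existing state-of-the-art models across various benchmarks that include image classification, object detection, instance segmentation, and semantic segmentation. For example, our tiny variant attains 83.2% top-1 accuracy on ImageNet-1K, 47.3% box AP and 42.7% mask AP on MS COCO, and 48.5% mIoU on the ADE20K dataset. Code and models are available at https://github.com/puskal-khadka/MFil-Mamba.

Keywords: State Space Models, Mamba, Computer Vision, Multi-filter Scanning, Spatial Redundancy, Image Classification, Object Detection, Semantic Segmentation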
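
The adaptive weighting mechanism mentioned in the abstract fuses the outputs of several scans. The abstract does not describe how the weights are computed; the Python sketch below assumes one scalar score per scan, normalized with a softmax, purely for illustration. `fuse_scans`, `scan_scores`, and the toy values are hypothetical and do not reflect the MFil-Mamba implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scans(scan_outputs, scan_scores):
    """Weighted sum of per-scan feature vectors.

    ASSUMPTION: each scan gets one scalar score and the weights are the
    softmax of those scores; the paper may use a different mechanism.
    """
    weights = softmax(scan_scores)
    length = len(scan_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, scan_outputs))
        for i in range(length)
    ]


# Two toy scans with equal scores contribute equally to the fused features.
scan_outputs = [[1.0, 2.0], [3.0, 4.0]]
fused = fuse_scans(scan_outputs, [0.0, 0.0])
```

Raising one scan's score shifts the fused features toward that scan's output, which is the behavior an input-dependent weighting would exploit.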

133. ❌ Layered Quantum Architecture Search for 3D Point Cloud Classification

Authors: Natacha Kuete Meli, Jovita Lukasik, Vladislav Golyanik, Michael Moeller Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20024v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

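Scoring rationale: This paper studies layered quantum architecture search (layered-QAS) for 3D point cloud classification and belongs to quantum machine learning. It is unrelated to the vast majority of keywords, which concern large language models, the principles of deep learning techniques, training methods, inference optimization, and similar topics. Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is somewhat related (relevance 5): the paper applies AI (quantum machine learning) to a scientific computing task (3D point cloud classification), but it is not a core match for bioinformatics or cheminformatics.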
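
The table above shows a keyword with a relevance of 5.0/10 that still scores 0.0. The Python sketch below reproduces that arithmetic under stated assumptions: the pass-score formula (5.0 + 0.8 × total weight, with a total weight of 27.0) and the per-keyword cap of 15 come from the report's configuration, while the rule that a keyword's score equals weight × relevance is an assumption, since the report does not state it.

```python
MAX_SCORE_PER_KEYWORD = 15.0  # from the report's scoring settings


def pass_threshold(total_weight):
    """Pass score as given in the report: 5.0 + 0.8 * total weight."""
    return 5.0 + 0.8 * total_weight


def keyword_score(weight, relevance):
    """ASSUMPTION: score = weight * relevance, capped at the per-keyword maximum."""
    return min(weight * relevance, MAX_SCORE_PER_KEYWORD)


# (weight, relevance) rows as listed for this paper: 26 keywords at 0.0
# relevance and one keyword at 5.0 relevance, all with a listed weight of 0.0.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
total = sum(keyword_score(w, r) for w, r in rows)
threshold = pass_threshold(27.0)
passed = total >= threshold
```

Under this assumption, a listed weight of 0.0 zeroes out any relevance, so the paper's total stays at 0.0 and falls below the 26.6 pass score.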

!!! tip deepseek-chat TL;DR

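This paper proposes layered quantum architecture search (layered-QAS), a method for designing parametrised quantum circuits (PQCs) for 3D point cloud classification, and achieves state-of-the-art results among PQC-based methods on the ModelNet dataset.

Abstract Translation

We propose layered Quantum Architecture Search (layered-QAS), a strategy inspired by classical network morphism that designs Parametrised Quantum Circuit (PQC) architectures by growing and adapting them step by step. PQCs achieve strong expressive power with few parameters, but they lack standard architectural layers (for example, convolution and attention layers) that can encode inductive biases for a specific learning task. To evaluate the effectiveness of our method, we focus on 3D point cloud classification, a challenging and highly structured problem. Previous work on this task used PQCs only as feature extractors for classical classifiers, whereas our method uses the PQC as the core building block of the classification model. Simulation experiments show that our layered-QAS strategy mitigates the barren plateau phenomenon, outperforms quantum-adapted local search and evolutionary quantum architecture search baselines, and achieves leading results among PQC-based methods on the ModelNet dataset.

Abstract

We introduce layered Quantum Architecture Search (layered-QAS), a strategy inspired by classical network morphism that designs Parametrised Quantum Circuit (PQC) architectures by progressively growing and adapting them. PQCs offer strong expressiveness with relatively few parameters, yet they lack standard architectural layers (e.g., convolution, attention) that encode inductive biases for a given learning task. To assess the effectiveness of our method, we focus on 3D point cloud classification as a challenging yet highly structured problem. Whereas prior work on this task has used PQCs only as feature extractors for classical classifiers, our approach uses the PQC as the main building block of the classification model. Simulations show that our layered-QAS mitigates barren plateaus, outperforms quantum-adapted local and evolutionary QAS baselines, and achieves state-of-the-art results among PQC-based methods on the ModelNet dataset.

Keywords: Quantum Architecture Search, Parametrised Quantum Circuit, 3D point cloud classification, network morphism, barren plateau mitigation, ModelNet dataset, quantum machine learning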

134. ❌ Investigating a Policy-Based Formulation for Endoscopic Camera Pose Recovery

Authors: Jan Emily Mangulabnan, Akshat Chauhan, Laura Fleig, Lalithkumar Seenivasan, Roger D. Soberanis-Mukul, S. Swaroop Vedula, Russell H. Taylor, Masaru Ishii, Gregory D. Hager, Mathias Unberath Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20045v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

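Scoring rationale: This paper studies camera pose recovery in endoscopic surgery and proposes a policy-based learning method; it belongs to computer vision and medical AI applications. Its content is unrelated to the vast majority of keywords, which concern the principles of large-model techniques, training methods, inference optimization, and similar topics. Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is somewhat related, because the work applies AI to medical science (endoscopic surgery). It is not core bioinformatics or cheminformatics, however, so it receives a relevance of 5 (somewhat related).

!!! tip deepseek-chat TL;DR

This paper proposes a policy-learning approach to camera pose recovery in endoscopic surgery. Compared with geometric baselines, it achieves the lowest mean translation error and competitive rotational accuracy, and it is less sensitive to low-texture conditions.

Abstract Translation

In endoscopic surgery, surgeons continuously localize the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in light of prior knowledge. Vision-based navigation systems aim to reproduce this ability by recovering camera pose directly from endoscopic video, but most methods do not embody the same principles of reasoning about new frames that surgeons rely on to succeed. Instead, these methods remain based mainly on feature matching and geometric optimization over keyframes, and studies show that their performance degrades under challenging endoscopic imaging conditions such as low texture and rapid illumination changes. This paper explores an alternative and investigates a policy-based method for endoscopic camera pose recovery that aims to imitate how experts estimate trajectories from the previous camera state. Our method directly predicts short-horizon relative motion without maintaining an explicit geometric representation at inference time. By design, it therefore addresses some common difficulties of geometry-based methods, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage caused by reconstruction failure. We evaluate the proposed method on cadaveric sinus endoscopy data. Under oracle state conditioning, we compare short-horizon motion prediction quality with geometric baselines and achieve the lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change, and the results show that the method is less sensitive to low-texture conditions. These findings indicate that a learned motion policy offers a viable alternative for endoscopic camera pose recovery.

Abstract

In endoscopic surgery, surgeons continuously locate the endoscopic view relative to the anatomy by interpreting the evolving visual appearance of the intraoperative scene in the context of their prior knowledge. Vision-based navigation systems seek to replicate this capability by recovering camera pose directly from endoscopic video, but most approaches do not embody the same principles of reasoning about new frames that make surgeons successful. Instead, they remain grounded in feature matching and geometric optimization over keyframes, an approach that has been shown to degrade under the challenging conditions of endoscopic imaging like low texture and rapid illumination changes. Here, we pursue an alternative approach and investigate a policy-based formulation of endoscopic camera pose recovery that seeks to imitate experts in estimating trajectories conditioned on the previous camera state. Our approach directly predicts short-horizon relative motions without maintaining an explicit geometric representation at inference time. It thus addresses, by design, some of the notorious challenges of geometry-based approaches, such as brittle correspondence matching, instability in texture-sparse regions, and limited pose coverage due to reconstruction failure. We evaluate the proposed formulation on cadaveric sinus endoscopy. Under oracle state conditioning, we compare short-horizon motion prediction quality to geometric baselines, achieving the lowest mean translation error and competitive rotational accuracy. We analyze robustness by grouping prediction windows according to texture richness and illumination change, indicating reduced sensitivity to low-texture conditions. These findings suggest that a learned motion policy offers a viable alternative formulation for endoscopic camera pose recovery.

Keywords: endoscopic surgery, camera pose recovery, policy-based formulation, motion prediction, geometric optimization, texture-sparse regions, robustness, cadaveric sinus endoscopy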
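
Predicting short-horizon relative motions conditioned on the previous camera state, as the abstract describes, implies that a trajectory is recovered by chaining those motions onto the current pose. The Python sketch below shows that chaining for planar poses (x, y, heading) to keep it short; a real endoscope pose has six degrees of freedom, and the fixed motion list merely stands in for the outputs of a learned policy. All names are illustrative.

```python
import math


def compose(pose, motion):
    """Apply a relative motion (dx, dy, dtheta), expressed in the camera
    frame, to a planar pose (x, y, theta) expressed in the world frame."""
    x, y, theta = pose
    dx, dy, dtheta = motion
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def rollout(start, motions):
    """Chain relative motions into a list of absolute poses."""
    trajectory = [start]
    for motion in motions:
        trajectory.append(compose(trajectory[-1], motion))
    return trajectory


# Stand-in for policy predictions: move forward and turn left, then move forward.
motions = [(1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)]
trajectory = rollout((0.0, 0.0, 0.0), motions)
```

Because every pose depends on the previous one, errors in individual predictions accumulate along the rollout, which is why short-horizon prediction quality is the quantity the abstract evaluates.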

135. ❌ CFCML: A Coarse-to-Fine Crossmodal Learning Framework For Disease Diagnosis Using Multimodal Images and Tabular Data

Authors: Tianling Liu, Hongying Liu, Fanhua Shang, Lequan Yu, Tong Han, Liang Wan Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20016v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

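Scoring rationale: The paper focuses on a crossmodal learning framework (CFCML) over medical images and tabular data for disease diagnosis. Although it is an application of AI in science (specifically biomedicine), its content is unrelated to the vast majority of keywords, which concern the principles of large-model techniques, training methods, inference optimization, agent systems, and similar topics. Only the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', is related, because the paper applies AI to biomedicine (disease diagnosis). It does not explicitly mention bioinformatics or cheminformatics, and its contribution is a crossmodal learning framework rather than large-model technology, so it receives a relevance of 8 (related but not core).

!!! tip deepseek-chat TL;DR

This paper proposes a coarse-to-fine crossmodal learning framework (CFCML) that narrows the modality gap between medical images and tabular data by exploring multi-granularity feature relationships and hierarchical anchor-based contrastive learning. It outperforms existing methods on disease diagnosis, improving AUC by 1.53% and 0.91% on the MEN and Derm7pt datasets, respectively.

Abstract Translation

In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. A significant modality gap exists between these data types and hinders improvements in crossmodal diagnostic accuracy. Most existing crossmodal learning methods focus mainly on exploring relationships among high-level encoder outputs, which leads them to neglect local information in images. In addition, these methods often fail to fully extract task-relevant information. This paper proposes a novel coarse-to-fine crossmodal learning framework that progressively reduces the modality discrepancy between multimodal images and tabular data by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from different image encoder stages and tabular information to narrow the modality gap preliminarily. At the fine stage, we generate unimodal and crossmodal prototypes that contain class-aware information and establish a hierarchical anchor-based relationship mining strategy to further reduce the modality discrepancy and extract discriminative crossmodal information. This strategy uses modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning methods, effectively increasing inter-class differences while reducing intra-class differences from multiple perspectives. Experimental results show that our method outperforms existing state-of-the-art methods, improving AUC by 1.53% and 0.91% on the MEN and Derm7pt datasets, respectively. The code is available at https://github.com/IsDling/CFCML.

Abstract

In clinical practice, crossmodal information including medical images and tabular data is essential for disease diagnosis. There exists a significant modality gap between these data types, which obstructs advancements in crossmodal diagnostic accuracy. Most existing crossmodal learning (CML) methods primarily focus on exploring relationships among high-level encoder outputs, leading to the neglect of local information in images. Additionally, these methods often overlook the extraction of task-relevant information. In this paper, we propose a novel coarse-to-fine crossmodal learning (CFCML) framework to progressively reduce the modality gap between multimodal images and tabular data, by thoroughly exploring inter-modal relationships. At the coarse stage, we explore the relationships between multi-granularity features from various image encoder stages and tabular information, facilitating a preliminary reduction of the modality gap. At the fine stage, we generate unimodal and crossmodal prototypes that incorporate class-aware information, and establish a hierarchical anchor-based relationship mining (HRM) strategy to further diminish the modality gap and extract discriminative crossmodal information. This strategy utilizes modality samples, unimodal prototypes, and crossmodal prototypes as anchors to develop contrastive learning approaches, effectively enhancing inter-class disparity while reducing intra-class disparity from multiple perspectives. Experimental results indicate that our method outperforms the state-of-the-art (SOTA) methods, achieving improvements of 1.53% and 0.91% in AUC metrics on the MEN and Derm7pt datasets, respectively. The code is available at https://github.com/IsDling/CFCML.

Keywords: crossmodal learning, disease diagnosis, medical images, tabular data, modality gap, contrastive learning, multimodal fusion, clinical AI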
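
The fine stage in the abstract relies on prototypes that carry class-aware information. The abstract does not say how the prototypes are built; a common choice, assumed here only for illustration, is the mean embedding of each class. The Python sketch below uses hypothetical class labels and toy 2-D embeddings and is not the CFCML implementation.

```python
from collections import defaultdict


def class_prototypes(embeddings, labels):
    """Average the embeddings of each class into one prototype per class.

    ASSUMPTION: prototypes are class means; the paper may construct them
    differently.
    """
    grouped = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        grouped[label].append(vector)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


# Toy embeddings with hypothetical diagnostic labels.
embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = ["benign", "benign", "malignant"]
prototypes = class_prototypes(embeddings, labels)
```

Prototypes built this way can serve as anchors in a contrastive objective: samples are pulled toward the prototype of their own class and pushed away from the others.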

136. ❌ Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features

Authors: Zheng Gao, Debin Meng, Yunqi Miao, Zhensong Zhang, Songcen Xu, Ioannis Patras, Jifei Song Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20012v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

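Scoring rationale: The paper studies diffusion-based makeup transfer. It fine-tunes a foundation model (CLIP), so keyword 1 receives a relevance of 8; it uses synthetic data and self-supervised learning for pre-training/domain adaptation, so keyword 5 receives a relevance of 5; and it performs supervised fine-tuning on image pairs, so keyword 6 receives a relevance of 5. The remaining keywords are unrelated to the paper's topic (computer vision, image editing) and receive 0.

!!! tip deepseek-chat TL;DR

This paper proposes a diffusion-based, facial region-aware makeup transfer method. By fine-tuning a CLIP encoder and learning to inject region-aware makeup features, it achieves better regional controllability and makeup transfer performance.

Abstract Translation

Current diffusion-based makeup transfer methods typically use makeup information encoded by off-the-shelf foundation models (such as CLIP) as a condition to preserve the makeup style of the reference image during generation. Although effective, these methods have two main limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles accurately; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, which ignores facial region-aware makeup features (such as eyes and lips) and limits regional controllability for region-specific makeup transfer. To solve these problems, this work proposes Facial Region-Aware Makeup features (FRAM), which comprises two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. In the makeup CLIP fine-tuning stage, unlike previous work that directly uses an off-the-shelf CLIP model, we synthesize annotated makeup style data using GPT-4o and a text-driven image editing model, and then train a makeup CLIP encoder through self-supervised and image-text contrastive learning. In the identity and facial region-aware makeup injection stage, we construct before-and-after makeup image pairs from the images edited in the first stage and use them to learn to inject the identity of the source image and the makeup features of the reference image into the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder and extract facial region-aware makeup features for makeup injection, learned through an attention loss to enable regional control. For identity injection, we use ControlNet Union to encode the source image and its 3D mesh simultaneously. Experimental results verify the superiority of our method in regional controllability and makeup transfer performance.

Abstract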

Current diffusion-based makeup transfer methods commonly use the makeup information encoded by off-the-shelf foundation models (e.g., CLIP) as a condition to preserve the makeup style of the reference image during generation. Although effective, these works have two main limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, overlooking facial region-aware makeup features (e.g., eyes, mouth) and limiting the regional controllability for region-specific makeup transfer. To address these, in this work, we propose Facial Region-Aware Makeup features (FRAM), which has two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works using off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-4o and a text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and then use them to learn to inject the identity of the source image and the makeup of the reference image into the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, which is learned via an attention loss to enable regional control. As for identity injection, we use a ControlNet Union to encode the source image and its 3D mesh simultaneously. The experimental results verify the superiority of our regional controllability and our makeup transfer performance.

Keywords: Diffusion Models, Makeup Transfer, Facial Region-Aware, CLIP Fine-tuning, Self-supervised Learning, ControlNet, Image Editing, Regional Controllability

137. ❌ Evaluating Test-Time Adaptation For Facial Expression Recognition Under Natural Cross-Dataset Distribution Shifts

Authors: John Turnbull, Shivam Grover, Amin Jalali, Ali Etemad Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19994v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

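Scoring rationale: The paper evaluates test-time adaptation methods for deep learning models in computer vision (facial expression recognition) under natural distribution shifts. The keywords all concern large language models, innovations in large-model techniques, or AI for Science applications, none of which the paper addresses, so every keyword receives a relevance of 0.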

!!! tip deepseek-chat TL;DR

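This paper presents the first evaluation of test-time adaptation methods for facial expression recognition under natural cross-dataset distribution shifts. It finds that different TTA methods perform differently depending on distributional distance and noise level, with performance gains of up to 11.34%.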

Abstract (Chinese translation)

深度学习模型在自然分布偏移下常面临性能挑战，这是实际部署中的常见问题。测试时适应方法通过在不依赖标注源数据的情况下，在推理阶段自适应调整模型以应对此问题。本研究首次针对自然领域偏移下的面部表情识别任务，系统评估了多种测试时适应方法，并利用广泛使用的面部表情识别数据集进行了跨数据集实验。该研究超越了传统合成噪声的评估框架，重点考察由数据采集协议、标注标准及人口统计学差异导致的真实世界分布偏移。实验结果表明，在自然偏移场景下，测试时适应方法最高可将面部表情识别性能提升11.34%。当目标分布较为清晰时，基于熵最小化的方法（如TENT和SAR）表现最佳；而在分布差异较大的场景中，基于原型调整的方法（如T3A）更具优势；当目标分布比源分布噪声更显著时，特征对齐方法（如SHOT）能带来最显著的性能提升。我们的跨数据集分析表明，测试时适应的有效性主要受领域间分布距离和自然偏移严重程度的共同影响。

Abstract

Deep learning models often struggle under natural distribution shifts, a common challenge in real-world deployments. Test-Time Adaptation (TTA) addresses this by adapting models during inference without labeled source data. We present the first evaluation of TTA methods for facial expression recognition (FER) under natural domain shifts, performing cross-dataset experiments with widely used FER datasets. This moves beyond synthetic corruptions to examine real-world shifts caused by differing collection protocols, annotation standards, and demographics. Results show TTA can boost FER performance under natural shifts by up to 11.34%. Entropy minimization methods such as TENT and SAR perform best when the target distribution is clean. In contrast, prototype adjustment methods like T3A excel under larger distributional distance scenarios. Finally, feature alignment methods such as SHOT deliver the largest gains when the target distribution is noisier than our source. Our cross-dataset analysis shows that TTA effectiveness is governed by the distributional distance and the severity of the natural shift across domains.

Keywords: Test-Time Adaptation, Facial Expression Recognition, Natural Distribution Shifts, Cross-Dataset Evaluation, Domain Adaptation, Deep Learning, FER Datasets, Entropy Minimization
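
The entropy-minimization family named in the abstract (TENT, SAR) adapts a model at test time by lowering the Shannon entropy of its own predictions on unlabeled target batches. The sketch below only computes that objective with NumPy. The function names and the toy logits are illustrative assumptions, not code from the paper, and the parameter update itself is omitted.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def mean_prediction_entropy(logits: np.ndarray) -> float:
    """Mean Shannon entropy (in nats) of the predicted class distributions."""
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return float(entropy.mean())

# Toy logits over 7 expression classes (hypothetical values).
confident = np.array([[8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
uncertain = np.zeros((1, 7))  # uniform prediction, maximal entropy

print(mean_prediction_entropy(confident) < mean_prediction_entropy(uncertain))
```

A test-time adaptation loop would treat `mean_prediction_entropy` as the loss on each unlabeled target batch and take gradient steps on a small subset of model parameters.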

138. ❌ NEC-Diff: Noise-Robust Event-RAW Complementary Diffusion for Seeing Motion in Extreme Darkness

Authors: Haoyue Liu, Jinghan Xu, Luxin Feng, Hanyu Zhou, Haozhi Zhao, Yi Chang, Luxin Yan Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20005v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

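Scoring rationale: The paper studies NEC-Diff, a diffusion-model framework that combines an event camera with RAW images to reconstruct high-quality images of dynamic scenes under extremely low light. The work belongs to computer vision and computational photography and focuses on sensor fusion, image denoising, and reconstruction. All of the keywords concern large language models (LLMs), innovations in deep learning technique, or AI applications in science. The paper does not address any LLM, deep learning architecture, training method, inference optimization, alignment, agent system, or model compression technique. The only keyword that might apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because the work can be seen as an application of AI to imaging science, but this is not its core, so it receives 5 points (some relevance). The other keywords are entirely unrelated and receive 0 points.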

!!! tip deepseek-chat TL;DR

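This paper proposes NEC-Diff, a diffusion-based framework that fuses event-camera and RAW image data to denoise and reconstruct high-quality images of dynamic scenes under extremely low light, and builds a new low-light dataset, REAL, for validation.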

Abstract (Chinese translation)

在极低光照条件下对动态场景进行高质量成像极具挑战性。光子匮乏会导致严重的噪声和纹理丢失，造成显著的图像退化。事件相机凭借其高动态范围（120 dB）和对运动的高灵敏度，通过为保留细微纹理提供关键线索，成为传统相机的有力补充。然而，现有方法大多侧重于从事件中恢复纹理，而很少关注图像噪声或事件本身固有的噪声，这最终阻碍了在光子匮乏条件下进行准确的像素重建。在本工作中，我们提出了NEC-Diff，一种新颖的基于扩散模型的事件-RAW混合成像框架，旨在从强噪声信号中提取可靠信息以重建精细的场景结构。该框架基于两个关键见解驱动：（1）结合RAW图像的线性光响应特性与事件反映亮度变化的本质，建立一个物理驱动的约束，以实现鲁棒的双模态去噪；（2）基于去噪结果动态估计两种模态的信噪比（SNR），以指导自适应特征融合，从而将可靠线索注入扩散过程，实现高保真的视觉重建。此外，我们构建了REAL（低光下采集的RAW与事件）数据集，该数据集提供了在0.001-0.8勒克斯照度下获取的47,800张像素对齐的低光RAW图像、事件流以及高质量参考图像。大量实验证明了NEC-Diff在极端黑暗环境下的优越性。项目地址为：https://github.com/jinghan-xu/NEC-Diff。

Abstract

High-quality imaging of dynamic scenes in extremely low-light conditions is highly challenging. Photon scarcity induces severe noise and texture loss, causing significant image degradation. Event cameras, featuring a high dynamic range (120 dB) and high sensitivity to motion, serve as powerful complements to conventional cameras by offering crucial cues for preserving subtle textures. However, most existing approaches emphasize texture recovery from events, while paying little attention to image noise or the intrinsic noise of events themselves, which ultimately hinders accurate pixel reconstruction under photon-starved conditions. In this work, we propose NEC-Diff, a novel diffusion-based event-RAW hybrid imaging framework that extracts reliable information from heavily noisy signals to reconstruct fine scene structures. The framework is driven by two key insights: (1) combining the linear light-response property of RAW images with the brightness-change nature of events to establish a physics-driven constraint for robust dual-modal denoising; and (2) dynamically estimating the SNR of both modalities based on denoising results to guide adaptive feature fusion, thereby injecting reliable cues into the diffusion process for high-fidelity visual reconstruction. Furthermore, we construct the REAL (Raw and Event Acquired in Low-light) dataset which provides 47,800 pixel-aligned low-light RAW images, events, and high-quality references under 0.001-0.8 lux illumination. Extensive experiments demonstrate the superiority of NEC-Diff under extreme darkness. The project is available at: https://github.com/jinghan-xu/NEC-Diff.

Keywords: low-light imaging, event camera, RAW image, diffusion model, denoising, sensor fusion, dynamic scene reconstruction, extreme darkness
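
The second insight in the abstract, estimating the SNR of each modality from the denoising result and using it to steer fusion, can be illustrated with a generic SNR-weighted blend. This is a hedged sketch, not the NEC-Diff architecture: the function names, the residual-based SNR estimate, and the linear weighting rule are all assumptions made for illustration.

```python
import numpy as np

def snr_linear(clean_estimate: np.ndarray, noisy: np.ndarray) -> float:
    """Estimate SNR (linear scale) as signal power over residual-noise power.

    The residual between the noisy input and its denoised estimate is
    treated as the noise (an assumption of this sketch).
    """
    noise = noisy - clean_estimate
    return float(np.mean(clean_estimate ** 2) / (np.mean(noise ** 2) + 1e-12))

def snr_weighted_fusion(feat_a, feat_b, snr_a: float, snr_b: float):
    """Blend two feature maps with weights proportional to their SNR."""
    w_a = snr_a / (snr_a + snr_b)
    return w_a * feat_a + (1.0 - w_a) * feat_b

# Toy example: modality A is three times more reliable than modality B.
fused = snr_weighted_fusion(np.ones(3), np.zeros(3), snr_a=3.0, snr_b=1.0)
```

The weighting favours whichever modality currently carries the cleaner signal, which is the intuition the abstract describes; the paper's actual fusion module may differ substantially.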

Authors: Tianbao Zhang, Zhenyu Liang, Zhenbo Song, Nana Wang, Xiaomei Zhang, Xudong Cai, Zheng Zhu, Kejian Wu, Gang Wang, Zhaoxin Fan Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19964v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

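Scoring rationale: The paper focuses on high-resolution 3D geometry prediction in computer vision and proposes 2K Retrofit, an efficient inference framework that improves existing geometric foundation models at 2K resolution through coarse prediction and entropy-based sparse refinement. All scoring keywords relate directly to large language models (LLMs) and associated techniques (training, alignment, inference optimization, agents, and so on) or to AI applications in specific scientific fields (such as bioinformatics). This paper instead studies geometry prediction in 3D vision, which belongs to computer vision and has no direct connection to LLMs or the specified scientific AI applications, so every keyword receives a relevance of 0.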

!!! tip deepseek-chat TL;DR

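To address the high computational cost of foundation models in high-resolution 3D geometry prediction, this paper proposes the 2K Retrofit framework, which uses fast coarse prediction and entropy-guided sparse refinement to achieve efficient, high-accuracy 2K-resolution inference without modifying the backbone model.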

Abstract (Chinese translation)

高分辨率几何预测对于自动驾驶、机器人及增强/混合现实（AR/MR）中的鲁棒感知至关重要，但现有基础模型受限于其扩展到真实世界高分辨率场景的能力，存在根本性制约。直接使用这些模型对2K图像进行推理会产生极高的计算与内存需求，导致实际部署困难。为解决这一问题，我们提出了2K Retrofit，一种新颖的框架，能够为任何几何基础模型实现高效的2K分辨率推理，且无需修改或重新训练主干网络。我们的方法利用快速的粗粒度预测和基于熵的稀疏细化策略，选择性地增强高不确定性区域，从而以最小开销实现精确且高保真的2K输出。在广泛使用的基准测试上进行的大量实验表明，2K Retrofit在精度与速度上均持续达到最先进水平，弥合了高分辨率三维视觉应用中研究进展与可扩展部署之间的差距。代码将在论文录用后公开。

Abstract

High-resolution geometric prediction is essential for robust perception in autonomous driving, robotics, and AR/MR, but current foundation models are fundamentally limited by their scalability to real-world, high-resolution scenarios. Direct inference on 2K images with these models incurs prohibitive computational and memory demands, making practical deployment challenging. To tackle the issue, we present 2K Retrofit, a novel framework that enables efficient 2K-resolution inference for any geometric foundation model, without modifying or retraining the backbone. Our approach leverages fast coarse predictions and an entropy-based sparse refinement to selectively enhance high-uncertainty regions, achieving precise and high-fidelity 2K outputs with minimal overhead. Extensive experiments on widely used benchmarks demonstrate that 2K Retrofit consistently achieves state-of-the-art accuracy and speed, bridging the gap between research advances and scalable deployment in high-resolution 3D vision applications. Code will be released upon acceptance.

Keywords: 3D geometry prediction, high-resolution inference, foundation models, sparse refinement, entropy-guided, computational efficiency, autonomous driving, AR/MR
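
The idea of refining only high-uncertainty regions can be sketched as follows: compute a per-pixel entropy map from a coarse prediction, average it over patches, and keep the most uncertain patches for refinement. This is a minimal illustration under stated assumptions, not the 2K Retrofit implementation. It assumes the coarse prediction is available as a per-pixel probability distribution over `K` bins, and the names `entropy_map` and `select_uncertain_patches`, the patch size, and the top-k rule are all hypothetical.

```python
import numpy as np

def entropy_map(probs: np.ndarray) -> np.ndarray:
    """Per-pixel Shannon entropy of an (H, W, K) probability volume."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def select_uncertain_patches(probs: np.ndarray, patch: int, k: int):
    """Return (row, col) grid indices of the k patches with highest mean entropy."""
    ent = entropy_map(probs)
    h, w = ent.shape
    gh, gw = h // patch, w // patch
    # Average the entropy inside each non-overlapping patch.
    pooled = ent[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    top = np.argsort(pooled.ravel())[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, pooled.shape)) for i in top]

# Toy 8x8 map with 4 bins: confident everywhere except the bottom-right 4x4 patch.
probs = np.zeros((8, 8, 4))
probs[..., 0] = 1.0
probs[4:, 4:, :] = 0.25

selected = select_uncertain_patches(probs, patch=4, k=1)
```

Only the selected patches would be passed to a refinement stage, so the cost of refinement scales with the number of uncertain regions rather than with the full image area.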

140. ❌ Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation

Authors: Nassim Ali Ousalah, Peyman Rostami, Vincent Gaudillière, Emmanuel Koumandakis, Anis Kacem, Enjie Ghorbel, Djamila Aouada Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19961v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

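Scoring rationale: The paper focuses on 6-DoF object pose estimation in computer vision and proposes a new method based on covariance pooling and a manifold-aware network. Its content revolves entirely around computer vision, pose estimation, convolutional feature representation, and geometric optimization, and it does not touch on large language models, innovations in deep learning technique, AI for Science applications, or the other large-model topics named in the review background. All keywords relate to large-model techniques, training methods, inference optimization, alignment, agent systems, scientific AI applications, and similar topics, which are unrelated to this paper's computer vision research, so every keyword receives a relevance of 0.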

!!! tip deepseek-chat TL;DR

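This paper proposes an end-to-end 6-DoF object pose estimation method based on covariance pooling and a symmetric positive definite matrix representation, using a manifold-aware network head to improve the accuracy and robustness of direct pose regression.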

Abstract (Chinese translation)

本文针对单张RGB图像中的六自由度物体姿态估计问题展开研究。基于中间二维关键点预测并结合透视n点求解器的间接方法已展现出卓越性能。直接方法以端到端方式回归姿态，通常计算效率更高但精度较低。然而，直接预测头依赖全局池化特征，忽略了空间二阶统计量在姿态预测中的信息价值。多数情况下，这些方法还采用缺乏鲁棒性的非连续姿态表示。为此，我们提出一种协方差池化表示法，将卷积特征分布编码为对称正定矩阵。此外，我们通过楚列斯基分解提出了一种新型的SPD矩阵姿态编码方式。通过考虑SPD矩阵的黎曼几何特性，我们采用具备流形感知能力的网络头实现端到端的姿态回归。实验与消融研究一致证明了二阶池化与连续表示对直接姿态回归的有效性，该结论在部分遮挡条件下依然成立。

Abstract

In this paper, we address the problem of 6-DoF object pose estimation from a single RGB image. Indirect methods that typically predict intermediate 2D keypoints, followed by a Perspective-n-Point solver, have shown great performance. Direct approaches, which regress the pose in an end-to-end manner, are usually computationally more efficient but less accurate. However, direct heads rely on globally pooled features, ignoring spatial second-order statistics despite their informativeness in pose prediction. They also predict, in most cases, discontinuous pose representations that lack robustness. Herein, we therefore propose a covariance-pooled representation that encodes convolutional feature distributions as a symmetric positive definite (SPD) matrix. Moreover, we propose a novel pose encoding in the form of an SPD matrix via its Cholesky decomposition. Pose is then regressed in an end-to-end manner with a manifold-aware network head, taking into account the Riemannian geometry of SPD matrices. Experiments and ablations consistently demonstrate the relevance of second-order pooling and continuous representations for direct pose regression, including under partial occlusion.

Keywords: 6-DoF object pose estimation, covariance pooling, symmetric positive definite matrix, Cholesky decomposition, manifold-aware network, direct pose regression, Riemannian geometry, partial occlusion
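
The covariance-pooled representation described in the abstract can be illustrated in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function name `covariance_pool`, the `eps` regularizer, and the random toy features are assumptions. It does not reproduce the paper's pose encoding; the Cholesky call only demonstrates that the pooled matrix is positive definite and factors as a lower-triangular matrix times its transpose.

```python
import numpy as np

def covariance_pool(features: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Map a (C, H, W) feature map to a (C, C) SPD covariance matrix."""
    c = features.shape[0]
    x = features.reshape(c, -1)              # C channels, N = H*W spatial samples
    x = x - x.mean(axis=1, keepdims=True)    # centre each channel
    cov = x @ x.T / (x.shape[1] - 1)         # second-order statistics across positions
    return cov + eps * np.eye(c)             # regularise so the matrix is strictly SPD

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 5, 5))            # toy convolutional features
spd = covariance_pool(feat)
chol = np.linalg.cholesky(spd)               # raises LinAlgError if spd is not positive definite
```

Unlike global average pooling, which keeps one number per channel, the pooled matrix retains how channels co-vary across spatial positions, which is the second-order information the abstract refers to.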

141. ❌ MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Authors: Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19993v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

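Scoring rationale: The paper mainly evaluates the visual grounding ability of multimodal large language models in clinical GUI environments; its core contribution is MedSPOT, a workflow-aware sequential grounding benchmark. Relevance to the keywords: 1) highly relevant to 'AI for Science' (10 points), because the paper focuses on AI applications in medical software environments; 2) related to 'Chain of Thought' and 'System 2 Thinking' (8 points), because it emphasizes sequential reasoning and workflow-driven multi-step decisions; 3) somewhat related to 'LLM Agents' (5 points), because it involves task execution in dynamic interfaces; 4) somewhat related to 'Large Language Models' (5 points), because it evaluates MLLMs; 5) the remaining keywords, such as MoE, quantization, and alignment, are unrelated to the paper's technical content and receive 0 points.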

!!! tip deepseek-chat TL;DR

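To address the lack of reliable visual grounding by multimodal large language models in clinical software environments, this paper proposes MedSPOT, a workflow-aware sequential grounding benchmark that uses a strict sequential evaluation protocol and a systematic failure taxonomy to assess models' multi-step reasoning in medical workflows.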

Abstract (Chinese translation)

尽管多模态大语言模型（MLLMs）发展迅速，其在高风险临床软件环境中执行可靠视觉定位的能力仍未得到充分探索。现有的图形用户界面（GUI）基准测试主要关注孤立、单步的定位查询，忽视了真实医疗界面中所需的顺序性、工作流驱动的推理过程，而实际任务往往在独立步骤和动态界面状态间演进。我们提出了MedSPOT，一个面向临床GUI环境的工作流感知顺序定位基准。与以往将定位视为独立预测任务的基准不同，MedSPOT将程序化交互建模为一系列结构化空间决策序列。该基准包含216个任务驱动视频及597个标注关键帧，每个任务由真实医疗工作流中2至3个相互依赖的定位步骤组成。这一设计捕捉了动态条件下界面层级结构、上下文依赖关系以及细粒度空间精度。为评估程序稳健性，我们提出一种严格的顺序评估协议，即在首次出现错误定位预测时终止任务评估，从而明确衡量多步骤工作流中的错误传播。我们进一步引入了一套全面的故障分类体系，包括边缘偏差、小目标错误、无预测、近距偏差、远距偏差及工具栏混淆，以系统诊断模型在临床GUI环境中的行为。通过将评估重点从孤立定位转向工作流感知的顺序推理，MedSPOT为评估医疗软件环境中的多模态模型建立了一个真实且安全关键的基准。代码与数据发布于：https://github.com/Tajamul21/MedSPOT。

Abstract

Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on isolated, single-step grounding queries, overlooking the sequential, workflow-driven reasoning required in real-world medical interfaces, where tasks evolve across independent steps and dynamic interface states. We introduce MedSPOT, a workflow-aware sequential grounding benchmark for clinical GUI environments. Unlike prior benchmarks that treat grounding as a standalone prediction task, MedSPOT models procedural interaction as a sequence of structured spatial decisions. The benchmark comprises 216 task-driven videos with 597 annotated keyframes, in which each task consists of 2 to 3 interdependent grounding steps within realistic medical workflows. This design captures interface hierarchies, contextual dependencies, and fine-grained spatial precision under evolving conditions. To evaluate procedural robustness, we propose a strict sequential evaluation protocol that terminates task assessment upon the first incorrect grounding prediction, explicitly measuring error propagation in multi-step workflows. We further introduce a comprehensive failure taxonomy, including edge bias, small-target errors, no prediction, near miss, far miss, and toolbar confusion, to enable systematic diagnosis of model behavior in clinical GUI settings. By shifting evaluation from isolated grounding to workflow-aware sequential reasoning, MedSPOT establishes a realistic and safety-critical benchmark for assessing multimodal models in medical software environments. Code and data are available at: https://github.com/Tajamul21/MedSPOT.

Keywords: Multimodal Large Language Models, visual grounding, clinical GUI, sequential reasoning, workflow-aware benchmark, medical software, error propagation, failure taxonomy
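
The strict sequential protocol in the abstract, which stops assessing a task at the first incorrect grounding prediction, can be sketched as below. This is an illustrative reading, not the benchmark's released evaluation code: the point-in-box hit criterion, the type aliases, and the function names are assumptions.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def hit(pred: Point, box: Box) -> bool:
    """Assumed criterion: a prediction is correct if it falls inside the target box."""
    x, y = pred
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def steps_completed(preds: List[Point], targets: List[Box]) -> int:
    """Count correct steps, terminating at the first incorrect prediction."""
    done = 0
    for pred, box in zip(preds, targets):
        if not hit(pred, box):
            break  # later steps are not credited, even if they would have hit
        done += 1
    return done

def task_success(preds: List[Point], targets: List[Box]) -> bool:
    """A task succeeds only if every step in the workflow is grounded correctly."""
    return steps_completed(preds, targets) == len(targets)

# Toy three-step task: the second prediction misses, the third would have hit.
targets = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
preds = [(5, 5), (0, 0), (45, 45)]
```

Here `steps_completed(preds, targets)` credits only the first step, which shows how an early error caps the score regardless of later accuracy.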

142. ❌ Timestep-Aware Block Masking for Efficient Diffusion Model Inference

Authors: Haodong He, Yuan Gao, Weizhong Zhang, Gui-Song Xia Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19939v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

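Scoring rationale: The paper focuses on inference efficiency for diffusion probabilistic models (DPMs) and proposes a timestep-aware block masking method that dynamically decides which computational blocks to execute or bypass at each stage of the denoising process. Its core content covers diffusion models, computational graph optimization, feature reuse, and inference acceleration, but all of the given keywords relate to large language models (LLMs) and associated techniques (such as MoE, SFT, RLHF, RAG, CoT, quantization, and agents) or to AI applications in specific scientific fields (such as bioinformatics). Because the paper does not address any LLM technique, LLM training method, alignment, inference optimization, agent system, or scientific AI application, every keyword receives a relevance of 0.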

!!! tip deepseek-chat TL;DR

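To address the high inference latency of diffusion probabilistic models, this paper proposes a timestep-aware block masking framework that dynamically optimizes the computation path at each timestep, substantially improving sampling efficiency while preserving generation quality.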

Abstract (Chinese translation)

扩散概率模型在图像生成领域取得了巨大成功，但其迭代去噪的特性导致推理延迟较高。受去噪轨迹中动态特征演化的启发，我们提出了一种新颖的框架，用于在预训练的扩散概率模型上基于每个时间步优化计算图。通过学习时间步特定的掩码，我们的方法能在每个推理阶段动态决定哪些模块需要执行或通过特征复用来跳过。与那些通过全链反向传播导致极高内存成本的全局优化方法不同，我们的方法独立优化每个时间步的掩码，确保了内存高效训练过程。为引导这一过程，我们引入了时间步感知的损失缩放机制，该机制在敏感的去噪阶段优先保障特征保真度，并辅以知识引导的掩码修正策略来剪枝冗余的时空依赖。我们的方法与模型架构无关，并在包括DDPM、LDM、DiT和PixArt在内的广泛模型上展现了显著的效率提升。实验结果表明，通过将去噪过程视为一系列优化计算路径，我们的方法在采样速度与生成质量之间实现了更优的平衡。代码将公开。

Abstract

Diffusion Probabilistic Models (DPMs) have achieved great success in image generation but suffer from high inference latency due to their iterative denoising nature. Motivated by the evolving feature dynamics across the denoising trajectory, we propose a novel framework to optimize the computational graph of pre-trained DPMs on a per-timestep basis. By learning timestep-specific masks, our method dynamically determines which blocks to execute or bypass through feature reuse at each inference stage. Unlike global optimization methods that incur prohibitive memory costs via full-chain backpropagation, our method optimizes masks for each timestep independently, ensuring a memory-efficient training process. To guide this process, we introduce a timestep-aware loss scaling mechanism that prioritizes feature fidelity during sensitive denoising phases, complemented by a knowledge-guided mask rectification strategy to prune redundant spatial-temporal dependencies. Our approach is architecture-agnostic and demonstrates significant efficiency gains across a broad spectrum of models, including DDPM, LDM, DiT, and PixArt. Experimental results show that by treating the denoising process as a sequence of optimized computational paths, our method achieves a superior balance between sampling speed and generative quality. Our code will be released.

Keywords: Diffusion Probabilistic Models, inference efficiency, timestep-aware masking, computational graph optimization, feature reuse, denoising trajectory, model acceleration, generative quality
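
The execute-or-bypass mechanism in the abstract can be illustrated with a toy loop in which a per-timestep binary mask decides whether each block runs or reuses its cached output. This is a sketch under simplifying assumptions, not the paper's method: the masks here are written by hand rather than learned, the blocks are plain Python callables, and the name `run_with_masks` is hypothetical.

```python
def run_with_masks(x, blocks, masks, num_steps):
    """Run `num_steps` passes over `blocks`.

    masks[t][b] == 1 executes block b at step t; masks[t][b] == 0 bypasses it
    and reuses the output cached from the last time the block ran.
    Returns the final value and the number of block executions.
    """
    cache = [None] * len(blocks)
    executed = 0
    for t in range(num_steps):
        h = x
        for b, block in enumerate(blocks):
            if masks[t][b] or cache[b] is None:
                cache[b] = block(h)   # execute the block and refresh its cache
                executed += 1
            h = cache[b]              # bypassed blocks contribute cached features
        x = h
    return x, executed

# Two toy blocks and two timesteps; block 0 is bypassed at the second step.
blocks = [lambda h: h + 1, lambda h: h * 2]
masks = [[1, 1], [0, 1]]
result, executed = run_with_masks(1.0, blocks, masks, num_steps=2)
```

With these masks three block executions replace the four a dense schedule would need; the learned masks in the paper aim to find such savings where reused features cost little in output quality.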

143. ❌ SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images

Authors: Jinyuan Qu, Hongyang Li, Lei Zhang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19926v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

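Scoring rationale: SegVGGT focuses on 3D reconstruction and instance segmentation in computer vision and uses the visual geometry grounded transformer (VGGT) architecture, which is unrelated to every keyword on large language model (LLM) techniques. The only keyword that might apply is 'AI for Science OR Bioinformatics OR Cheminformatics', because 3D vision has potential applications in scientific fields (such as biomedical imaging and materials science), but the paper does not explicitly address those fields, so it receives 5 points (some relevance). The other keywords all concern large-model principles, training methods, inference optimization, alignment, agent systems, and similar topics, which have no direct connection to this paper's vision transformer and 3D tasks.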

!!! tip deepseek-chat TL;DR

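This paper proposes SegVGGT, a unified end-to-end framework that performs feed-forward 3D reconstruction and instance segmentation simultaneously, directly from multi-view RGB images. By introducing object queries that interact with geometric features, together with a frame-level attention distribution alignment strategy, it achieves state-of-the-art performance on the ScanNet datasets.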

Abstract (Chinese translation)

三维实例分割方法通常依赖高质量点云或配准的RGB-D扫描数据，需要复杂多阶段处理流程，且对重建噪声高度敏感。尽管前馈式Transformer模型已革新多视图三维重建领域，但其仍与高层语义理解相脱节。本研究提出SegVGGT——一个统一的端到端框架，能够直接从多视角RGB图像同步完成前馈式三维重建与实例分割。通过引入与多层级几何特征交互的对象查询机制，我们的方法将实例识别深度集成于视觉几何基座Transformer中。为应对海量全局图像标记导致的严重注意力分散问题，我们提出帧级注意力分布对齐策略。该策略在训练过程中显式引导对象查询聚焦于实例相关帧，在不增加推理开销的前提下提供结构化监督。大量实验表明，SegVGGT在ScanNetv2和ScanNet200数据集上达到最先进性能，优于近期联合模型及基于RGB-D的方法，同时在ScanNet++数据集上展现出强大的泛化能力。

Abstract

3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.

Keywords: 3D instance segmentation, multi-view reconstruction, visual geometry grounded transformer, object queries, Frame-level Attention Distribution Alignment, end-to-end framework, ScanNet, SegVGGT

144. ❌ LIORNet: Self-Supervised LiDAR Snow Removal Framework for Autonomous Driving under Adverse Weather Conditions

Authors: Ji-il Park, Inwook Shim Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19936v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

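Scoring rationale: LIORNet addresses snow-noise removal for LiDAR sensors in autonomous driving, using a self-supervised learning framework built on U-Net++. Its core content covers computer vision, point cloud processing, and autonomous driving perception. The scoring keywords all target large language models, their training and inference techniques, or AI for Science applications, whereas this paper studies a sensor-specific data processing method and touches none of them, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes LIORNet, a self-supervised framework for removing snow noise from LiDAR point clouds. It generates pseudo-labels from physical and statistical cues, which lets it separate noise points from environmental structures without manual annotation, and it outperforms existing methods in accuracy and runtime on the WADS and CADC datasets.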

摘要翻译

激光雷达传感器能够提供高分辨率的三维感知与远距离探测能力，已成为自动驾驶和机器人领域不可或缺的组成部分。然而，在雪、雨、雾等恶劣天气条件下，其性能会显著下降，此时点云中充斥着大量虚假噪声点，导致感知错误。为解决这一问题，学界已提出多种方法：包括利用空间稀疏性的基于距离的滤波器、利用反射率分布的基于强度的滤波器，以及能够适应复杂环境的学习型方法。然而，基于距离的方法难以有效区分真实物体点与噪声点；基于强度的方法通常依赖固定阈值，缺乏对动态变化条件的适应性；而学习型方法则面临标注成本高昂、泛化能力有限及计算开销大的问题。在本研究中，我们提出了LIORNet，它消除了上述缺陷，并融合了三种范式的优势。LIORNet基于U-Net++主干网络构建，采用了一种自监督学习策略，该策略由多种物理与统计线索生成的伪标签引导，这些线索包括与距离相关的强度阈值、雪的反射特性、点云稀疏性以及传感范围约束。这一设计使得LIORNet能够在无需人工标注的情况下区分噪声点与环境结构，从而克服了雪天标注的困难以及单一原理方法的局限性。在WADS和CADC数据集上进行的大量实验表明，LIORNet在准确性和运行时间上均优于当前最先进的滤波算法，同时能保留关键的环境特征。这些结果凸显了LIORNet作为一种实用且鲁棒的解决方案，在极端天气下为激光雷达感知提供了有力支持，并具备在自动驾驶系统中实时部署的强大潜力。

Abstract

LiDAR sensors provide high-resolution 3D perception and long-range detection, making them indispensable for autonomous driving and robotics. However, their performance significantly degrades under adverse weather conditions such as snow, rain, and fog, where spurious noise points dominate the point cloud and lead to false perception. To address this problem, various approaches have been proposed: distance-based filters exploiting spatial sparsity, intensity-based filters leveraging reflectance distributions, and learning-based methods that adapt to complex environments. Nevertheless, distance-based methods struggle to distinguish valid object points from noise, intensity-based methods often rely on fixed thresholds that lack adaptability to changing conditions, and learning-based methods suffer from the high cost of annotation, limited generalization, and computational overhead. In this study, we propose LIORNet, which eliminates these drawbacks and integrates the strengths of all three paradigms. LIORNet is built upon a U-Net++ backbone and employs a self-supervised learning strategy guided by pseudo-labels generated from multiple physical and statistical cues, including range-dependent intensity thresholds, snow reflectivity, point sparsity, and sensing range constraints. This design enables LIORNet to distinguish noise points from environmental structures without requiring manual annotations, thereby overcoming the difficulty of snow labeling and the limitations of single-principle approaches. Extensive experiments on the WADS and CADC datasets demonstrate that LIORNet outperforms state-of-the-art filtering algorithms in both accuracy and runtime while preserving critical environmental features. These results highlight LIORNet as a practical and robust solution for LiDAR perception in extreme weather, with strong potential for real-time deployment in autonomous driving systems.

Keywords: LiDAR, snow removal, autonomous driving, self-supervised learning, point cloud, adverse weather, U-Net++, real-time deployment
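The abstract lists the cues behind LIORNet's pseudo-labels (range-dependent intensity thresholds, point sparsity, sensing range constraints) without giving formulas. The following is a minimal sketch of how such cues could be combined into a per-point pseudo-label. The function name `pseudo_label_snow`, every threshold value, the exponential decay form, and the AND-combination of cues are illustrative assumptions, not the authors' method.

```python
import math


def pseudo_label_snow(points, max_range=30.0, base_intensity=0.1,
                      decay=0.05, radius=0.5, min_neighbors=2):
    """Return one 0/1 pseudo-label per point (1 = likely snow noise).

    points: list of (x, y, z, intensity) tuples.
    All parameters are hypothetical values chosen for illustration.
    """
    labels = []
    for i, (x, y, z, intensity) in enumerate(points):
        rng = math.sqrt(x * x + y * y + z * z)
        # Cue 1 (assumed form): snow returns are only expected near the sensor.
        in_range = rng <= max_range
        # Cue 2 (assumed form): the intensity threshold shrinks with range.
        low_intensity = intensity < base_intensity * math.exp(-decay * rng)
        # Cue 3: sparsity, measured as the neighbor count within `radius`.
        neighbors = sum(
            1 for j, (px, py, pz, _) in enumerate(points)
            if j != i and math.dist((x, y, z), (px, py, pz)) <= radius
        )
        sparse = neighbors < min_neighbors
        labels.append(1 if (in_range and low_intensity and sparse) else 0)
    return labels
```

The neighbor search here is quadratic in the number of points, which is acceptable for a sketch but not for full LiDAR scans; a spatial index would be the usual replacement.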

145. ❌ ReconMIL: Synergizing Latent Space Reconstruction with Bi-Stream Mamba for Whole Slide Image Analysis

Authors: Lubin Gan, Jing Zhang, Heng Zhang, Xin Di, Zhifeng Wang, Wenke Huang, Xiaoyan Sun Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19925v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

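Scoring rationale: The paper studies medical image analysis on whole slide images (WSI), which falls under AI for Science (bioinformatics / medical imaging), so it is highly relevant to 'AI for Science OR Bioinformatics OR Cheminformatics' (10). It builds on large-scale foundation models and Mamba sequence modeling, which relates to 'Large Language Models OR LLMs OR Foundation Models' (8). Its Latent Space Reconstruction module adapts generic features to the task domain, which relates to 'Pre-training OR Continual Pre-training OR Domain Adaptation' (8). The remaining keywords, such as MoE, SLMs, Scaling Laws, SFT, RLHF, RAG, attention optimization, reasoning methods, agents, and model compression, are not covered in the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper targets two problems in whole slide image analysis: the domain gap caused by task-agnostic features and the over-smoothing caused by global aggregators. It proposes ReconMIL, which combines latent space reconstruction with a bi-stream Mamba-CNN architecture, outperforms existing methods on multiple diagnostic and survival prediction benchmarks, and localizes fine-grained diagnostic regions while suppressing background noise.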

摘要翻译

全切片图像（Whole slide image, WSI）分析在很大程度上依赖于多示例学习（multiple instance learning, MIL）。尽管近期方法受益于大规模基础模型和先进的序列建模技术以捕捉长距离依赖关系，但仍面临两个关键问题。首先，由于与特定组织学任务存在领域差异，直接应用冻结的、任务无关的特征往往导致可分性欠佳。其次，仅依赖全局聚合器可能引发过度平滑现象，即稀疏但关键的诊断信号被占主导地位的背景信息所掩盖。本文提出ReconMIL这一新颖框架，旨在弥合领域差异并平衡全局-局部特征聚合。我们的方法引入了潜在空间重建模块，能够自适应地将通用特征投影至紧凑的任务特定流形，从而改善边界划分。为防止信息稀释，我们设计了双流架构：结合基于Mamba的全局流以获取上下文先验，以及基于CNN的局部流以保留细微的形态学异常。尺度自适应选择机制动态融合这两个流，决定何时依赖整体架构或局部显著性。在多个诊断和生存预测基准上的评估表明，ReconMIL始终优于当前最先进方法，能有效定位细粒度诊断区域并抑制背景噪声。可视化结果证实了该模型通过有效平衡全局结构与局部粒度，在定位诊断区域方面具有卓越能力。

Abstract

Whole slide image (WSI) analysis heavily relies on multiple instance learning (MIL). While recent methods benefit from large-scale foundation models and advanced sequence modeling to capture long-range dependencies, they still struggle with two critical issues. First, directly applying frozen, task-agnostic features often leads to suboptimal separability due to the domain gap with specific histological tasks. Second, relying solely on global aggregators can cause over-smoothing, where sparse but critical diagnostic signals are overshadowed by the dominant background context. In this paper, we present ReconMIL, a novel framework designed to bridge this domain gap and balance global-local feature aggregation. Our approach introduces a Latent Space Reconstruction module that adaptively projects generic features into a compact, task-specific manifold, improving boundary delineation. To prevent information dilution, we develop a bi-stream architecture combining a Mamba-based global stream for contextual priors and a CNN-based local stream to preserve subtle morphological anomalies. A scale-adaptive selection mechanism dynamically fuses these two streams, determining when to rely on overall architecture versus local saliency. Evaluations across multiple diagnostic and survival prediction benchmarks show that ReconMIL consistently outperforms current state-of-the-art methods, effectively localizing fine-grained diagnostic regions while suppressing background noise. Visualization results confirm the model's superior ability to localize diagnostic regions by effectively balancing global structure and local granularity.

Keywords: Whole Slide Image Analysis, Multiple Instance Learning, Foundation Models, Mamba, Latent Space Reconstruction, Bi-stream Architecture, Domain Adaptation, Medical Image Diagnosis

146. ❌ PanORama: Multiview Consistent Panoptic Segmentation in Operating Rooms

Authors: Tuna Gürbüz, Ege Özsoy, Tony Danjun Wang, Nassir Navab Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19920v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

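Scoring rationale: The paper focuses on panoptic segmentation in operating rooms, a computer vision task, and uses a multiview-consistent method to handle occlusion. Nearly all keywords (large-model principles, training methods, inference optimization, alignment, and so on) target language models or general AI techniques and are unrelated to this purely computer vision work. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because the operating-room application belongs to the medical domain. The paper does not put forward an AI-for-Science methodology, however; only the application setting is related, so it receives 5 (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes PanORama, which models cross-view interactions at the feature level in a single forward pass to achieve multiview-consistent panoptic segmentation in operating rooms without camera calibration. It reaches more than 70% Panoptic Quality on the MM-OR and 4D-OR datasets, outperforming prior work.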

摘要翻译

手术室（ORs）是杂乱、动态且高度遮挡的环境，在复杂的手术流程中，可靠的空间理解对于情境感知至关重要。从稀疏多视角图像中实现全景分割的空间理解提出了根本性挑战，因为部分视角的有限可见性常导致跨摄像头的错误预测。为此，我们提出了PanORama，这是首个通过设计实现多视角一致性的手术室全景分割方法。通过在骨干网络内部以单次前向传播的方式在特征层面建模跨视角交互，视角一致性直接产生，而无需后处理优化。我们在MM-OR和4D-OR数据集上进行评估，取得了超过70%的全景质量（Panoptic Quality, PQ）性能，并超越了先前的最优方法。重要的是，PanORama无需校准，不依赖相机参数，并能在推理时泛化到任何多视角配置中未见的相机视点。通过显著增强多视角分割及由此带来的手术室空间理解，我们相信该方法为手术感知与辅助开启了新的机遇。代码将在论文录用后公开。

Abstract

Operating rooms (ORs) are cluttered, dynamic, highly occluded environments, where reliable spatial understanding is essential for situational awareness during complex surgical workflows. Achieving spatial understanding for panoptic segmentation from sparse multiview images poses a fundamental challenge, as limited visibility in a subset of views often leads to mispredictions across cameras. To this end, we introduce PanORama, the first panoptic segmentation for the operating room that is multiview-consistent by design. By modeling cross-view interactions at the feature level inside the backbone in a single forward pass, view consistency emerges directly rather than through post-hoc refinement. We evaluate on the MM-OR and 4D-OR datasets, achieving >70% Panoptic Quality (PQ) performance, and outperforming the previous state of the art. Importantly, PanORama is calibration-free, requiring no camera parameters, and generalizes to unseen camera viewpoints within any multiview configuration at inference time. By substantially enhancing multiview segmentation and, consequently, spatial understanding in the OR, we believe our approach opens new opportunities for surgical perception and assistance. Code will be released upon acceptance.

Keywords: panoptic segmentation, operating rooms, multiview consistency, surgical perception, computer vision, feature-level interaction, calibration-free, spatial understanding
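The abstract reports results in Panoptic Quality (PQ) but does not spell out the metric. As a reference point, the standard definition sums the IoU of matched segment pairs (IoU > 0.5) and divides by TP + 0.5·FP + 0.5·FN. The sketch below computes it for a single class. Representing segments as sets of pixel ids and the function name `panoptic_quality` are assumptions made for illustration; the paper's evaluation code may differ, for example by averaging over classes.

```python
def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class; every segment is a set of pixel ids.

    Assumes segments within each list do not overlap, as in panoptic
    segmentation, so an IoU above 0.5 identifies a unique match.
    """
    matched_gt = set()
    iou_sum = 0.0
    tp = 0
    for pred in pred_segments:
        for g_idx, gt in enumerate(gt_segments):
            if g_idx in matched_gt:
                continue
            union = len(pred | gt)
            iou = len(pred & gt) / union if union else 0.0
            if iou > 0.5:
                matched_gt.add(g_idx)
                iou_sum += iou
                tp += 1
                break
    fp = len(pred_segments) - tp  # predictions with no match
    fn = len(gt_segments) - tp    # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0
```

For example, one matched pair with IoU 0.75 plus one unmatched prediction and one unmatched ground-truth segment gives 0.75 / (1 + 0.5 + 0.5) = 0.375.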

147. ❌ SIMPLER: Efficient Foundation Model Adaptation via Similarity-Guided Layer Pruning for Earth Observation

Authors: Víctor Barreiro, Johannes Jakubik, Francisco Argüello, Dora B. Heras Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19873v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	8.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

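Scoring rationale: The paper proposes SIMPLER, a method for efficient adaptation of Earth Observation foundation models that reduces model depth before fine-tuning through similarity-guided layer pruning. Relevant keywords: 1) 'AI for Science' (10): applied directly to Earth Observation; 2) 'Pre-training/Domain Adaptation' (8): adapts pre-trained models to a specific domain; 3) 'Post-training/SFT' (8): optimizes the fine-tuning process; 4) 'PEFT' (8): relates to parameter-efficient fine-tuning; 5) 'Quantization/Model Compression' (8): compresses the model through pruning; 6) 'Foundation Models' (5): uses foundation models; 7) 'Inference Acceleration' (5): speeds up inference. Other keywords such as MoE, LLM Agents, and RAG are unrelated to the paper.

!!! tip deepseek-chat TL;DR

The paper addresses the high computational cost of fine-tuning Earth Observation foundation models. It proposes SIMPLER, which reduces model depth before fine-tuning through similarity-guided layer pruning; on Prithvi-EO-2 it prunes up to 79% of parameters while retaining 94% of baseline performance, with a 2.6x inference speedup.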

摘要翻译

针对地球观测任务微调基础模型的计算成本高昂，其训练与部署过程对训练时间和内存资源均有较高要求。参数高效方法虽能降低训练成本，但推理阶段仍保持完整的计算复杂度；而事后压缩技术则需在完成昂贵的完整微调后才能优化推理效率。本文提出SIMPLER方法——一种在微调前进行架构选择的方案，通过在模型适配前识别有效网络深度来降低推理与部署成本。该方法基于预训练视觉Transformer深层表征趋于稳定的特性：利用未标注任务数据计算层间表征相似度，通过自动化评分函数识别冗余层，整个过程无需梯度计算、无需依赖幅度启发式规则且无需超参数调优。在Prithvi-EO-2模型上的实验表明，SIMPLER可剪裁高达79%的参数同时保持94%的基线性能，实现2.1倍训练加速与2.6倍推理加速。该方法可泛化至TerraMind（多模态地球观测基础模型）及ImageNet预训练的ViT-MAE架构，证明其在不同任务、模型架构和光谱模态间的普适性。代码已发布于https://gitlab.citius.gal/hpc4rs/simpler。

Abstract

Fine-tuning foundation models for Earth Observation is computationally expensive, with high training time and memory demands for both training and deployment. Parameter-efficient methods reduce training cost but retain full inference complexity, while post-hoc compression optimizes inference only after costly full fine-tuning. We introduce SIMPLER, a pre-fine-tuning architecture selection method that reduces inference and deployment costs by identifying an effective model depth before adaptation. SIMPLER exploits stabilization of representations in deeper layers of pre-trained vision transformers: it computes layer-wise representation similarity on unlabeled task data and applies an automated scoring function to select redundant layers, with no gradients, magnitude heuristics, or hyperparameter tuning required. On Prithvi-EO-2, SIMPLER prunes up to 79% of parameters while retaining 94% of baseline performance, yielding a 2.1x training speedup and 2.6x inference speedup. The method generalizes to TerraMind (a multimodal EO foundation model) and ImageNet-pretrained ViT-MAE, demonstrating applicability across tasks, architectures, and spectral modalities. Code is available at https://gitlab.citius.gal/hpc4rs/simpler.

Keywords: Foundation Model Adaptation, Layer Pruning, Earth Observation, Parameter-efficient Fine-tuning, Vision Transformers, Inference Acceleration, Model Compression, Pre-fine-tuning Architecture Selection
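The abstract describes SIMPLER as computing layer-wise representation similarity on unlabeled data and scoring layers for redundancy, but gives no formulas. The sketch below illustrates the general idea only: trailing layers whose output barely differs from their input are candidates for removal. Cosine similarity, the fixed `threshold`, and the function names are stand-in assumptions; the paper states that its scoring function is automated and needs no hyperparameter tuning, so its actual criterion differs from this one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def effective_depth(layer_features, threshold=0.99):
    """Estimate how many leading layers to keep.

    layer_features[l][n] is the feature vector of sample n after layer l.
    `threshold` is a hypothetical cut-off used only in this sketch.
    Returns (layers_to_keep, per-transition mean similarities).
    """
    num_layers = len(layer_features)
    sims = []
    for l in range(num_layers - 1):
        vals = [cosine(a, b)
                for a, b in zip(layer_features[l], layer_features[l + 1])]
        sims.append(sum(vals) / len(vals))
    keep = num_layers
    # Drop trailing layers while the transition into them is near-identity.
    while keep > 1 and sims[keep - 2] >= threshold:
        keep -= 1
    return keep, sims
```

Because the similarities come from forward passes on unlabeled samples, a procedure of this shape needs no gradients, which matches the property the abstract emphasizes.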

Authors: Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19862v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

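Scoring rationale: The paper studies intra-modal misalignment in CLIP, focusing on an analysis of the projectors in vision-language models; it belongs to computer vision and multimodal learning. The scoring keywords target large language model (LLM) techniques, training methods, inference optimization, agent systems, and similar topics, and have no direct connection to this vision-language model (VLM) research. The paper does not involve LLMs, MoE, SLMs, scaling laws, pre-training/post-training, alignment techniques, RLHF, PEFT, RAG, context extension, attention optimization, reasoning methods, agent systems, quantization, decoding acceleration, hallucination mitigation, interpretability, world models, model merging, in-context learning, or AI for Science applications. All keywords therefore receive a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies intra-modal misalignment in CLIP. By analyzing the projectors and removing anisotropic directions, it proposes a training-free method that improves performance on intra-modal tasks such as image-to-image retrieval.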

摘要翻译

视觉语言模型（如CLIP）被广泛用于涉及视觉与文本模态的跨模态任务。然而，当将其单模态编码器应用于本质上属于模态内任务（如图像到图像检索）时，其性能会受到模态内失准的影响。本文研究了CLIP中的模态内失准问题，重点关注将投影前的图像和文本嵌入映射到共享嵌入空间的投影器的作用。通过分析应用于投影特征的余弦相似度形式及其与对比式CLIP损失的相互作用，我们证明训练过程中存在一个负责对齐双模态的跨模态算子，以及另一个仅执行模态内归一化但无助于促进模态内对齐的模态内算子。通过对跨模态算子进行谱分析，我们识别出一个近似各向同性的子空间，其中双模态良好对齐，同时也识别出各模态特有的各向异性方向。我们证明，这种对齐子空间可直接从投影器权重中获取，且移除各向异性方向能改善模态内对齐。我们在模态内检索和分类基准上的实验表明，这种免训练方法减少了模态内失准，显著降低了延迟，并在多种预训练的类CLIP模型中超越了现有方法。代码公开于：https://github.com/simomagi/IsoCLIP。

Abstract

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

Keywords: CLIP, Vision-Language Models, Intra-modal Alignment, Projector Analysis, Image-to-Image Retrieval, Cosine Similarity, Spectral Analysis, Training-free Method

149. ❌ MedQ-Engine: A Closed-Loop Data Engine for Evolving MLLMs in Medical Image Quality Assessment

Authors: Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma, Junjun He, Ningsheng Xu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19863v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	5.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	5.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	10.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

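Scoring rationale: The paper focuses on applying and improving multimodal large language models (MLLMs) for medical image quality assessment (Med-IQA). Its core contribution is MedQ-Engine, a closed-loop data engine that improves model performance through iterative evaluation, discovery of failure prototypes via data-driven clustering, progressive human-in-the-loop annotation, and quality-assured fine-tuning. It is therefore highly relevant to 'Large Language Models' (10), since MLLMs are the core models; to 'Post-training/SFT' (10), since it involves quality-assured fine-tuning; to 'Self-Correction/Self-Improvement' (10), since the closed-loop engine forms a self-improving cycle; and to 'AI for Science' (10), since it is applied to medicine. It has some relevance (5 each) to 'Scaling Laws AND Data Quality', 'Pre-training/Domain Adaptation', 'RAG', 'CoT Reasoning', and 'System 2 Thinking', through data quality, domain adaptation, retrieval anchors, clinical reasoning, and in-depth reasoning respectively. Other keywords such as MoE, SLMs, RLHF, and PEFT are unrelated to the paper (0).

!!! tip deepseek-chat TL;DR

The paper addresses the weak performance of multimodal large language models on medical image quality assessment. It proposes MedQ-Engine, a closed-loop data engine built on iterative evaluation, data-driven clustering, and progressive annotation, which enables an 8B-parameter model to surpass GPT-4o by over 13% using only 10K annotations and narrows the gap with human experts to 4.34%.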

摘要翻译

医学图像质量评估（Med-IQA）是临床人工智能部署的先决条件，然而多模态大语言模型（MLLMs）的表现仍远逊于人类专家，尤其是在需要提供超越简单质量评分的、包含临床推理的描述性评估时。但提升这些模型面临两大阻碍：获取描述性标注的高昂成本，以及一次性数据收集无法适应模型持续演进的弱点。为应对这些挑战，我们提出了MedQ-Engine，一个闭环数据引擎。该引擎通过迭代评估模型，利用数据驱动聚类发现失败原型；以这些原型为检索锚点，在百万级图像池中进行渐进式人机协同标注探索；并通过质量保证的微调实现演进，形成一个自我提升的循环。模型在互补的感知与描述任务上进行评估。一种基于熵的引导路由机制对标注任务进行分流，以最小化标注成本。在五种医学成像模态上的实验表明，MedQ-Engine仅使用1万次标注，就将一个80亿参数模型的性能提升至超越GPT-4o超过13%，并将与人类专家的差距缩小至仅4.34%，其样本效率相比随机采样提升了4倍以上。

Abstract

Medical image quality assessment (Med-IQA) is a prerequisite for clinical AI deployment, yet multimodal large language models (MLLMs) still fall substantially short of human experts, particularly when required to provide descriptive assessments with clinical reasoning beyond simple quality scores. However, improving them is hindered by the high cost of acquiring descriptive annotations and by the inability of one-time data collection to adapt to the model’s evolving weaknesses. To address these challenges, we propose MedQ-Engine, a closed-loop data engine that iteratively evaluates the model to discover failure prototypes via data-driven clustering, explores a million-scale image pool using these prototypes as retrieval anchors with progressive human-in-the-loop annotation, and evolves through quality-assured fine-tuning, forming a self-improving cycle. Models are evaluated on complementary perception and description tasks. An entropy-guided routing mechanism triages annotations to minimize labeling cost. Experiments across five medical imaging modalities show that MedQ-Engine elevates an 8B-parameter model to surpass GPT-4o by over 13% and narrow the gap with human experts to only 4.34%, using only 10K annotations with more than 4x sample efficiency over random sampling.

Keywords: Medical image quality assessment, Multimodal large language models, Closed-loop data engine, Iterative evaluation, Progressive human-in-the-loop annotation, Fine-tuning, Self-improving cycle, Clinical reasoning
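The abstract mentions an entropy-guided routing mechanism that triages annotations to reduce labeling cost, without further detail. One plausible reading is sketched below: samples whose predicted label distribution has high entropy go to a human expert, and confident ones are accepted automatically. The function names, the normalization by the maximum entropy, and the `threshold` value are assumptions for illustration and are not taken from the paper.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def route_annotations(predictions, threshold=0.5):
    """Split samples into auto-accepted and expert-reviewed groups.

    predictions: dict mapping sample id -> list of label probabilities.
    `threshold` is a hypothetical cut-off on normalized entropy.
    """
    auto_accept, needs_expert = [], []
    for sample_id, probs in predictions.items():
        if normalized_entropy(probs) > threshold:
            needs_expert.append(sample_id)
        else:
            auto_accept.append(sample_id)
    return auto_accept, needs_expert
```

With a threshold on normalized entropy, the same cut-off can be reused across tasks with different numbers of labels, which is the reason for the scaling step in this sketch.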

150. ❌ FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

Authors: You Li, Dewei Zhou, Fan Ma, Fu Li, Dongliang He, Yi Yang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19857v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

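Scoring rationale: The paper focuses on fine-grained temporal control in video-to-audio (V2A) generation and proposes the FoleyDirector framework, Structured Temporal Scripts (STS), and Bi-Frame Sound Synthesis. Although it is an AI application, the scoring keywords all target large language models (LLMs) and related technical principles, whereas this paper studies V2A generation based on diffusion transformers (DiT). It involves none of the keyword topics, including LLMs, MoE, scaling laws, alignment, reasoning, agents, and compression, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of fine-grained temporal control in video-to-audio generation. It proposes the FoleyDirector framework, which provides precise temporal guidance through Structured Temporal Scripts and Bi-Frame Sound Synthesis while preserving audio quality.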

摘要翻译

近期视频到音频（V2A）方法取得了显著进展，能够合成逼真的高质量音频。然而，在多事件场景或视觉线索不足的情况下（例如小区域、屏幕外声音、被遮挡或部分可见的物体），现有方法难以实现细粒度的时间控制。本文提出FoleyDirector框架，首次在基于扩散变换器（DiT）的V2A生成中实现精确的时间引导，同时保持基础模型的音频质量，并允许在V2A生成与时间可控合成之间无缝切换。FoleyDirector引入了结构化时间脚本（Structured Temporal Scripts, STS），即一组对应短时段落的描述文本，以提供更丰富的时间信息。这些特征通过脚本引导的时间融合模块进行整合，该模块采用时间脚本注意力机制以连贯地融合STS特征。为处理复杂多事件场景，我们进一步提出双帧声音合成方法，支持并行生成画面内与画面外音频，从而提升可控性。为支持训练与评估，我们构建了DirectorSound数据集，并提出了VGGSoundDirector与DirectorBench基准。实验表明，FoleyDirector在保持高音频保真度的同时显著增强了时间可控性，使用户能够扮演拟音导演的角色，推动V2A技术向更具表现力与可控性的生成方向发展。

Abstract

Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model’s audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.

Keywords: Video-to-Audio Generation, Temporal Control, Structured Temporal Scripts, Bi-Frame Sound Synthesis, Diffusion Transformer, FoleyDirector, Audio Synthesis, Controllable Generation

151. ❌ Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields

Authors: Mihnea-Bogdan Jurca, Bert Van hauwermeiren, Adrian Munteanu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19834v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

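Scoring rationale: This paper focuses on novel view synthesis in computer vision and graphics, proposing a radiance field rendering method based on Fourier encoding. Although it is an AI application, it is unrelated to every keyword on the list, which targets large language models, deep learning fundamentals, and AI for science. Its content (3D Gaussian Splatting, Fourier descriptors, MCMC optimization) has no direct connection to the listed topics of large-model techniques, training methods, inference optimization, or AI agents.

!!! tip deepseek-chat TL;DR

This paper proposes Fourier Splatting, a scalable radiance field rendering method that parameterizes planar primitives with Fourier encoded descriptors. A single trained model can be rendered at different levels of detail, and the method achieves state-of-the-art rendering quality among planar-primitive frameworks on standard benchmarks.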

摘要翻译

新型视图合成技术近期因三维高斯泼溅（3DGS）实现了革命性突破，该方法通过显式基元栅格化实现了实时渲染。然而，现有方法将视觉保真度严格与基元数量绑定：质量降级仅能通过删减基元实现。我们提出了辐射场渲染领域首个本质可扩展的基元表示方法。傅里叶泼溅采用可扩展基元，其任意闭合形状通过傅里叶编码描述符参数化平面面元获得。该表述使得单一训练模型能在运行时仅通过截断傅里叶系数，即可实现多细节层次渲染。为保障优化稳定性，我们采用直通估计器将梯度扩展至基元边界之外，并引入HYDRA——一种在马尔可夫链蒙特卡洛框架内将复杂基元分解为简单组件的致密化策略。本方法在平面基元框架中实现了最先进的渲染质量，在标准基准测试中与主流体素表示方法达到相当的感知指标，为带宽受限的高保真渲染提供了通用解决方案。

摘要 (Abstract)

Novel view synthesis has recently been revolutionized by 3D Gaussian Splatting (3DGS), which enables real-time rendering through explicit primitive rasterization. However, existing methods tie visual fidelity strictly to the number of primitives: quality downscaling is achieved only through pruning primitives. We propose the first inherently scalable primitive for radiance field rendering. Fourier Splatting employs scalable primitives with arbitrary closed shapes obtained by parameterizing planar surfels with Fourier encoded descriptors. This formulation allows a single trained model to be rendered at varying levels of detail simply by truncating Fourier coefficients at runtime. To facilitate stable optimization, we employ a straight-through estimator for gradient extension beyond the primitive boundary, and introduce HYDRA, a densification strategy that decomposes complex primitives into simpler constituents within the MCMC framework. Our method achieves state-of-the-art rendering quality among planar-primitive frameworks and comparable perceptual metrics compared to leading volumetric representations on standard benchmarks, providing a versatile solution for bandwidth-constrained high-fidelity rendering.

Keywords: Fourier Splatting, radiance field rendering, 3D Gaussian Splatting, novel view synthesis, scalable primitives, Fourier encoded descriptors, real-time rendering, HYDRA densification strategy
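
The level-of-detail mechanism described in the abstract (truncating Fourier coefficients at runtime) can be illustrated on a toy 2D contour. The sketch below is a minimal illustration, not the paper's implementation: it assumes a closed shape given by a radial Fourier series r(θ) = a0 + Σ_k (a_k cos kθ + b_k sin kθ), and the function name `contour_radius`, the `keep` parameter, and the coefficient values are all hypothetical.

```python
import numpy as np


def contour_radius(theta, a0, a, b, keep=None):
    """Evaluate a radial Fourier contour using only the first `keep` harmonics.

    r(theta) = a0 + sum_{k=1..keep} a[k-1]*cos(k*theta) + b[k-1]*sin(k*theta)
    """
    n = len(a) if keep is None else min(keep, len(a))
    r = np.full_like(theta, a0, dtype=float)
    for k in range(1, n + 1):
        r += a[k - 1] * np.cos(k * theta) + b[k - 1] * np.sin(k * theta)
    return r


# Hypothetical coefficients for one primitive (4 harmonics).
theta = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
a0 = 1.0
a = np.array([0.0, 0.2, 0.0, 0.05])
b = np.array([0.0, 0.0, 0.1, 0.0])

full = contour_radius(theta, a0, a, b)            # all 4 harmonics
coarse = contour_radius(theta, a0, a, b, keep=2)  # truncated: lower detail
circle = contour_radius(theta, a0, a, b, keep=0)  # only a0: a plain circle
```

Lower `keep` values give a smoother outline from the same stored coefficients, which is the intuition behind rendering one trained model at several levels of detail.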

Authors: Lokendra Kumar, Shubham Aggarwal Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19844v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

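Scoring rationale: This paper focuses on medical image segmentation and studies Hyper-Connections (HC), a dynamic connection mechanism, for improving multi-modal MRI brain tumor segmentation. Its core is computer vision and medical image analysis, not large language models (LLMs) or innovation in deep learning fundamentals. Nearly all keywords concern LLMs, model training, inference optimization, alignment, and agent systems, whereas this paper studies a segmentation method specific to convolutional neural network (CNN) and Transformer architectures. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', because medical image analysis is an application of AI in science (biomedicine in particular). The paper does not cover bioinformatics or cheminformatics specifically, yet the keyword is still rated 10 (highly relevant, core content). All other keywords are unrelated to the paper and receive 0.

!!! tip deepseek-chat TL;DR

This paper studies Hyper-Connections, a dynamic connection mechanism, for multi-modal MRI brain tumor segmentation. Experiments show that it consistently improves several 3D segmentation models, especially in the Enhancing Tumor region, and makes the models more sensitive to the clinically dominant sequences.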

摘要翻译

本研究首次将超连接（Hyper-Connections, HC）应用于体素级多模态脑肿瘤分割，将其作为即插即用模块替换了五种架构（nnU-Net、SwinUNETR、VT-UNet、U-Net和U-Netpp）中的固定残差连接。在BraTS 2021数据集上，动态HC持续提升了所有三维模型的性能，平均Dice系数最高提升达+1.03%，而参数量开销可忽略不计。提升在增强肿瘤子区域最为显著，体现了细粒度边界刻画能力的改进。模态消融实验进一步表明，配备HC的模型对临床主导序列表现出更敏锐的敏感性——具体而言，T1ce序列对肿瘤核心与增强肿瘤区域、FLAIR序列对全肿瘤区域的重要性更为突出，这一特性在固定连接基线中并未出现，且在所有架构中保持一致。在二维场景中，性能提升较小且对配置敏感，表明体素空间上下文放大了自适应特征聚合的优势。这些结果证实了HC是一种简单、高效且广泛适用的机制，可用于医学图像分割中的多模态特征融合。

摘要 (Abstract)

We present the first study of Hyper-Connections (HC) for volumetric multi-modal brain tumor segmentation, integrating them as a drop-in replacement for fixed residual connections across five architectures: nnU-Net, SwinUNETR, VT-UNet, U-Net, and U-Netpp. Dynamic HC consistently improves all 3D models on the BraTS 2021 dataset, yielding up to +1.03 percent mean Dice gain with negligible parameter overhead. Gains are most pronounced in the Enhancing Tumor sub-region, reflecting improved fine-grained boundary delineation. Modality ablation further reveals that HC-equipped models develop sharper sensitivity toward clinically dominant sequences, specifically T1ce for Tumor Core and Enhancing Tumor, and FLAIR for Whole Tumor, a behavior absent in fixed-connection baselines and consistent across all architectures. In 2D settings, improvements are smaller and configuration-sensitive, suggesting that volumetric spatial context amplifies the benefit of adaptive aggregation. These results establish HC as a simple, efficient, and broadly applicable mechanism for multi-modal feature fusion in medical image segmentation.

Keywords: Hyper-Connections, multi-modal MRI, brain tumor segmentation, volumetric segmentation, adaptive feature fusion, medical image analysis, BraTS 2021, 3D models
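
The abstract reports gains in mean Dice. For readers unfamiliar with the metric, the sketch below computes the Dice coefficient 2|P∩T| / (|P| + |T|) for binary masks. It is a generic illustration rather than the BraTS evaluation code; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return float(2.0 * intersection / denom)


# Toy masks: one overlapping voxel, 2 predicted and 1 true foreground voxel.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
score = dice_score(pred_mask, true_mask)  # 2*1 / (2+1)
```

A mean Dice over sub-regions (for example Whole Tumor, Tumor Core, Enhancing Tumor) would average this value across the per-region masks.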

153. ❌ HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks

Authors: Jingyu Guo, Ziye Chen, Ziwen Li, Zhengqing Gao, Jiaxin Huang, Hanlue Zhang, Fengming Huang, Yu Yao, Tongliang Liu, Mingming Gong Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19822v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

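Scoring rationale: This paper focuses on benchmarking UAV Vision-Language-Action (VLA) tasks; its core is evaluating how a UAV agent turns high-level language instructions into safe multi-stage behaviors. It covers vision-language navigation, 3D scene representation (3D Gaussian Splatting), benchmark construction, and evaluation metrics, but involves no innovation in large language models (LLMs) or deep learning fundamentals. The keywords all concern large-language-model techniques, training methods, inference optimization, alignment, and agent systems, whereas this paper builds a benchmark for a specific UAV vision-language task and does not use or improve any large-model technique, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

This paper proposes HUGE-Bench, a benchmark that tests whether UAV agents can turn concise high-level language instructions into safe, complex, process-oriented trajectories. Experiments reveal significant gaps in high-level semantic completion and safe execution for existing VLA models.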

摘要翻译

现有无人机视觉语言导航基准测试已能实现语言引导飞行，但其主要关注以目标为中心的长篇分步路径描述评估，难以有效诊断实际作业场景——此类场景需将简洁的高层指令转化为安全的多阶段行为。我们提出HUGE-Bench高层无人机视觉语言动作基准测试，用于检验智能体能否解析简明语言指令，并以安全感知能力执行复杂的面向过程轨迹。该基准包含4个真实世界数字孪生场景、8类高层任务及256万米轨迹数据，基于对齐的3D高斯溅射-网格表征系统构建，融合了照片级真实感渲染与可碰撞几何体，支持规模化生成与碰撞感知评估。我们提出面向过程与碰撞感知的度量标准，以评估过程保真度、终端精度及安全性。对代表性前沿视觉语言动作模型的实验表明，其在高层语义完成度与安全执行方面存在显著差距，凸显HUGE-Bench可作为高层无人机自主性的诊断测试平台。

摘要 (Abstract)

Existing UAV vision-language navigation (VLN) benchmarks have enabled language-guided flight, but they largely focus on long, step-wise route descriptions with goal-centric evaluation, making them less diagnostic for real operations where brief, high-level commands must be grounded into safe multi-stage behaviors. We present HUGE-Bench, a benchmark for High-Level UAV Vision-Language-Action (HL-VLA) tasks that tests whether an agent can interpret concise language and execute complex, process-oriented trajectories with safety awareness. HUGE-Bench comprises 4 real-world digital twin scenes, 8 high-level tasks, and 2.56M meters of trajectories, and is built on an aligned 3D Gaussian Splatting (3DGS)-Mesh representation that combines photorealistic rendering with collision-capable geometry for scalable generation and collision-aware evaluation. We introduce process-oriented and collision-aware metrics to assess process fidelity, terminal accuracy, and safety. Experiments on representative state-of-the-art VLA models reveal significant gaps in high-level semantic completion and safe execution, highlighting HUGE-Bench as a diagnostic testbed for high-level UAV autonomy.

Keywords: UAV, Vision-Language-Action, Benchmark, High-Level Commands, 3D Gaussian Splatting, Collision-Aware Evaluation, Process-Oriented Metrics, Autonomous Agents

154. ❌ Evaluating Vision Foundation Models for Pixel and Object Classification in Microscopy

Authors: Carolin Teuber, Anwai Archit, Tobias Boothe, Peter Ditte, Jochen Rink, Constantin Pape Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19802v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

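Scoring rationale: This paper evaluates vision foundation models (VFMs) for pixel and object classification in microscopy images, which belongs to biomedical imaging. It is unrelated to most keywords (which mainly cover LLM fundamentals, training methods, and inference optimization), because those keywords target text-based large models rather than vision models. The only relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper explicitly studies an AI application in biomedical imaging (microscopy), which falls under AI for Science, and the abstract mentions 'biomedical imaging', so it is rated 10 (highly relevant).

!!! tip deepseek-chat TL;DR

This study evaluates several vision foundation models on pixel and object classification in microscopy images. The results show consistent improvements over hand-crafted features and establish a benchmark for the field.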

摘要翻译

深度学习构成了计算机视觉领域（包括生物医学成像）大多数现代方法与工具的基础。然而，对于交互式语义分割（在此背景下常称为像素分类）和交互式对象级分类（对象分类），基于特征的浅层学习仍然被广泛使用。这是由于该领域数据的多样性、缺乏大规模预训练数据集，以及对计算效率和标注效率的需求。相比之下，用于显微镜下许多其他视觉任务（最显著的是细胞实例分割）的先进工具已依赖于深度学习，并且最近从视觉基础模型（Vision Foundation Models, VFMs），特别是SAM中获得了显著提升。本文旨在探究，与现有方法相比，VFMs是否也能改进像素和对象分类。为此，我们在五个多样化且具有挑战性的数据集上，评估了多种VFMs（包括通用模型如SAM、SAM2、DINOv3，以及领域特定模型如$μ$SAM、PathoSAM）与浅层学习及注意力探测方法的结合效果。我们的研究结果表明，相较于手工设计的特征，这些方法带来了持续的改进，并为实际应用提升提供了清晰的路径。此外，本研究为VFMs在显微成像领域的应用建立了基准，并为该领域的未来发展提供了参考。

摘要 (Abstract)

Deep learning underlies most modern approaches and tools in computer vision, including biomedical imaging. However, for interactive semantic segmentation (often called pixel classification in this context) and interactive object-level classification (object classification), feature-based shallow learning remains widely used. This is due to the diversity of data in this domain, the lack of large pretraining datasets, and the need for computational and label efficiency. In contrast, state-of-the-art tools for many other vision tasks in microscopy - most notably cellular instance segmentation - already rely on deep learning and have recently benefited substantially from vision foundation models (VFMs), particularly SAM. Here, we investigate whether VFMs can also improve pixel and object classification compared to current approaches. To this end, we evaluate several VFMs, including general-purpose models (SAM, SAM2, DINOv3) and domain-specific ones ($μ$SAM, PathoSAM), in combination with shallow learning and attentive probing on five diverse and challenging datasets. Our results demonstrate consistent improvements over hand-crafted features and provide a clear pathway toward practical improvements. Furthermore, our study establishes a benchmark for VFMs in microscopy and informs future developments in this area.

Keywords: vision foundation models, microscopy, pixel classification, object classification, biomedical imaging, SAM, benchmark, deep learning

155. ❌ Controllable Text-to-Motion Generation via Modular Body-Part Phase Control

Authors: Minyue Dai, Ke Fan, Anyi Rao, Jingbo Wang, Bo Dai Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19795v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

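Scoring rationale: This paper focuses on text-to-motion generation, a specific computer vision/graphics task, and proposes a framework based on modular body-part phase control for controllable, localized motion editing. Its core techniques are generative models (diffusion- and flow-based models), motion representation, phase signal modeling, and feature modulation. It does not involve large language models (LLMs), innovation in deep learning fundamentals (such as MoE, scaling laws, alignment, or reasoning methods), or applications of large models in other domains (such as AI for science). All scoring keywords concern LLMs and their related techniques, applications, or evaluation, while the paper's field (motion generation) has no direct connection to them, so every keyword's relevance is 0.

!!! tip deepseek-chat TL;DR

This paper addresses the difficulty, in text-to-motion generation, of editing specific body parts locally while keeping the overall motion coherent. It proposes a framework based on modular body-part phase control that uses phase signal modeling and feature modulation to give fine-grained, controllable editing of motion magnitude, speed, and timing.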

摘要翻译

文本驱动动作生成正逐渐成为动画与交互式虚拟形象制作的实用工具。然而，在保持整体运动连贯性的同时修改特定身体部位仍具挑战性。现有方法通常依赖于繁琐的高维关节约束（如轨迹），这阻碍了用户友好的迭代优化。为此，我们提出模块化身体部位相位控制，这是一种即插即用框架，通过紧凑的标量相位界面实现结构化、局部化的编辑。通过将身体部位潜在运动通道建模为以振幅、频率、相位偏移和偏移量为特征的正弦相位信号，我们提取出可解释的编码以捕捉部位特异性动态。随后，模块化相位控制网络分支通过残差特征调制注入该信号，无缝地将控制与生成主干解耦。在基于扩散和流模型的实验表明，我们的方法能够对运动幅度、速度与时序提供可预测的细粒度控制，同时保持全局运动连贯性，为可控文本驱动动作生成提供了实用范式。项目页面：https://jixiii.github.io/bp-phase-project-page/

摘要 (Abstract)

Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset, we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation. Project page: https://jixiii.github.io/bp-phase-project-page/

Keywords: Text-to-Motion Generation, Controllable Generation, Modular Body-Part Phase Control, Phase Signal Modeling, Motion Editing, Diffusion Models, Flow-based Models, Feature Modulation
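
The abstract describes body-part motion channels as sinusoidal phase signals characterized by amplitude, frequency, phase shift, and offset. The sketch below shows what such a four-scalar code could look like for a single 1D channel, recovered with an FFT. It is a toy illustration under assumptions, not the paper's extractor: the signal form A·sin(2π(f·t + s)) + b, the function names, and the FFT-based estimate (exact only when the frequency falls on an FFT bin) are all hypothetical choices.

```python
import numpy as np


def phase_signal(t, amplitude, frequency, phase_shift, offset):
    """Sinusoid A*sin(2*pi*(f*t + s)) + b; s is expressed in cycles."""
    return amplitude * np.sin(2.0 * np.pi * (frequency * t + phase_shift)) + offset


def estimate_phase_code(x, dt):
    """Estimate (amplitude, frequency, phase_shift, offset) of the dominant sinusoid."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    offset = x.mean()
    spectrum = np.fft.rfft(x - offset)
    freqs = np.fft.rfftfreq(n, d=dt)
    k = int(np.argmax(np.abs(spectrum[1:]))) + 1  # skip the DC bin
    amplitude = 2.0 * np.abs(spectrum[k]) / n
    # A sine with phase phi appears in the FFT bin with angle phi - pi/2.
    phase_shift = ((np.angle(spectrum[k]) + np.pi / 2.0) / (2.0 * np.pi)) % 1.0
    return amplitude, freqs[k], phase_shift, offset


# Toy channel: 128 samples at 32 Hz, i.e. exactly 8 periods of a 2 Hz signal.
dt = 1.0 / 32.0
t = np.arange(128) * dt
channel = phase_signal(t, amplitude=0.5, frequency=2.0, phase_shift=0.25, offset=1.0)
amp, freq, shift, off = estimate_phase_code(channel, dt)
```

Editing a motion in this picture amounts to changing one of the four scalars (for example, doubling `amp` for a larger swing) and regenerating the channel with `phase_signal`.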

156. ❌ From Plausibility to Verifiability: Risk-Controlled Generative OCR for Vision-Language Models

Authors: Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Chen Dai Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19790v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

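Scoring rationale: The paper studies risk control for vision-language models (VLMs) in generative OCR and proposes a Geometric Risk Controller to reduce extreme errors and catastrophic over-generation. All scoring keywords focus on large language models (LLMs) and related techniques (such as MoE, scaling laws, RLHF, RAG, and quantization), whereas the core of this paper is deployment risk control for VLMs in OCR applications; it does not involve LLM fundamentals, training methods, inference optimization, alignment, agent systems, or AI for science. All keywords are therefore unrelated to the paper, and the score is 0.

!!! tip deepseek-chat TL;DR

The paper targets the deployment risk that arises when vision-language models act as generative OCR engines, because autoregressive decoding favors semantic plausibility over visual verifiability. It proposes a model-agnostic Geometric Risk Controller that uses multi-view consensus and stability screening to reduce extreme errors and over-generation. Experiments show that the method controls risk effectively at a predictable coverage cost.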

摘要翻译

现代视觉语言模型（VLMs）可作为生成式OCR引擎，但开放式解码可能暴露罕见却影响严重的故障。我们识别出生成式OCR中存在一个核心部署错位问题：自回归解码倾向于语义合理性，而OCR要求输出具备视觉依据和几何可验证性。这种不匹配会导致严重错误，特别是过度生成和缺乏支持的替换，即使在基准准确率保持高位时仍会带来部署风险。因此，我们将冻结VLM的OCR任务形式化为选择性接受/弃权问题，并提出一种模型无关的几何风险控制器。该控制器通过探测同一输入的多个结构化视图，应用轻量级结构筛选，仅当跨视图共识和稳定性满足预设标准时才接受转录结果，从而形成一个小型操作点集合。在冻结VLM主干和标准OCR基准上的实验表明，该方法能在可预测的覆盖成本下，持续降低极端错误风险和灾难性过度生成现象。冻结VLM生成式OCR的可靠部署得益于显式的系统级风险控制，而非无约束的生成过程。

摘要 (Abstract)

Modern vision-language models (VLMs) can act as generative OCR engines, yet open-ended decoding can expose rare but consequential failures. We identify a core deployment misalignment in generative OCR. Autoregressive decoding favors semantic plausibility, whereas OCR requires outputs that are visually grounded and geometrically verifiable. This mismatch produces severe errors, especially over-generation and unsupported substitutions, creating deployment risk even when benchmark accuracy remains high. We therefore formulate frozen VLM OCR as a selective accept/abstain problem and propose a model-agnostic Geometric Risk Controller. The controller probes multiple structured views of the same input, applies lightweight structural screening, and accepts a transcription only when cross-view consensus and stability satisfy predefined criteria, yielding a small family of operating points. Experiments on frozen VLM backbones and standard OCR benchmarks show consistent reductions in extreme-error risk and catastrophic over-generation at predictable coverage costs. Reliable deployment of generative OCR with frozen VLMs benefits from explicit system-level risk control rather than unconstrained generation.

Keywords: Vision-Language Models, Generative OCR, Risk Control, Geometric Verifiability, Over-generation, Selective Abstain, Cross-view Consensus, Deployment Safety
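
The selective accept/abstain formulation in the abstract can be illustrated with a toy decision rule: transcribe several views of the same input, screen out structurally implausible outputs, and accept only when enough views agree. The sketch below is an assumption-laden illustration, not the paper's Geometric Risk Controller; the function name `accept_or_abstain`, the `max_chars` length bound (standing in for a geometry-derived limit), and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter


def accept_or_abstain(view_transcriptions, max_chars, min_agreement=0.75):
    """Return the consensus transcription, or None to abstain."""
    views = [v.strip() for v in view_transcriptions]
    if not views:
        return None
    # Structural screening: drop empty outputs and outputs too long for the
    # text region (a crude proxy for catching over-generation).
    plausible = [v for v in views if v and len(v) <= max_chars]
    if not plausible:
        return None
    candidate, votes = Counter(plausible).most_common(1)[0]
    # Agreement is measured against all views, so screened-out views
    # count as disagreement.
    if votes / len(views) < min_agreement:
        return None
    return candidate


# Three of four views agree: accept.
stable = accept_or_abstain(
    ["INVOICE 42", "INVOICE 42", "INVOICE 42", "INVOICE 4Z"], max_chars=12
)
# One view over-generates and one disagrees: abstain.
unstable = accept_or_abstain(
    ["INVOICE 42", "INVOICE 42 TOTAL DUE NOW", "INVOICE 4Z", "INVOICE 42"],
    max_chars=12,
)
```

Raising `min_agreement` trades coverage for lower risk, which mirrors the operating points and predictable coverage costs mentioned in the abstract.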

157. ❌ Decoupled Sensitivity-Consistency Learning for Weakly Supervised Video Anomaly Detection

Authors: Hantao Zheng, Ning Han, Yawen Zeng, Hao Chen Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19780v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

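Scoring rationale: This paper focuses on weakly supervised video anomaly detection, a computer vision task, and proposes a decoupled sensitivity-consistency learning framework. Although it uses deep learning, its content is unrelated to all scoring keywords (which center on large-model fundamentals, training methods, inference optimization, alignment techniques, and scientific applications); it does not involve large language models, MoE, quantization, RAG, alignment, or AI for science.

!!! tip deepseek-chat TL;DR

This paper addresses the trade-off between sensitivity and stability in weakly supervised video anomaly detection. It proposes a decoupled sensitivity-consistency learning framework that sets new state-of-the-art results on the UCF-Crime and XD-Violence datasets.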

摘要翻译

近期，弱监督视频异常检测方法通过采用统一框架进行联合优化取得了显著进展。然而，该范式受限于一个根本性的敏感度-稳定性权衡问题：由于检测瞬时异常与持续异常的目标存在冲突，现有方法往往产生碎片化预测或过度平滑的响应。为突破这一局限，我们提出DeSC——一种新颖的解耦敏感度-一致性框架，该框架通过差异化优化策略训练两个专用流。时序敏感流采用激进优化策略以捕捉高频突变，而语义一致流则施加鲁棒约束以保持长期连贯性并降低噪声。二者通过协同推理机制融合互补优势，该机制能减少个体偏差并生成平衡的预测结果。大量实验表明，DeSC在UCF-Crime数据集上以89.37% AUC（提升1.29%），在XD-Violence数据集上以87.18% AP（提升2.22%）创造了新的最优性能。代码已发布于https://github.com/imzht/DeSC。

摘要 (Abstract)

Recent weakly supervised video anomaly detection methods have achieved significant advances by employing unified frameworks for joint optimization. However, this paradigm is limited by a fundamental sensitivity-stability trade-off, as the conflicting objectives for detecting transient and sustained anomalies lead to either fragmented predictions or over-smoothed responses. To address this limitation, we propose DeSC, a novel Decoupled Sensitivity-Consistency framework that trains two specialized streams using distinct optimization strategies. The temporal sensitivity stream adopts an aggressive optimization strategy to capture high-frequency abrupt changes, whereas the semantic consistency stream applies robust constraints to maintain long-term coherence and reduce noise. Their complementary strengths are fused through a collaborative inference mechanism that reduces individual biases and produces balanced predictions. Extensive experiments demonstrate that DeSC establishes new state-of-the-art performance by achieving 89.37% AUC on UCF-Crime (+1.29%) and 87.18% AP on XD-Violence (+2.22%). Code is available at https://github.com/imzht/DeSC.

Keywords: weakly supervised video anomaly detection, sensitivity-consistency trade-off, decoupled framework, temporal sensitivity stream, semantic consistency stream, collaborative inference, UCF-Crime, XD-Violence

158. ❌ One Model, Two Minds: Task-Conditioned Reasoning for Unified Image Quality and Aesthetic Assessment

Authors: Wen Yin, Cencen Liu, Dingrui Liu, Bing Su, Yuan-Fang Li, Tao He Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19779v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	8.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	8.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

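Scoring rationale: The paper proposes the TATAR framework, which uses a multimodal large language model to unify Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA). The core innovation is a task-aware post-training method comprising: 1) a large language model as the vision-language backbone; 2) two-stage SFT+GRPO learning (SFT is a core method); 3) task-specific fast-slow reasoning strategies (involving CoT and System 2 thinking). It is therefore highly relevant to "Large Language Models", "Post-training/SFT", "Chain of Thought", and "System 2 Thinking". Other keywords such as MoE, quantization, and RAG are not covered. The paper is an application of large models to computer vision and fits the research background.

!!! tip deepseek-chat TL;DR

This paper addresses the reasoning and optimization mismatches that arise when image quality assessment and image aesthetic assessment are unified in one model. It proposes the task-aware TATAR framework, which shares a vision-language backbone and applies task-specific post-training, and it consistently outperforms existing unified baselines across multiple benchmarks.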

摘要翻译

将图像质量评估（IQA）与图像美学评估（IAA）统一于单一多模态大语言模型中具有显著吸引力，然而现有方法采用任务无关的通用方案，对两项任务应用相同的推理策略与奖励机制。本文指出这种设计存在根本性错配：IQA依赖低层次、客观的感知线索，受益于简洁的失真导向推理；而IAA需要深思熟虑的语义判断，难以通过逐点分数回归有效实现。我们将此归纳为推理错配与优化错配，并通过受控实验为两者提供实证依据。基于这些发现，我们提出TATAR（任务感知的非对称奖励思维框架），该统一框架在共享视觉-语言主干网络的同时，通过后训练阶段的条件化设计适应各任务本质。TATAR包含三个核心组件：快慢双轨任务推理架构——为IQA配备简洁感知依据，为IAA构建深度美学论述；两阶段SFT+GRPO学习——先建立任务感知行为先验，再进行奖励驱动的精细化训练；非对称奖励机制——对IQA采用高斯分数整形，对IAA采用瑟斯顿式完成排序。在八个基准数据集上的大量实验表明，TATAR在域内与跨域场景下均持续超越先前的统一基线模型，与任务专用模型保持竞争力，并在美学评估中呈现更稳定的训练动态。我们的研究确立了任务条件化后训练作为统一感知评分系统的原则性范式。代码已公开于https://github.com/yinwen2019/TATAR。

摘要 (Abstract)

Unifying Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) in a single multimodal large language model is appealing, yet existing methods adopt a task-agnostic recipe that applies the same reasoning strategy and reward to both tasks. We show this is fundamentally misaligned: IQA relies on low-level, objective perceptual cues and benefits from concise distortion-focused reasoning, whereas IAA requires deliberative semantic judgment and is poorly served by point-wise score regression. We identify these as a reasoning mismatch and an optimization mismatch, and provide empirical evidence for both through controlled probes. Motivated by these findings, we propose TATAR (Task-Aware Thinking with Asymmetric Rewards), a unified framework that shares the visual-language backbone while conditioning post-training on each task’s nature. TATAR combines three components: fast–slow task-specific reasoning construction that pairs IQA with concise perceptual rationales and IAA with deliberative aesthetic narratives; two-stage SFT+GRPO learning that establishes task-aware behavioral priors before reward-driven refinement; and asymmetric rewards that apply Gaussian score shaping for IQA and Thurstone-style completion ranking for IAA. Extensive experiments across eight benchmarks demonstrate that TATAR consistently outperforms prior unified baselines on both tasks under in-domain and cross-domain settings, remains competitive with task-specific specialized models, and yields more stable training dynamics for aesthetic assessment. Our results establish task-conditioned post-training as a principled paradigm for unified perceptual scoring. Our code is publicly available at https://github.com/yinwen2019/TATAR.

Keywords: multimodal large language model, image quality assessment, image aesthetic assessment, task-aware post-training, fast-slow reasoning, SFT+GRPO learning, asymmetric rewards, unified perceptual scoring
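The abstract names Gaussian score shaping as the IQA reward but does not give its form. As a minimal sketch, assuming the reward is a Gaussian kernel of the gap between the predicted and reference scores (the function name `gaussian_score_reward` and the width `sigma` are hypothetical and not taken from the paper):

```python
import math


def gaussian_score_reward(pred: float, target: float, sigma: float = 0.5) -> float:
    """Reward in (0, 1] that peaks when pred equals target.

    Assumed form: exp(-(pred - target)^2 / (2 * sigma^2)).
    sigma is a hypothetical width controlling how fast the reward decays.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    diff = pred - target
    return math.exp(-(diff * diff) / (2.0 * sigma * sigma))


if __name__ == "__main__":
    # Predictions closer to the reference score earn a higher reward.
    for pred in (3.0, 3.5, 4.5):
        print(pred, round(gaussian_score_reward(pred, target=3.0), 4))
```

Unlike a hard threshold, a smooth kernel of this kind still gives a small learning signal to predictions that are far from the reference, which is one plausible reason to shape a point-wise score reward this way.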

159. ❌ ReManNet: A Riemannian Manifold Network for Monocular 3D Lane Detection

Authors: Chengzhi Hong, Bijun Li Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19776v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

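Rationale: ReManNet targets monocular 3D lane detection, a computer vision task, and proposes a new method built on a Riemannian manifold network. Although the paper uses deep learning, its core content is unrelated to every scoring keyword, all of which concern large-model techniques, training methods, inference optimization, alignment, agent systems, and similar topics. The paper does not discuss language models, training paradigms, inference techniques, or any specific AI for Science application, so every keyword receives a relevance of 0.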
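The passing threshold of 26.6 in the score line follows from the report's configuration: passing score = 5.0 + 0.8 × total weight, with 27 keywords of weight 1.0 each. A minimal sketch of that check follows. The function names are hypothetical, and treating a paper's total as the sum of the table's score column is an assumption about how the report aggregates.

```python
def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Passing threshold from the report's formula: base + factor * total weight."""
    return base + factor * total_weight


def passes(keyword_scores: list[float], total_weight: float) -> bool:
    """Assumption: a paper's total is the sum of its per-keyword scores."""
    return sum(keyword_scores) >= passing_score(total_weight)


if __name__ == "__main__":
    threshold = passing_score(27.0)
    print(round(threshold, 1))
    # Every row in the table above scores 0.0, so the paper fails.
    print(passes([0.0] * 27, total_weight=27.0))
```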

!!! tip deepseek-chat TL;DR

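The paper addresses inaccurate monocular 3D lane detection caused by depth ambiguity and weak geometric constraints. It proposes the Road-Manifold Assumption and the ReManNet network, and reports state-of-the-art or competitive results on standard benchmarks.

Abstract (Chinese translation)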

单目三维车道线检测因深度歧义与几何约束薄弱而持续面临挑战。主流方法依赖于深度引导、鸟瞰图投影以及基于锚点或曲线头部的简化物理假设，在重映射高维图像特征的同时仅对道路几何进行弱编码。由于车道线与底层路面之间缺乏不变的几何-拓扑耦合，从二维到三维的升维映射是不适定且脆弱的，常导致凹陷、凸起和扭曲等畸变。为解决此问题，我们提出道路流形假设：道路是$\mathbb{R}^3$空间中的平滑二维流形，车道线是嵌入其中的一维子流形，而采样车道点构成密集观测，从而在曲面、曲线与点集之间建立度量与拓扑的耦合关系。基于此，我们提出ReManNet模型：首先通过图像骨干网络与检测头生成初始车道预测，随后在对称正定流形上将几何编码为黎曼高斯描述符，并通过轻量门控机制将这些描述符与视觉特征融合，以保持连贯的三维推理。我们还提出三维隧道车道交并比损失函数，这是一种联合点-曲线的优化目标，通过计算沿每条车道线管状邻域的切片重叠度来提升形状层面的对齐精度。在标准基准上的大量实验表明，ReManNet取得了最先进或具有竞争力的结果。在OpenLane数据集上，其F1分数较基线方法提升+8.2%，较先前最佳方法提升+1.8%，在场景级任务中最高提升达+6.6%。代码将公开于https://github.com/changehome717/ReManNet。

Abstract

Monocular 3D lane detection remains challenging due to depth ambiguity and weak geometric constraints. Mainstream methods rely on depth guidance, BEV projection, and anchor- or curve-based heads with simplified physical assumptions, remapping high-dimensional image features while only weakly encoding road geometry. Lacking an invariant geometric-topological coupling between lanes and the underlying road surface, 2D-to-3D lifting is ill-posed and brittle, often degenerating into concavities, bulges, and twists. To address this, we propose the Road-Manifold Assumption: the road is a smooth 2D manifold in $\mathbb{R}^3$, lanes are embedded 1D submanifolds, and sampled lane points are dense observations, thereby coupling metric and topology across surfaces, curves, and point sets. Building on this, we propose ReManNet, which first produces initial lane predictions with an image backbone and detection heads, then encodes geometry as Riemannian Gaussian descriptors on the symmetric positive-definite (SPD) manifold, and fuses these descriptors with visual features through a lightweight gate to maintain coherent 3D reasoning. We also propose the 3D Tunnel Lane IoU (3D-TLIoU) loss, a joint point-curve objective that computes slice-wise overlap of tubular neighborhoods along each lane to improve shape-level alignment. Extensive experiments on standard benchmarks demonstrate that ReManNet achieves state-of-the-art (SOTA) or competitive results. On OpenLane, it improves F1 by +8.2% over the baseline and by +1.8% over the previous best, with scenario-level gains of up to +6.6%. The code will be publicly available at https://github.com/changehome717/ReManNet.

Keywords: Monocular 3D lane detection, Riemannian manifold, Road-Manifold Assumption, SPD manifold, 3D-TLIoU loss, Geometric-topological coupling, State-of-the-art performance, OpenLane benchmark

160. ❌ Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach

Authors: Shiqi Gao, Zitong Xu, Kang Fu, Huiyu Duan, Xiongkuo Min, Jia wang, Guangtao Zhai Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19775v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

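Rationale: The paper uses large language models (LLMs) to evaluate text-guided image editing methods, proposing the TIEdit benchmark and EditProbe, an evaluator based on probing intermediate LLM layers. It is therefore highly relevant only to "Large Language Models OR LLMs OR Foundation Models" (10), because it explicitly uses LLMs as the evaluation tool. The remaining keywords cover model architectures, training methods, inference optimization, and specific application domains that the paper does not address, so each receives 0.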

!!! tip deepseek-chat TL;DR

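To address the difficulty of evaluating text-guided image editing methods, the paper introduces TIEdit, a benchmark of 512 source images and 5,120 edited images, and develops EditProbe, an evaluator that probes intermediate layers of large language models. Experiments show that EditProbe aligns with human perception more closely than existing automatic metrics.

Abstract (Chinese translation)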

评估文本引导图像编辑方法仍是一个具有挑战性的问题，因为可靠的评估需要同时考量感知质量、与文本指令的匹配度以及原始图像内容的保留程度。尽管TIE模型发展迅速，现有评估基准在规模上仍显不足，且常与人类感知判断相关性较弱。本研究提出TIEdit——一个用于系统评估文本引导图像编辑方法的基准测试集。TIEdit包含512张源图像，涵盖八类代表性编辑任务，每张图像配有编辑指令，并通过十种前沿TIE模型生成了5,120张编辑后图像。为获取可靠的主观评分，我们招募20位专家进行307,200次原始主观评分，最终汇总为三个评估维度（感知质量、编辑匹配度、内容保留度）下的15,360个平均意见分数。除基准测试集外，我们进一步提出EditProbe——一种基于大语言模型的评估器，通过探测隐藏表征的中间层来估计编辑质量。该方法不仅依赖最终模型输出，还从多模态大语言模型的中间层提取信息表征，以更精准捕捉源图像、编辑指令与编辑结果之间的语义及感知关联。实验结果表明，当前广泛使用的自动评估指标在编辑任务中与人类判断的相关性有限，而EditProbe实现了与人类感知显著更强的对齐。TIEdit与EditProbe共同为文本引导图像编辑方法提供了更可靠且符合感知的评估基础。

Abstract

Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment should simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which accumulates into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics show limited correlation with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.

Keywords: text-guided image editing, evaluation benchmark, large language models, intermediate-layer probing, human perception alignment, multimodal LLMs, TIEdit, EditProbe
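The counts quoted in the abstract are mutually consistent, which a short check makes explicit. The sketch assumes that each of the ten models edits every source image and that every expert rates every edited image on all three dimensions. The abstract does not state this outright, but the numbers match it exactly. The function name `tiedit_counts` is hypothetical.

```python
# Figures quoted in the abstract.
SOURCE_IMAGES = 512
TIE_MODELS = 10
DIMENSIONS = 3  # perceptual quality, editing alignment, content preservation
EXPERTS = 20


def tiedit_counts() -> dict[str, int]:
    """Derive the dataset sizes under the full-coverage assumption above."""
    edited = SOURCE_IMAGES * TIE_MODELS  # one edit per model per source image
    mos = edited * DIMENSIONS            # one mean opinion score per image per dimension
    raw = mos * EXPERTS                  # each MOS averages one rating per expert
    return {"edited_images": edited, "mos": mos, "raw_ratings": raw}


if __name__ == "__main__":
    print(tiedit_counts())
```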

161. ❌ Template-based Object Detection Using a Foundation Model

Authors: Valentin Braeutigam, Matthias Stock, Bernhard Egger Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19773v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

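Rationale: The paper studies template-based object detection built on a foundation model, aimed at automated testing of graphical interfaces. It is relevant only to "Large Language Models OR LLMs OR Foundation Models" (8), because it explicitly uses segmentation foundation models, an application of foundation models in computer vision. The remaining keywords cover the technical principles of large models (MoE, scaling laws, training methods, and so on), reasoning ability, agent systems, and model optimization, none of which the paper addresses. The paper is applied research rather than a technical innovation, so its relevance is low.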

!!! tip deepseek-chat TL;DR

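The paper proposes a template-based object detection method built on segmentation foundation models for automated testing of graphical interfaces. Without training data or training, it achieves results almost on par with learning-based methods such as YOLO.

Abstract (Chinese translation)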

当前广泛使用的目标检测方法大多基于学习机制，能够识别不同外观条件下的物体。这类模型通常需要训练过程及相应的训练数据集。本研究聚焦于数据变化较少但要求无需生成训练数据与训练流程的应用场景，例如软件开发过程中图形界面的自动化测试，尤其是持续集成测试。在本方法中，我们利用分割基础模型生成的片段，结合基于特征的简易分类方法。当需要更改待检测目标或其设计时，该方法可节省时间与成本，因为既无需重新训练模型，也无需创建数据集。我们在导航地图图标检测与分类任务上评估了本方法，该任务用于简化和自动化汽车行业用户界面测试。实验结果表明，我们的方法在无需训练的情况下，取得了与YOLO等基于学习的目标检测方法近乎相当的效果。

Abstract

Most currently used object detection methods are learning-based, and can detect objects under varying appearances. Those models require training and a training dataset. We focus on use cases with less data variation, but the requirement of being free of generation of training data and training. Such a setup is for example desired in automatic testing of graphical interfaces during software development, especially for continuous integration testing. In our approach, we use segments from segmentation foundation models and combine them with a simple feature-based classification method. This saves time and cost when changing the object to be searched or its design, as nothing has to be retrained and no dataset has to be created. We evaluate our method on the task of detecting and classifying icons in navigation maps, which is used to simplify and automate the testing of user interfaces in automotive industry. Our methods achieve results almost on par with learning-based object detection methods like YOLO, without the need for training.

Keywords: object detection, foundation model, segmentation, template-based, automated testing, graphical interfaces, training-free, computer vision

162. ❌ FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision

Authors: Zekai Wu, Shuqi Fan, Mengyin Liu, Yuhua Luo, Xincheng Lin, Ming Yan, Junhao Wu, Xiuhong Lin, Yuexin Ma, Chenglu Wen, Lan Xu, Siqi Shen, Cheng Wang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19770v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

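Rationale: The paper "FlashCap: Millisecond-Accurate Human Motion Capture via Flashing LEDs and Event-Based Vision" belongs to computer vision and motion capture. It proposes a millisecond-accurate human motion capture system based on flashing LEDs and event-based vision, and builds FlashMotion, a high-temporal-resolution dataset. Its core techniques are event cameras, multimodal data fusion (event, RGB, LiDAR, IMU), and ResPose, a residual pose estimation model, which makes it an application of AI in science, specifically motion analysis. The paper is unrelated to the keywords on large models, deep learning principles, language models, training methods, inference optimization, and agent systems, so every keyword except "AI for Science OR Bioinformatics OR Cheminformatics" receives 0. That keyword receives 5 because the paper applies AI (computer vision and machine learning) to a scientific field (motion analysis), which fits a broad reading of "AI for Science", but it does not focus on bioinformatics or cheminformatics.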

!!! tip deepseek-chat TL;DR

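The paper tackles high-temporal-resolution human motion capture. It develops the FlashCap system (flashing LEDs plus event-based vision) and the FlashMotion dataset, and proposes the ResPose model, achieving millisecond-level motion timing accuracy and reducing pose estimation error by about 40%.

Abstract (Chinese translation)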

精确运动计时（Precise Motion Timing, PMT）对于快速运动分析至关重要。在体育竞赛中，毫秒级的差异可能决定胜负。尽管人体姿态估计（Human Pose Estimation, HPE）领域已取得显著进展，但由于高时间分辨率标注数据集的稀缺，PMT在很大程度上仍被HPE学界所忽视。目前，PMT仅在奥运会等专业场景中通过高速RGB相机实现；然而，其高昂成本、对光线的敏感性、带宽需求及计算复杂性限制了其在日常应用中的可行性。我们开发了FlashCap，首个基于闪烁LED的运动捕捉系统，专用于PMT。借助FlashCap，我们收集了一个毫秒级分辨率的人体运动数据集FlashMotion，包含事件、RGB、激光雷达和惯性测量单元模态，并通过严格验证证明了其高质量。为评估FlashMotion的价值，我们执行了两项任务：精确运动计时与高时间分辨率HPE。针对这些任务，我们提出了ResPose——一个简单而有效的基线模型，能够基于事件与RGB数据学习残差姿态。实验结果表明，ResPose将姿态估计误差降低了约40%，并实现了毫秒级计时精度，为相关研究开辟了新机遇。数据集与代码将向学界公开。

Abstract

Precise motion timing (PMT) is crucial for swift motion analysis. A millisecond difference may determine victory or defeat in sports competitions. Despite substantial progress in human pose estimation (HPE), PMT remains largely overlooked by the HPE community due to the limited availability of high-temporal-resolution labeled datasets. Today, PMT is achieved using high-speed RGB cameras in specialized scenarios such as the Olympic Games; however, their high costs, light sensitivity, bandwidth, and computational complexity limit their feasibility for daily use. We developed FlashCap, the first flashing LED-based MoCap system for PMT. With FlashCap, we collect a millisecond-resolution human motion dataset, FlashMotion, comprising the event, RGB, LiDAR, and IMU modalities, and demonstrate its high quality through rigorous validation. To evaluate the merits of FlashMotion, we perform two tasks: precise motion timing and high-temporal-resolution HPE. For these tasks, we propose ResPose, a simple yet effective baseline that learns residual poses based on events and RGBs. Experimental results show that ResPose reduces pose estimation errors by ~40% and achieves millisecond-level timing accuracy, enabling new research opportunities. The dataset and code will be shared with the community.

Keywords: human motion capture, millisecond accuracy, flashing LEDs, event-based vision, high-temporal-resolution dataset, pose estimation, ResPose, FlashMotion

163. ❌ FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs

Authors: Zhihan Yin, Jianxin Liang, Yueqian Wang, Yifeng Yao, Huishuai Zhang, Dongyan Zhao Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19765v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

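Rationale: The paper studies hallucination in multimodal large language models (MLLMs), so it is highly relevant to "Large Language Models" (10). It focuses on hallucination evaluation, which is highly relevant to "Hallucination Mitigation" (10). It systematically evaluates Chain-of-Thought (CoT) prompting techniques, which is highly relevant to "Chain of Thought" (10). It analyzes model reasoning processes, which gives it some relevance to "System 2 Thinking" (5). Its evaluation sheds light on model reasoning processes, which gives it some relevance to "Mechanistic Interpretability" (5). The remaining keywords, such as MoE, SLMs, scaling laws, training techniques, inference acceleration, and AI for Science, are unrelated to the paper (0).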

!!! tip deepseek-chat TL;DR

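The paper proposes FREAK, a benchmark for fine-grained evaluation of hallucination in the detailed visual perception of multimodal large language models. Its experiments reveal severe hallucination in current state-of-the-art models and the limitations of Chain-of-Thought prompting.

Abstract (Chinese translation)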

多模态大语言模型（MLLMs）普遍存在幻觉问题。现有的幻觉评估基准常受限于任务过于简化导致指标饱和，或多样性不足而无法充分评估前沿多模态模型的幻觉程度。为弥补这一空白，我们提出了FREAK，一个专为MLLMs细粒度幻觉评估设计的综合性多模态基准。通过呈现具有细粒度反常识编辑的高质量逼真图像，FREAK创新性地评估了MLLMs在细节视觉感知中的幻觉现象。在FREAK上的大量实验表明，当前最先进（SOTA）模型在细节视觉感知方面存在严重的幻觉问题。为进行更深入探究，我们构建了一个受控子集，以间接评估模型感知目标细节信息的能力。通过在该任务中对主流思维链（Chain-of-Thought, CoT）提示技术进行系统评估，我们揭示了关于幻觉模式与模型推理过程的关键洞见。

Abstract

Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model’s ability to perceive target detailed information. Through systematic evaluation of prevailing Chain-of-Thought (CoT) prompting techniques within this task, we reveal critical insights regarding hallucination patterns and model reasoning processes.

Keywords: Multimodal Large Language Models, Hallucination Evaluation, Fine-grained Assessment, Chain-of-Thought, Visual Perception, Benchmark, MLLMs, Reasoning Processes

164. ❌ Adapting a Pre-trained Single-Cell Foundation Model to Spatial Gene Expression Generation from Histology Images

Authors: Donghai Fang, Yongheng Li, Zhen Wang, Yuansong Zeng, Wenwen Min Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19766v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	8.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

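Rationale: The paper applies a single-cell foundation model (sc-FM) to spatial transcriptomics, which places it in AI for Science (bioinformatics) and makes it highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10). Its core contribution is adapting a pre-trained single-cell foundation model, which is highly relevant to "Pre-training OR Continual Pre-training OR Domain Adaptation" (10). It also introduces SoftAdaLN, a lightweight modulation method, which gives it some relevance to "PEFT OR LoRA OR Parameter-efficient Fine-tuning" (8). The paper mentions foundation models, which gives it some relevance to "Large Language Models OR LLMs OR Foundation Models" (8), although these are not conventional LLMs. The remaining keywords, such as MoE, SLMs, scaling laws, SFT, alignment, RAG, and inference acceleration, are unrelated to the paper (0).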

!!! tip deepseek-chat TL;DR

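The paper proposes HINGE, which adapts a pre-trained single-cell foundation model through lightweight modulation to generate spatial gene expression from histology images. It outperforms existing methods on three ST datasets while preserving consistency of gene relationships.

Abstract (Chinese translation)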

空间转录组学（ST）能够实现点层级原位表达谱分析，但其高昂成本和有限通量促使研究者尝试直接通过HE染色组织学图像预测基因表达。近期研究探索利用基于分数或流的生成模型，从组织学图像估计基因表达的条件分布，为确定性回归方法提供了灵活的替代方案。然而，现有生成方法大多忽略了对基因间依赖关系的显式建模，削弱了生物学一致性。单细胞基础模型（sc-FMs）通过跨多样细胞群体的预训练，捕获了仅靠组织学无法揭示的关键基因关系。然而，将仅基于表达的sc-FMs应用于以组织学为条件的表达建模面临三重挑战：缺乏视觉处理通路、预训练目标与条件性ST任务不匹配，以及混合细胞ST监督数据的稀缺性。为解决这些挑战，我们提出HINGE（HIstology-coNditioned GEneration）方法，将预训练的sc-FM改造为条件表达生成器，同时最大程度保留其已学习的基因关系。我们通过引入SoftAdaLN实现这一目标——这是一种轻量级、身份初始化的调制模块，可将层级视觉上下文注入模型主干，并结合表达空间的掩码扩散目标与渐进式课程学习策略，以确保目标对齐和训练稳定性。在三个ST数据集上的评估表明，本方法在平均皮尔逊相关系数上优于现有先进基线，并能生成更准确的空间标志物表达模式及更高的成对共表达一致性，为适配预训练sc-FMs实现组织学条件化的空间表达生成提供了实用路径。

Abstract

Spatial transcriptomics (ST) enables spot-level in situ expression profiling, but its high cost and limited throughput motivate predicting expression directly from HE-stained histology. Recent advances explore using score- or flow-based generative models to estimate the conditional distribution of gene expression from histology, offering a flexible alternative to deterministic regression approaches. However, most existing generative approaches omit explicit modeling of gene-gene dependencies, undermining biological coherence. Single-cell foundation models (sc-FMs), pre-trained across diverse cell populations, capture these critical gene relationships that histology alone cannot reveal. Yet, applying expression-only sc-FMs to histology-conditioned expression modeling is nontrivial due to the absence of a visual pathway, a mismatch between their pre-training and conditional ST objectives, and the scarcity of mixed-cell ST supervision. To address these challenges, we propose HINGE (HIstology-coNditioned GEneration), which retrofits a pre-trained sc-FM into a conditional expression generator while mostly preserving its learned gene relationships. We achieve this by introducing SoftAdaLN, a lightweight, identity-initialized modulation that injects layer-wise visual context into the backbone, coupled with an expression-space masked diffusion objective and a warm-start curriculum to ensure objective alignment and training stability. Evaluated on three ST datasets, ours outperforms state-of-the-art baselines on mean Pearson correlation and yields more accurate spatial marker expression patterns and higher pairwise co-expression consistency, establishing a practical route to adapt pre-trained sc-FMs for histology-conditioned spatial expression generation.

Keywords: spatial transcriptomics, single-cell foundation model, histology-conditioned generation, gene expression prediction, domain adaptation, parameter-efficient fine-tuning, generative model, bioinformatics
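The abstract describes SoftAdaLN only as a lightweight, identity-initialized modulation that injects visual context into the backbone. The sketch below is not the paper's module. It illustrates the general idea with a generic AdaLN-style scale and shift whose projections start at zero, so the pre-trained features pass through unchanged before training. The class name, dimensions, and update rule are all hypothetical.

```python
class IdentityInitModulation:
    """Context-conditioned scale and shift that starts as the identity map.

    Hypothetical stand-in for an AdaLN-style module:
    h * (1 + scale(ctx)) + shift(ctx), with zero-initialized projections.
    """

    def __init__(self, feat_dim: int, ctx_dim: int) -> None:
        # Zero weights mean scale = shift = 0 before any training step.
        self.w_scale = [[0.0] * ctx_dim for _ in range(feat_dim)]
        self.w_shift = [[0.0] * ctx_dim for _ in range(feat_dim)]

    @staticmethod
    def _project(weights: list[list[float]], ctx: list[float]) -> list[float]:
        # Plain matrix-vector product: one output per feature dimension.
        return [sum(w * c for w, c in zip(row, ctx)) for row in weights]

    def __call__(self, h: list[float], ctx: list[float]) -> list[float]:
        scale = self._project(self.w_scale, ctx)
        shift = self._project(self.w_shift, ctx)
        return [x * (1.0 + s) + b for x, s, b in zip(h, scale, shift)]


if __name__ == "__main__":
    mod = IdentityInitModulation(feat_dim=3, ctx_dim=2)
    h = [0.5, -1.0, 2.0]
    print(mod(h, ctx=[0.3, 0.7]) == h)  # features are untouched at initialization
    mod.w_shift[0][0] = 1.0             # pretend training changed one weight
    print(mod(h, ctx=[0.3, 0.7]))
```

Starting from the identity is a common way to add a conditioning pathway to a pre-trained backbone without disturbing what it has already learned, which matches the abstract's stated goal of mostly preserving the learned gene relationships.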

165. ❌ PCSTracker: Long-Term Scene Flow Estimation for Point Cloud Sequences

Authors: Min Lin, Gangwei Xu, Xianqi Wang, Yuyi Peng, Xin Yang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19762v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

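Rationale: PCSTracker focuses on long-term scene flow estimation for point cloud sequences, which belongs to computer vision and 3D motion analysis. All scoring keywords concern large models, the technical principles of deep learning, or applications of AI in science, whereas this paper studies point cloud processing, geometric optimization, and motion estimation. It does not involve large-model techniques, training methods, inference optimization, alignment, agent systems, or scientific AI applications, so every keyword receives a relevance of 0.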

!!! tip deepseek-chat TL;DR

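The paper proposes PCSTracker, an end-to-end framework for consistent scene flow estimation in point cloud sequences. An iterative geometry-motion joint optimization module and a spatio-temporal point trajectory update module address the motion inconsistency that geometric changes, occlusions, and error accumulation cause in long sequences. It achieves the best accuracy on synthetic and real-world datasets while running in real time.

Abstract (Chinese translation)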

点云场景流估计是长期精细化三维运动分析的基础。然而，现有方法通常局限于两帧配对设定，且随着几何形状演变、遮挡出现以及误差累积，难以在长序列中保持时间一致性。本研究提出PCSTracker，这是首个专为点云序列中一致性场景流估计而设计的端到端框架。具体而言，我们引入了一种迭代式几何运动联合优化模块（Iterative Geometry Motion joint Optimization module, IGMO），该模块显式建模点特征的时间演化，以缓解动态几何变化导致的对应关系不一致问题。此外，我们提出了一种时空点轨迹更新模块（Spatio-Temporal point Trajectory Update module, STTU），利用宽时间上下文推断被遮挡点的合理位置，从而确保连贯的运动估计。为进一步处理长序列，我们采用了一种重叠滑动窗口推理策略，通过跨窗口传播与窗内细化交替进行，有效抑制误差累积并维持稳定的长期运动一致性。在合成数据集PointOdyssey3D和真实世界数据集ADT3D上进行的大量实验表明，PCSTracker在长期场景流估计中取得了最佳精度，并以32.5 FPS的速度保持实时性能，同时相较于基于RGB-D的方法展现出更优越的三维运动理解能力。

Abstract

Point cloud scene flow estimation is fundamental to long-term and fine-grained 3D motion analysis. However, existing methods are typically limited to pairwise settings and struggle to maintain temporal consistency over long sequences as geometry evolves, occlusions emerge, and errors accumulate. In this work, we propose PCSTracker, the first end-to-end framework specifically designed for consistent scene flow estimation in point cloud sequences. Specifically, we introduce an iterative geometry motion joint optimization module (IGMO) that explicitly models the temporal evolution of point features to alleviate correspondence inconsistencies caused by dynamic geometric changes. In addition, a spatio-temporal point trajectory update module (STTU) is proposed to leverage broad temporal context to infer plausible positions for occluded points, ensuring coherent motion estimation. To further handle long sequences, we employ an overlapping sliding-window inference strategy that alternates cross-window propagation and in-window refinement, effectively suppressing error accumulation and maintaining stable long-term motion consistency. Extensive experiments on the synthetic PointOdyssey3D and real-world ADT3D datasets show that PCSTracker achieves the best accuracy in long-term scene flow estimation and maintains real-time performance at 32.5 FPS, while demonstrating superior 3D motion understanding compared to RGB-D-based approaches.

Keywords: point cloud scene flow, long-term motion consistency, temporal evolution, occlusion handling, iterative optimization, 3D motion analysis, sliding-window inference
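The overlapping sliding-window strategy mentioned in the abstract can be illustrated by how a long sequence is split into frame ranges. This is a minimal sketch of the windowing alone, with a hypothetical function name and hypothetical window and overlap sizes. The paper's cross-window propagation and in-window refinement are not modeled here.

```python
def overlapping_windows(num_frames: int, window: int, overlap: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) frame ranges that cover the sequence.

    Consecutive windows share `overlap` frames, so estimates from one window
    can seed the next one. The last window is clipped to the sequence end.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap
    ranges = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break  # the sequence is fully covered
        start += stride
    return ranges


if __name__ == "__main__":
    for start, end in overlapping_windows(num_frames=10, window=4, overlap=2):
        print(f"frames [{start}, {end})")
```

The shared frames are what would let a tracker carry point positions from the end of one window into the start of the next instead of restarting from scratch.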

166. ❌ Growing Networks with Autonomous Pruning

Authors: Charles De Lambilly, Stefan Duffner Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19759v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	5.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	5.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0
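The pass mark of 26.6 shown in the score line follows from the report's configuration: a base of 5.0 plus 0.8 times the total keyword weight of 27.0. A minimal sketch of that arithmetic is below; the function names are hypothetical, and the per-keyword score formula is not stated in the report, so it is not modeled here.

```python
def pass_threshold(total_weight: float, base: float = 5.0, slope: float = 0.8) -> float:
    """Pass mark as configured in the report: base + slope * total weight."""
    return base + slope * total_weight


def passes(total_score: float, total_weight: float) -> bool:
    """A paper passes when its total score reaches the pass mark."""
    return total_score >= pass_threshold(total_weight)


if __name__ == "__main__":
    print(round(pass_threshold(27.0), 1))  # pass mark for 27 keywords of weight 1.0
    print(passes(0.0, 27.0))               # a paper scoring 0.0 does not pass
```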

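Scoring rationale: The paper proposes GNAP, a dynamic network architecture for image classification that adjusts network size and parameter count during training through growth and pruning, yielding highly accurate sparse neural networks. It is unrelated to most keywords, which target large language models (LLMs) and associated techniques (alignment, reasoning, agents, and so on). Only two keywords are weakly related: (1) 'Mixture of Experts OR MoE OR Sparse Models': the paper concerns sparse models but not an MoE architecture, hence 5 points; (2) 'Quantization OR Model Compression OR Low-bit Weights': the paper compresses the model through pruning but does not involve quantization or low-bit weights, hence 5 points. All other keywords are unrelated, because the paper focuses on dynamic structure optimization of conventional convolutional neural networks rather than on large models or on applications of deep learning to science.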

!!! tip deepseek-chat TL;DR

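The paper proposes GNAP, a dynamic network architecture that optimizes network size and parameter count during training through autonomous growth and pruning, yielding highly accurate sparse neural networks for image classification.

Abstract (Chinese translation)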

本文提出了一种用于图像分类的自主剪枝生长网络。与传统卷积神经网络不同，GNAP在训练过程中动态调整网络规模及参数量，旨在以尽可能少的参数实现最佳数据拟合。该机制通过生长与剪枝两种互补过程实现：GNAP初始参数量极少，每当网络训练收敛至饱和点时，系统会周期性地扩展网络规模以增强表达能力；在各生长阶段之间，模型参数在梯度下降的完全自主调控下同步进行分类训练与剪枝操作。生长阶段使GNAP持续提升分类性能，而自主剪枝机制则确保其始终保持参数稀疏性。在多个图像分类基准测试上的实验结果表明，本方法能够训练出兼具极高稀疏性与高精度的神经网络。例如在MNIST数据集上，我们仅用6.2千参数就达到了99.44%的准确率；在CIFAR10数据集上，使用157.8千参数实现了92.2%的准确率。

Abstract

This paper introduces Growing Networks with Autonomous Pruning (GNAP) for image classification. Unlike traditional convolutional neural networks, GNAP change their size, as well as the number of parameters they are using, during training, in order to best fit the data while trying to use as few parameters as possible. This is achieved through two complementary mechanisms: growth and pruning. GNAP start with few parameters, but their size is expanded periodically during training to add more expressive power each time the network has converged to a saturation point. Between these growing phases, model parameters are trained for classification and pruned simultaneously, with complete autonomy by gradient descent. Growing phases allow GNAP to improve their classification performance, while autonomous pruning allows them to keep as few parameters as possible. Experimental results on several image classification benchmarks show that our approach can train extremely sparse neural networks with high accuracy. For example, on MNIST, we achieved 99.44% accuracy with as few as 6.2k parameters, while on CIFAR10, we achieved 92.2% accuracy with 157.8k parameters.

Keywords: Growing Networks, Autonomous Pruning, Sparse Neural Networks, Image Classification, Parameter Efficiency, Dynamic Architecture, Model Compression, Gradient Descent

167. ❌ ReLi3D: Relightable Multi-view 3D Reconstruction with Disentangled Illumination

Authors: Jan-Niklas Dihlmann, Mark Boss, Simon Donne, Andreas Engelhardt, Hendrik P. A. Lensch, Varun Jampani Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19753v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

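Scoring rationale: ReLi3D belongs to computer vision and graphics. It studies a unified end-to-end pipeline for 3D reconstruction, material estimation, and illumination disentanglement from sparse multi-view images. Its core contributions are multi-view constraints, a transformer cross-conditioning architecture, a unified two-path prediction strategy, a differentiable Monte Carlo multiple importance sampling renderer, and a mixed-domain training protocol. The keywords all concern large language models (LLMs), innovations in deep learning principles, or applications of AI to science, whereas the paper's topic is 3D reconstruction and computer vision. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics', since 3D reconstruction can be viewed as an application of AI in science or engineering; the paper does not explicitly mention bioinformatics or cheminformatics, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.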

!!! tip deepseek-chat TL;DR

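The paper proposes ReLi3D, a unified end-to-end pipeline that is the first to simultaneously reconstruct complete 3D geometry, spatially varying physically based materials, and environment illumination from sparse multi-view images, in under one second, enabling near-instantaneous generation of relightable 3D assets.

Abstract (Chinese translation)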

从图像重建三维资产长期以来需要分离的几何重建、材质估计与光照恢复流程，各流程均存在特定局限性与计算开销。本文提出ReLi3D——首个统一的端到端流程，可在不足一秒内从稀疏多视角图像中同步重建完整三维几何、空间变化的物理材质及环境光照。我们的核心见解在于：多视角约束能显著改善材质与光照的解耦效果，这对单图像方法而言本质上仍是病态问题。本方法的关键在于通过Transformer交叉条件架构融合多视角输入，继而采用新颖的统一双路径预测策略：第一路径预测物体结构与外观，第二路径从图像背景或物体反射中预测环境光照。结合可微分蒙特卡洛多重重要性采样渲染器，该设计构建了最优的光照解耦训练流程。此外，通过融合合成PBR数据集与真实世界RGB采集数据的混合域训练方案，我们在几何精度、材质准确性与光照质量方面实现了可泛化的结果。通过将先前分离的重建任务统一至单次前馈计算，本方法实现了近乎即时生成完整、可重光照三维资产的能力。项目页面：https://reli3d.jdihlmann.com/

Abstract

Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Key to our approach is the fusion of the multi-view input via a transformer cross-conditioning architecture, followed by a novel unified two-path prediction strategy. The first path predicts the object’s structure and appearance, while the second path predicts the environment illumination from image background or object reflections. This, combined with a differentiable Monte Carlo multiple importance sampling renderer, creates an optimal illumination disentanglement training pipeline. In addition, with our mixed domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, we establish generalizable results in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: https://reli3d.jdihlmann.com/
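The abstract mentions a Monte Carlo multiple importance sampling (MIS) renderer. As a generic illustration of MIS only, the sketch below combines two sampling strategies with the balance heuristic to estimate a one-dimensional integral. The integrand, the two densities, and the function names are assumptions made for this toy example; it is not ReLi3D's renderer, which is differentiable and operates on light transport.

```python
import math
import random


def integrand(x: float) -> float:
    """Toy integrand; its integral over [0, 1] is exactly 1."""
    return 3.0 * x * x


def mis_estimate(n_uniform: int, n_linear: int, seed: int = 0) -> float:
    """Balance-heuristic MIS estimate of the integral of `integrand` on [0, 1].

    Strategy 1 samples x uniformly (density 1).
    Strategy 2 samples x with density 2x (inverse CDF: x = sqrt(u)).
    With the balance heuristic, every sample contributes
    f(x) / (n1 * p1(x) + n2 * p2(x)), regardless of which strategy drew it.
    """
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n_uniform)]
    samples += [math.sqrt(rng.random()) for _ in range(n_linear)]
    total = 0.0
    for x in samples:
        combined_density = n_uniform * 1.0 + n_linear * 2.0 * x
        total += integrand(x) / combined_density
    return total


if __name__ == "__main__":
    print(mis_estimate(20000, 20000))  # should be close to 1.0
```

The balance heuristic keeps each sample's weight bounded by mixing the densities, which is why combining strategies tends to be more robust than relying on either one alone.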

Keywords: 3D reconstruction, multi-view images, material estimation, illumination disentanglement, transformer architecture, Monte Carlo rendering, relightable assets, end-to-end pipeline

168. ❌ PhysNeXt: Next-Generation Dual-Branch Structured Attention Fusion Network for Remote Photoplethysmography Measurement

Authors: Junzhe Cao, Bo Zhao, Zhiyi Niu, Dan Guo, Yue Sun, Haochen Liang, Yong Xu, Zitong YU Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19752v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

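Scoring rationale: PhysNeXt is a deep learning framework for remote photoplethysmography (rPPG) and belongs to computer vision and biomedical signal processing. Every keyword except the last concerns LLM techniques, training methods, inference optimization, alignment, agent systems, and similar topics, and the paper involves no LLM or general large-model technique. The only slightly related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': rPPG is a biomedical application and can be viewed as AI applied to science (specifically healthcare), but the paper's core is a specific computer vision task rather than general scientific AI, so it receives 5 points (some relevance). All other keywords are entirely unrelated and receive 0 points.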

!!! tip deepseek-chat TL;DR

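The paper proposes PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and spatial-temporal map representations to make remote photoplethysmography (rPPG) signal extraction more robust; experiments show more stable and fine-grained rPPG signal recovery under challenging conditions.

Abstract (Chinese translation)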

远程光电容积描记术（remote photoplethysmography, rPPG）通过分析面部皮肤因心脏搏动引起的细微颜色变化，实现了心率及其他生命体征的非接触式测量。现有的rPPG方法主要基于两种路径：一是从原始视频进行端到端建模，二是利用中间时空图（spatial-temporal map, STMap）表示。前者保留了完整的时空信息，能够捕捉与心跳相关的微弱信号，但同时也引入了由运动伪影和光照变化带来的显著噪声。后者则将多个面部感兴趣区域的时间颜色变化堆叠为紧凑的二维表示，虽可能损失部分高频细节，但大幅降低了数据量和计算复杂度。为有效整合两种方法的优势，本文提出PhysNeXt——一种双输入深度学习框架，可协同利用视频帧与STMap表示。通过引入时空差分建模单元、跨模态交互模块以及基于结构化注意力的解码器，PhysNeXt协同增强了脉搏信号提取的鲁棒性。实验结果表明，在复杂条件下，PhysNeXt能够实现更稳定、更精细的rPPG信号恢复，验证了视频与STMap表示联合建模的有效性。相关代码将公开发布。

Abstract

Remote photoplethysmography (rPPG) enables contactless measurement of heart rate and other vital signs by analyzing subtle color variations in facial skin induced by cardiac pulsation. Current rPPG methods are mainly based on either end-to-end modeling from raw videos or intermediate spatial-temporal map (STMap) representations. The former preserves complete spatiotemporal information and can capture subtle heartbeat-related signals, but it also introduces substantial noise from motion artifacts and illumination variations. The latter stacks the temporal color changes of multiple facial regions of interest into compact two-dimensional representations, significantly reducing data volume and computational complexity, although some high-frequency details may be lost. To effectively integrate the mutual strengths, we propose PhysNeXt, a dual-input deep learning framework that jointly exploits video frames and STMap representations. By incorporating a spatio-temporal difference modeling unit, a cross-modal interaction module, and a structured attention-based decoder, PhysNeXt collaboratively enhances the robustness of pulse signal extraction. Experimental results demonstrate that PhysNeXt achieves more stable and fine-grained rPPG signal recovery under challenging conditions, validating the effectiveness of joint modeling of video and STMap representations. The codes will be released.
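The STMap idea described in the abstract, stacking the temporal color changes of several facial regions of interest into a compact 2D representation, can be sketched as follows. This is a simplified illustration: the nested-list frame layout, the `(top, left, bottom, right)` box convention, and the function names are assumptions, and steps such as face detection or normalization that a real rPPG pipeline would need are omitted.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]
Frame = Sequence[Sequence[Pixel]]        # frame[row][col] -> (R, G, B)
Box = Tuple[int, int, int, int]          # (top, left, bottom, right), half-open


def roi_mean(frame: Frame, box: Box) -> Pixel:
    """Average color of the pixels inside one region of interest."""
    top, left, bottom, right = box
    sums = [0.0, 0.0, 0.0]
    count = 0
    for row in frame[top:bottom]:
        for pixel in row[left:right]:
            for channel in range(3):
                sums[channel] += pixel[channel]
            count += 1
    if count == 0:
        raise ValueError("ROI does not cover any pixel")
    return (sums[0] / count, sums[1] / count, sums[2] / count)


def build_stmap(frames: Sequence[Frame], rois: Sequence[Box]) -> List[List[Pixel]]:
    """Return stmap[roi_index][time_index] = mean color of that ROI in that frame."""
    return [[roi_mean(frame, box) for frame in frames] for box in rois]


if __name__ == "__main__":
    # Three 2x2 frames whose color grows over time, two hypothetical ROIs.
    frames = [[[(t * 10.0, t * 20.0, t * 30.0)] * 2 for _ in range(2)] for t in range(3)]
    stmap = build_stmap(frames, rois=[(0, 0, 1, 1), (0, 0, 2, 2)])
    print(len(stmap), len(stmap[0]))  # number of ROIs, number of frames
```

The resulting map has one row per region and one column per frame, which is far smaller than the raw video but no longer retains per-pixel detail.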

Keywords: remote photoplethysmography, rPPG, dual-branch network, spatio-temporal modeling, attention fusion, heart rate measurement, deep learning, video analysis

169. ❌ PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing

Authors: Jiadong Liang, Bojun Xiong, Jie Tian, Hua Li, Xiao Long, Yong Zheng, Huan Fu Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19731v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

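Scoring rationale: The paper addresses portrait video editing in computer vision, specifically disentangling facial expression from head pose using the 3D Morphable Face Model (3DMM). Although it mentions pre-training a teacher model for supervision, this is only weakly related to pre-training in the general sense used for large models (such as LLM pre-training), so that keyword receives 5 points. All other keywords, which cover large language models, reasoning, alignment, compression, scientific AI, and similar topics, are entirely unrelated; the paper involves no large-model technique or application, so they receive 0 points.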

!!! tip deepseek-chat TL;DR

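The paper proposes PerformRecast, which revises the keypoint transformation formula to agree better with the 3DMM model and thereby disentangles facial expression from head pose more cleanly in portrait videos, allowing expressions to be edited independently; experiments show better controllability and efficiency than existing methods.

Abstract (Chinese translation)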

本文主要研究基于驱动视频的纯表情肖像视频表演编辑任务，该任务在动画和电影工业中具有重要作用。现有研究大多集中于肖像动画，其目标是根据驱动视频的面部运动对静态肖像图像进行动画化。因此，这些方法在将面部表情与头部姿态旋转解耦方面仍面临挑战，从而缺乏独立编辑面部表情的能力。本文提出PerformRecast，一种通用的纯表情视频编辑方法，致力于重演现有影视动画中的表演。我们方法的核心洞见源于三维可变形人脸模型（3D Morphable Face Model, 3DMM）的特性，该模型使用分离的参数分别建模三维人脸网格的身份、面部表情和头部姿态。为此，我们改进了现有方法中的关键点变换公式，使其更符合3DMM模型，从而实现更好的解耦效果，并为用户提供更细粒度的控制。此外，为避免生成结果中人脸边界区域的对齐偏差，我们将输入肖像图像的面部与非面部区域解耦，并预训练一个教师模型以对它们分别提供监督。大量实验表明，我们的方法能够生成高质量且更忠实于驱动视频的结果，在可控性和效率方面均优于现有方法。我们的代码、数据及训练模型发布于https://youku-aigc.github.io/PerformRecast。

Abstract

This paper primarily investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in animation and film industries. Most existing research mainly focuses on portrait animation, which aims to animate a static portrait image according to the facial motion from the driving video. As a consequence, it remains challenging for them to disentangle the facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method which is dedicated to recast the performance in existing film and animation. The key insight of our method comes from the characteristics of 3D Morphable Face Model (3DMM), which models the face identity, facial expression and head pose of 3D face mesh with separate parameters. Therefore, we improve the keypoints transformation formula in previous methods to make it more consistent with 3DMM model, which achieves a better disentanglement and provides users with much more fine-grained control. Furthermore, to avoid the misalignment around the boundary of face in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for them. Extensive experiments show that our method produces high-quality results which are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data and trained models are available at https://youku-aigc.github.io/PerformRecast.
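The property of 3DMM that the abstract relies on, separate parameters for identity, expression, and head pose, can be shown with a toy linear model. The sketch below uses the standard 3DMM form (mean shape plus identity and expression offsets, followed by a rigid transform); the tiny two-vertex mesh, the bases, and the function names are made up for illustration, and the paper's actual keypoint transformation formula is not reproduced here.

```python
import math
from typing import List, Sequence

Vec3 = List[float]


def face_shape(mean: Sequence[Vec3], id_basis, exp_basis, alpha, beta) -> List[Vec3]:
    """Vertices = mean + sum_i alpha_i * id_basis_i + sum_j beta_j * exp_basis_j."""
    vertices = []
    for v, base in enumerate(mean):
        point = list(base)
        for coeff, basis in zip(alpha, id_basis):
            for k in range(3):
                point[k] += coeff * basis[v][k]
        for coeff, basis in zip(beta, exp_basis):
            for k in range(3):
                point[k] += coeff * basis[v][k]
        vertices.append(point)
    return vertices


def yaw_rotation(angle: float) -> List[Vec3]:
    """Rotation matrix about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def apply_pose(vertices: Sequence[Vec3], rotation, translation: Vec3) -> List[Vec3]:
    """Apply the head pose (rotation, then translation) to every vertex."""
    return [
        [sum(rotation[i][k] * p[k] for k in range(3)) + translation[i] for i in range(3)]
        for p in vertices
    ]


def recast_expression(mean, id_basis, exp_basis, alpha_src, beta_drv, rotation_src, t_src):
    """Keep the source identity and head pose; swap in the driving expression."""
    return apply_pose(face_shape(mean, id_basis, exp_basis, alpha_src, beta_drv),
                      rotation_src, t_src)


if __name__ == "__main__":
    mean = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    id_basis = [[[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]]
    exp_basis = [[[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]]
    print(recast_expression(mean, id_basis, exp_basis, [2.0], [0.5],
                            yaw_rotation(0.0), [0.0, 0.0, 0.0]))
```

Because expression enters only through `beta`, replacing it leaves identity and pose untouched, which is the kind of independent control the abstract describes.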

Keywords: portrait video editing, expression disentanglement, head pose disentanglement, 3D Morphable Face Model, 3DMM, performance recasting, facial animation, video generation

170. ❌ BALM: A Model-Agnostic Framework for Balanced Multimodal Learning under Imbalanced Missing Rates

Authors: Phuong-Anh Nguyen, Tien Anh Pham, Duc-Trong Le, Cam-Van Thi Nguyen Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19718v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

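Scoring rationale: BALM addresses imbalance and missing data in multimodal learning and proposes a model-agnostic framework that balances feature and gradient learning. The scoring keywords all concern large models, deep learning principles, or scientific AI applications, whereas this paper studies a general multimodal learning framework (applied to emotion recognition) and does not involve large-model techniques, specific training methods (such as pre-training, fine-tuning, or alignment), inference optimization, agent systems, model compression, or AI applications in science. All keywords therefore receive 0 points.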

!!! tip deepseek-chat TL;DR

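The paper targets the distortion of representation learning and gradient dynamics caused by imbalanced missing rates in multimodal learning. It proposes BALM, a model-agnostic framework whose feature calibration and gradient rebalancing modules improve the robustness and performance of multimodal emotion recognition models under a range of missing and imbalance settings.

Abstract (Chinese translation)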

多模态学习常受模态不平衡问题困扰：信息丰富的模态主导优化过程，而较弱或部分缺失的模态贡献有限。这种不平衡在具有不平衡缺失率（IMR）的现实场景中尤为严重——各模态以不同概率缺失，扭曲了表征学习和梯度动态。本文从训练过程视角重新审视该问题，提出一种与模型无关的即插即用框架BALM，以实现IMR下的平衡多模态学习。该框架包含两个互补模块：特征校准模块（FCM）利用全局上下文重新校准单模态特征，以在异构缺失模式间建立共享表征基础；梯度再平衡模块（GRM）通过从分布和空间双重角度调节梯度幅度与方向，平衡跨模态的学习动态。BALM可无缝集成到多种骨干网络（包括多模态情感识别模型）中，无需改变其架构。在多个多模态情感识别基准测试上的实验结果表明，BALM能在不同缺失与不平衡条件下持续增强模型鲁棒性并提升性能。代码发布于：https://github.com/np4s/BALM_CVPR2026.git

Abstract

Learning from multiple modalities often suffers from imbalance, where information-rich modalities dominate optimization while weaker or partially missing modalities contribute less. This imbalance becomes severe in realistic settings with imbalanced missing rates (IMR), where each modality is absent with different probabilities, distorting representation learning and gradient dynamics. We revisit this issue from a training-process perspective and propose BALM, a model-agnostic plug-in framework to achieve balanced multimodal learning under IMR. The framework comprises two complementary modules: the Feature Calibration Module (FCM), which recalibrates unimodal features using global context to establish a shared representation basis across heterogeneous missing patterns; the Gradient Rebalancing Module (GRM), which balances learning dynamics across modalities by modulating gradient magnitudes and directions from both distributional and spatial perspectives. BALM can be seamlessly integrated into diverse backbones, including multimodal emotion recognition (MER) models, without altering their architectures. Experimental results across multiple MER benchmarks confirm that BALM consistently enhances robustness and improves performance under diverse missing and imbalance settings. Code available at: https://github.com/np4s/BALM_CVPR2026.git
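The imbalanced-missing-rate (IMR) setting defined in the abstract, where each modality is absent with its own probability, can be simulated with a short sketch. The modality names, the rates, and the rule that at least one modality is always kept are assumptions for illustration; the abstract does not describe how BALM's benchmarks construct their masks.

```python
import random
from typing import Dict, List


def sample_modality_masks(num_samples: int, missing_rates: Dict[str, float],
                          seed: int = 0) -> List[Dict[str, bool]]:
    """Draw a presence mask per sample; True means the modality is available.

    Each modality is dropped independently with its own probability.
    Assumption: a sample with every modality dropped is unusable, so the
    modality with the lowest missing rate is restored in that case.
    """
    rng = random.Random(seed)
    most_reliable = min(missing_rates, key=missing_rates.get)
    masks = []
    for _ in range(num_samples):
        mask = {name: rng.random() >= rate for name, rate in missing_rates.items()}
        if not any(mask.values()):
            mask[most_reliable] = True
        masks.append(mask)
    return masks


if __name__ == "__main__":
    rates = {"audio": 0.2, "text": 0.5, "video": 0.8}  # hypothetical rates
    masks = sample_modality_masks(5000, rates)
    for name in rates:
        observed = sum(not m[name] for m in masks) / len(masks)
        print(f"{name}: observed missing rate {observed:.2f}")
```

Under such masks the rarely available modality contributes far fewer training signals, which is the imbalance that feature calibration and gradient rebalancing are meant to counteract.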

Keywords: multimodal learning, imbalanced missing rates, feature calibration, gradient rebalancing, model-agnostic framework, multimodal emotion recognition, robustness, representation learning

171. ❌ Demographic-Aware Self-Supervised Anomaly Detection Pretraining for Equitable Rare Cardiac Diagnosis

Authors: Chaoqin Huang, Zi Zeng, Aofan Jiang, Yuchen Xu, Qing Cao, Kang Chen, Chenfei Chi, Yanfeng Wang, Ya Zhang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19695v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	10.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	10.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

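Scoring rationale: The paper applies AI to biomedical signal (electrocardiogram) analysis, which falls under AI for Science, so that keyword receives 10 points. The method is a two-stage framework: the first stage is self-supervised anomaly detection pre-training (Pre-training) and the second is fine-tuning with an asymmetric loss (Supervised Fine-tuning); both keywords are highly relevant and receive 10 points each. The paper also produces anomaly score maps for localization, which touches on model interpretability, so Explainable AI receives 5 points. The remaining keywords mainly concern LLM-specific techniques (such as MoE, RLHF, and RAG), reasoning methods (such as CoT), agent systems, or general world models; the paper does not address these, so they receive 0 points.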

!!! tip deepseek-chat TL;DR

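The study tackles the detection of rare cardiac anomalies in electrocardiograms (ECGs). It proposes an AI framework that combines self-supervised anomaly detection pre-training with demographic-aware representation learning; on more than one million clinical ECGs it reaches an AUROC of 94.7% for rare anomalies and narrows the common-rare performance gap by 73%, while keeping diagnostic accuracy consistent across age and sex groups.

Abstract (Chinese translation)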

罕见心脏异常因其长尾分布（病例数极少）且诊断性能存在人口统计学差异，难以通过心电图（ECG）检测。这些局限性导致识别延迟与诊疗质量不均，亟需一种泛化性强的框架，在提升敏感度的同时确保跨人群的公平性。本研究开发了一种人工智能辅助的两阶段心电图分析框架，整合了自监督异常检测与人口统计学感知表征学习。第一阶段通过重建掩蔽的全局与局部心电图信号、建模信号趋势并预测患者属性，进行自监督异常检测预训练，从而在没有诊断标签的情况下学习稳健的心电图表征。随后，该预训练模型通过非对称损失进行多标签心电图分类的微调，以更好地处理长尾心脏异常，并额外生成用于定位的异常评分图；基于CPU的优化使其具备实际部署可行性。在一个包含超百万例临床心电图的纵向队列中评估显示，本方法对罕见异常的受试者工作特征曲线下面积（AUROC）达到94.7%，将常见与罕见异常的性能差距缩小了73%，同时在年龄与性别组间保持一致的诊断准确性。综上所述，所提出的公平感知人工智能框架展现出强大的临床实用性、可解释的异常定位能力以及在多队列中的可扩展性能，凸显了其在减轻诊断差异、推动生物医学信号与数字健康领域公平异常检测方面的潜力。源代码发布于https://github.com/MediaBrain-SJTU/Rare-ECG。

Abstract

Rare cardiac anomalies are difficult to detect from electrocardiograms (ECGs) due to their long-tailed distribution with extremely limited case counts and demographic disparities in diagnostic performance. These limitations contribute to delayed recognition and uneven quality of care, creating an urgent need for a generalizable framework that enhances sensitivity while ensuring equity across diverse populations. In this study, we developed an AI-assisted two-stage ECG framework integrating self-supervised anomaly detection with demographic-aware representation learning. The first stage performs self-supervised anomaly detection pretraining by reconstructing masked global and local ECG signals, modeling signal trends, and predicting patient attributes to learn robust ECG representations without diagnostic labels. The pretrained model is then fine-tuned for multi-label ECG classification using asymmetric loss to better handle long-tail cardiac abnormalities, and additionally produces anomaly score maps for localization, with CPU-based optimization enabling practical deployment. Evaluated on a longitudinal cohort of over one million clinical ECGs, our method achieves an AUROC of 94.7% for rare anomalies and reduces the common-rare performance gap by 73%, while maintaining consistent diagnostic accuracy across age and sex groups. In conclusion, the proposed equity-aware AI framework demonstrates strong clinical utility, interpretable anomaly localization, and scalable performance across multiple cohorts, highlighting its potential to mitigate diagnostic disparities and advance equitable anomaly detection in biomedical signals and digital health. Source code is available at https://github.com/MediaBrain-SJTU/Rare-ECG.
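The abstract states that fine-tuning uses an asymmetric loss to handle long-tailed labels. A minimal sketch of one widely used asymmetric loss for multi-label classification follows: positives and negatives get separate focusing exponents, and negatives with probability below a margin are ignored. Whether the paper uses exactly this form, and the default values shown, are assumptions.

```python
import math
from typing import Sequence


def asymmetric_loss(probs: Sequence[float], targets: Sequence[int],
                    gamma_pos: float = 0.0, gamma_neg: float = 4.0,
                    margin: float = 0.05, eps: float = 1e-8) -> float:
    """Sum of per-label asymmetric losses for one multi-label sample.

    Positive label:  -(1 - p)^gamma_pos * log(p)
    Negative label:  -(p_m)^gamma_neg * log(1 - p_m), with p_m = max(p - margin, 0)
    A larger gamma_neg down-weights easy negatives, which dominate when most
    labels (for example rare diagnoses) are absent in most samples.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            shifted = max(p - margin, 0.0)
            total -= shifted ** gamma_neg * math.log(max(1.0 - shifted, eps))
    return total


if __name__ == "__main__":
    # One rare positive label and two negatives (one easy, one hard).
    print(asymmetric_loss([0.30, 0.03, 0.90], [1, 0, 0]))
```

With these settings a confidently rejected negative contributes nothing, while a missed positive keeps its full cross-entropy penalty.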

Keywords: ECG anomaly detection, self-supervised learning, demographic-aware representation, long-tailed distribution, equitable diagnosis, pretraining, fine-tuning, clinical AI

172. ❌ TSegAgent: Zero-Shot Tooth Segmentation via Geometry-Aware Vision-Language Agents

Authors: Shaojie Zhuang, Lu Yin, Guangshun Wei, Yunpeng Li, Xilu Wang, Yuanfeng Zhou Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19684v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	5.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	5.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	5.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

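Scoring rationale: TSegAgent proposes a zero-shot tooth segmentation method built on geometry-aware vision-language agents. Its core innovation is to recast tooth segmentation as a geometric reasoning problem rather than a purely data-driven recognition task. The method draws on the representational capacity of general-purpose foundation models (such as vision-language models) and combines it with geometric inductive biases from dental anatomy to achieve zero-shot segmentation. It is therefore highly relevant to "LLM Agents" (10 points), because the paper explicitly uses an agent framework for reasoning; highly relevant to "AI for Science" (10 points), as a dental bioinformatics application; relevant to "Foundation Models" (8 points), since it depends on general-purpose foundation models; somewhat related to "Chain of Thought" and "System 2 Thinking" (5 points each), since it involves multi-step geometric reasoning; and related to "Tool Use" (5 points), since the agent may call geometry processing tools. Other keywords such as MoE, SFT, and RAG are unrelated to the paper and receive 0 points.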

!!! tip deepseek-chat TL;DR

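The paper proposes TSegAgent, which recasts tooth segmentation as a geometric reasoning problem and combines foundation models with constraints from dental anatomy to achieve zero-shot, low-cost tooth segmentation and identification with strong generalization.

Abstract (Chinese translation)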

从口内扫描三维模型中自动进行牙齿分割与识别是数字牙科领域的基础性问题，然而现有方法大多依赖于使用密集标注数据集训练的任务专用三维神经网络，导致标注成本高昂且对未见过来源的扫描数据泛化能力有限。为此，我们提出TSegAgent，通过将牙齿分析重新定义为零样本几何推理问题而非纯粹数据驱动的识别任务，以应对这些挑战。其核心思想是将通用基础模型的表征能力与源自牙科解剖学的显式几何归纳偏置相结合。该框架不学习牙齿专用特征，而是利用多视角视觉抽象与基于几何的推理来推断牙齿实例及其身份，无需任务专用训练。通过显式编码牙弓结构及体积关系等结构约束，该方法降低了模糊情况下的不确定性，并减轻了对特定形状分布的过拟合。实验结果表明，这种以推理为导向的构建方式能够以较低的计算与标注成本实现准确可靠的牙齿分割与识别，并在多样化的、先前未见过的牙科扫描数据上展现出强大的泛化能力。

Abstract

Automatic tooth segmentation and identification from intra-oral scanned 3D models are fundamental problems in digital dentistry, yet most existing approaches rely on task-specific 3D neural networks trained with densely annotated datasets, resulting in high annotation cost and limited generalization to scans from unseen sources. Thus, we propose TSegAgent, which addresses these challenges by reformulating dental analysis as a zero-shot geometric reasoning problem rather than a purely data-driven recognition task. The key idea is to combine the representational capacity of general-purpose foundation models with explicit geometric inductive biases derived from dental anatomy. Instead of learning dental-specific features, the proposed framework leverages multi-view visual abstraction and geometry-grounded reasoning to infer tooth instances and identities without task-specific training. By explicitly encoding structural constraints such as dental arch organization and volumetric relationships, the method reduces uncertainty in ambiguous cases and mitigates overfitting to particular shape distributions. Experimental results demonstrate that this reasoning-oriented formulation enables accurate and reliable tooth segmentation and identification with low computational and annotation cost, while exhibiting strong generalization across diverse and previously unseen dental scans.

Keywords: zero-shot tooth segmentation, geometry-aware vision-language agents, dental analysis, geometric reasoning, foundation models, 3D models, digital dentistry, generalization

173. ❌ 3D Gaussian Splatting with Self-Constrained Priors for High Fidelity Surface Reconstruction

Authors: Takeshi Noda, Yu-Shen Liu, Zhizhong Han Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19682v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

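Scoring rationale: The paper focuses on 3D surface reconstruction in computer vision and graphics, specifically improvements to 3D Gaussian Splatting (3DGS). It does not touch on large language models (LLMs), innovations in deep-learning technical principles, applications of large models across domains, or any topic related to natural language processing, model training, alignment, reasoning, or agents. All scoring keywords are directly tied to large models and deep-learning techniques, whereas this is purely a 3D computer vision study, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

To address the limited accuracy of 3D Gaussian Splatting (3DGS) in surface reconstruction, the paper proposes a self-constrained prior: depth maps are fused into a TSDF grid that constrains the learning of the 3D Gaussians, yielding more accurate, high-fidelity surface reconstruction and outperforming existing methods on widely used benchmarks.

Abstract translation (Chinese)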

在辐射场建模领域，通过3D高斯溅射（3DGS）或神经辐射场（NeRF）技术，三维表面渲染已实现了革命性进展。尽管3DGS在渲染质量或速度方面已展现出优于NeRF的特性，但其在恢复高保真表面方面仍有改进空间。为解决这一问题，我们提出一种自约束先验方法，用以约束三维高斯函数的学习过程，旨在实现更精确的深度渲染。该自约束先验源自通过融合当前三维高斯模型渲染深度图所得的截断符号距离函数（TSDF）网格。该先验通过测算估计表面周围的距离场，构建一个以表面为中心的约束带，从而对三维高斯函数施加更具体的约束——例如剔除约束带外的高斯单元、推动高斯单元向表面靠拢，并以几何感知方式调控其不透明度增减。更重要的是，我们的先验能通过持续更新的深度图（通常具有更高精度和完整性）进行定期优化。此外，该先验还可逐步收窄约束带以强化约束力度。我们通过理论论证与广泛使用的基准测试评估，验证了本方法的有效性，并证明了其优于当前主流技术的性能表现。

摘要 (Abstract)

Rendering 3D surfaces has been revolutionized within the modeling of radiance fields through either 3DGS or NeRF. Although 3DGS has shown advantages over NeRF in terms of rendering quality or speed, there is still room for improvement in recovering high fidelity surfaces through 3DGS. To resolve this issue, we propose a self-constrained prior to constrain the learning of 3D Gaussians, aiming for more accurate depth rendering. Our self-constrained prior is derived from a TSDF grid that is obtained by fusing the depth maps rendered with current 3D Gaussians. The prior measures a distance field around the estimated surface, offering a band centered at the surface for imposing more specific constraints on 3D Gaussians, such as removing Gaussians outside the band, moving Gaussians closer to the surface, and encouraging larger or smaller opacity in a geometry-aware manner. More importantly, our prior can be regularly updated by the most recent depth images which are usually more accurate and complete. In addition, the prior can also progressively narrow the band to tighten the imposed constraints. We justify our idea and report our superiority over the state-of-the-art methods in evaluations on widely used benchmarks.

Keywords: 3D Gaussian Splatting, Surface Reconstruction, Self-Constrained Priors, TSDF Grid, Depth Rendering, High Fidelity, 3D Gaussians, Geometry-aware
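
The band-based constraints described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: an analytic sphere distance (`sphere_sdf`) stands in for the fused TSDF grid, and the function names, band width, and step size are all hypothetical. It shows two of the constraints the abstract names, removing Gaussians outside the band and moving the remaining ones toward the surface.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Signed distance to a sphere at the origin (assumed stand-in for a TSDF lookup)."""
    return np.linalg.norm(points, axis=1) - radius

def prune_outside_band(centers, sdf, band):
    """Keep only Gaussian centers whose absolute surface distance is within the band."""
    keep = np.abs(sdf(centers)) <= band
    return centers[keep], keep

def pull_toward_surface(centers, sdf, step=1.0, eps=1e-4):
    """Move centers along the numerical distance gradient by step * signed distance."""
    dist = sdf(centers)
    grad = np.zeros_like(centers)
    for axis in range(centers.shape[1]):
        offset = np.zeros(centers.shape[1])
        offset[axis] = eps
        # Central finite difference of the distance field along this axis.
        grad[:, axis] = (sdf(centers + offset) - sdf(centers - offset)) / (2 * eps)
    grad /= np.maximum(np.linalg.norm(grad, axis=1, keepdims=True), 1e-12)
    return centers - step * dist[:, None] * grad

# Three hypothetical Gaussian centers: two near the unit sphere, one far away.
centers = np.array([[1.05, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 2.0]])
kept, keep_mask = prune_outside_band(centers, sphere_sdf, band=0.2)
moved = pull_toward_surface(kept, sphere_sdf, step=1.0)
```

Shrinking `band` over successive iterations would correspond to the progressive tightening of the constraint that the abstract mentions.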

174. ❌ Unbiased Dynamic Multimodal Fusion

Authors: Shicai Wei, Kaijie Zhang, Luyi Chen, Tao He, Guiduo Duan Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19681v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

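Scoring rationale: The paper studies multimodal fusion and proposes an Unbiased Dynamic Multimodal Learning (UDML) framework that addresses problems in dynamic fusion through noise-aware uncertainty estimation and quantification of modality dependency bias. Its content covers multimodal learning, uncertainty estimation, and modality fusion, which belong to conventional machine learning and deep learning, but it does not involve large language models (LLMs), large-model technical principles (such as MoE, scaling laws, or training and alignment methods), large-model application techniques (such as RAG, inference acceleration, or agents), or AI-for-science applications. All scoring keywords relate to large models or to specific AI-for-science areas, whereas this work concerns a general multimodal learning method with no direct link to those keywords.

!!! tip deepseek-chat TL;DR

To address the performance degradation in dynamic multimodal fusion caused by inaccurate noise estimation and modality dependency bias, the paper proposes an unbiased dynamic multimodal learning framework that improves fusion through noise-aware uncertainty estimation and bias quantification, and validates it on multiple benchmark tasks.

Abstract translation (Chinese)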

传统多模态方法通常假设模态质量是静态的，这限制了其在动态现实场景中的适应性。因此，动态多模态方法被提出以评估模态质量并相应调整其贡献。然而，这些方法通常依赖经验性指标，在噪声水平极低或极高时无法有效衡量模态质量。此外，现有方法通常假设各模态的初始贡献相同，忽视了内在的模态依赖性偏差。这导致难以学习的模态会受到双重惩罚，使得动态融合的性能可能反而低于静态融合。为解决这些挑战，我们提出了无偏动态多模态学习（Unbiased Dynamic Multimodal Learning, UDML）框架。具体而言，我们引入了一种噪声感知不确定性估计器，该估计器向模态数据添加受控噪声，并从模态特征中预测其强度。这迫使模型学习特征损坏与噪声水平之间的明确对应关系，从而能够在低噪声和高噪声条件下实现准确的不确定性度量。此外，我们通过模态丢弃技术量化多模态网络内部固有的模态依赖偏差，并将其纳入权重机制中。这消除了对难学习模态的双重抑制效应。在多种多模态基准任务上进行的大量实验验证了所提UDML方法的有效性、多功能性和泛化能力。代码发布于https://github.com/shicaiwei123/UDML。

摘要 (Abstract)

Traditional multimodal methods often assume static modality quality, which limits their adaptability in dynamic real-world scenarios. Thus, dynamical multimodal methods are proposed to assess modality quality and adjust their contribution accordingly. However, they typically rely on empirical metrics, failing to measure the modality quality when noise levels are extremely low or high. Moreover, existing methods usually assume that the initial contribution of each modality is the same, neglecting the intrinsic modality dependency bias. As a result, the modality hard to learn would be doubly penalized, and the performance of dynamical fusion could be inferior to that of static fusion. To address these challenges, we propose the Unbiased Dynamic Multimodal Learning (UDML) framework. Specifically, we introduce a noise-aware uncertainty estimator that adds controlled noise to the modality data and predicts its intensity from the modality feature. This forces the model to learn a clear correspondence between feature corruption and noise level, allowing accurate uncertainty measure across both low- and high-noise conditions. Furthermore, we quantify the inherent modality reliance bias within multimodal networks via modality dropout and incorporate it into the weighting mechanism. This eliminates the dual suppression effect on the hard-to-learn modality. Extensive experiments across diverse multimodal benchmark tasks validate the effectiveness, versatility, and generalizability of the proposed UDML. The code is available at https://github.com/shicaiwei123/UDML.

Keywords: multimodal fusion, dynamic fusion, uncertainty estimation, modality quality, noise-aware, modality dependency bias, UDML, multimodal learning
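
The abstract says the uncertainty estimator adds controlled noise to the modality data and learns to predict its intensity. A minimal sketch of how such supervised pairs could be generated is shown here; the function name, the Gaussian noise model, and the uniform sampling of intensities are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

def make_noise_regression_pairs(features, rng, max_sigma=1.0):
    """Corrupt each sample with Gaussian noise of a known, randomly drawn intensity.

    Returns the noisy samples and the per-sample intensity, which an estimator
    could then be trained to regress from the corrupted input.
    """
    sigma = rng.uniform(0.0, max_sigma, size=(features.shape[0], 1))
    noisy = features + sigma * rng.standard_normal(features.shape)
    return noisy, sigma

rng = np.random.default_rng(0)
clean = np.zeros((4, 3))  # hypothetical batch of 4 samples with 3 features
noisy, sigma = make_noise_regression_pairs(clean, rng, max_sigma=0.5)
```

Because the intensity is drawn by the data pipeline itself, the regression target is known exactly at both very low and very high noise levels, which is the regime where the abstract says empirical metrics fail.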

175. ❌ Vision-Language Attribute Disentanglement and Reinforcement for Lifelong Person Re-Identification

Authors: Kunlun Xu, Haotong Cheng, Jiangmeng Li, Xu Zou, Jiahuan Zhou Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19678v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

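Scoring rationale: The paper studies lifelong person re-identification (LReID) and proposes VLADR, a method built on a vision-language model (VLM) that improves performance through multi-grain text attribute disentanglement and inter-domain cross-modal attribute reinforcement. It mainly concerns computer vision and cross-modal learning and is unrelated to most large language model (LLM) keywords. Only 'Pre-training OR Continual Pre-training OR Domain Adaptation' is somewhat relevant (5 points), because the paper draws on the VLM's pre-trained knowledge and involves inter-domain knowledge transfer (similar to domain adaptation), although this is not its core contribution. No other keyword is covered, so each is scored 0.

!!! tip deepseek-chat TL;DR

To address the under-use of fine-grained attribute knowledge in lifelong person re-identification, the paper proposes Vision-Language Attribute Disentanglement and Reinforcement (VLADR), which uses multi-grain text attribute disentanglement and inter-domain cross-modal alignment to improve anti-forgetting and generalization capacity by 1.9%-2.2% and 2.1%-2.5% over state-of-the-art methods.

Abstract translation (Chinese)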

终身行人重识别（LReID）旨在从不同域中学习，以获得统一的行人检索模型。现有的LReID方法通常侧重于从零开始学习或基于视觉分类预训练模型进行学习，而视觉语言模型（Vision-Language Model, VLM）已在多种任务中展现出泛化性知识。尽管现有方法可直接适配于VLM，但由于它们仅考虑全局感知学习，细粒度属性知识未能得到充分利用，导致知识获取与抗遗忘能力受限。为解决此问题，我们提出了一种新颖的VLM驱动的LReID方法，称为视觉语言属性解耦与增强（Vision-Language Attribute Disentanglement and Reinforcement, VLADR）。我们的核心思想是显式建模普遍共享的人体属性，以提升跨域知识迁移能力，从而有效利用历史知识来增强新知识学习并缓解遗忘。具体而言，VLADR包含一个多粒度文本属性解耦机制，用于挖掘图像的全局及多样化的局部文本属性。随后，我们设计了跨域跨模态属性增强方案，该方案引入跨模态属性对齐以指导视觉属性提取，并采用跨域属性对齐来实现细粒度知识迁移。实验结果表明，我们的VLADR在抗遗忘能力和泛化能力上分别以1.9%–2.2%和2.1%–2.5%的优势超越了现有最先进方法。源代码发布于https://github.com/zhoujiahuan1991/CVPR2026-VLADR。

摘要 (Abstract)

Lifelong person re-identification (LReID) aims to learn from varying domains to obtain a unified person retrieval model. Existing LReID approaches typically focus on learning from scratch or a visual classification-pretrained model, while the Vision-Language Model (VLM) has shown generalizable knowledge in a variety of tasks. Although existing methods can be directly adapted to the VLM, since they only consider global-aware learning, the fine-grained attribute knowledge is underleveraged, leading to limited acquisition and anti-forgetting capacity. To address this problem, we introduce a novel VLM-driven LReID approach named Vision-Language Attribute Disentanglement and Reinforcement (VLADR). Our key idea is to explicitly model the universally shared human attributes to improve inter-domain knowledge transfer, thereby effectively utilizing historical knowledge to reinforce new knowledge learning and alleviate forgetting. Specifically, VLADR includes a Multi-grain Text Attribute Disentanglement mechanism that mines the global and diverse local text attributes of an image. Then, an Inter-domain Cross-modal Attribute Reinforcement scheme is developed, which introduces cross-modal attribute alignment to guide visual attribute extraction and adopts inter-domain attribute alignment to achieve fine-grained knowledge transfer. Experimental results demonstrate that our VLADR outperforms the state-of-the-art methods by 1.9%-2.2% and 2.1%-2.5% on anti-forgetting and generalization capacity. Our source code is available at https://github.com/zhoujiahuan1991/CVPR2026-VLADR

Keywords: Lifelong Person Re-Identification, Vision-Language Model, Attribute Disentanglement, Cross-modal Alignment, Knowledge Transfer, Anti-forgetting, Fine-grained Attributes, Domain Adaptation

176. ❌ DynFlowDrive: Flow-Based Dynamic World Modeling for Autonomous Driving

Authors: Xiaolu Liu, Yicong Li, Song Wang, Junbo Chen, Angela Yao, Jianke Zhu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19675v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

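Scoring rationale: DynFlowDrive focuses on world models for autonomous driving and proposes a flow-based approach to dynamic world modeling. Its core contributions are a rectified flow formulation that models how scene states evolve under different driving actions and a stability-aware multi-mode trajectory selection strategy. The paper is unrelated to the vast majority of keywords (LLMs, MoE, RLHF, RAG, CoT, and so on), which mainly concern large language models, training techniques, and reasoning methods, whereas this work studies a specialized world model for autonomous driving. The only relevant keyword is 'World Models AND General World Models': the paper explicitly studies world models applied to autonomous driving, though not general world models, so it is rated 10 (highly relevant, but not general).

!!! tip deepseek-chat TL;DR

To address the difficulty existing world models have in reliably predicting trajectory-conditioned scene evolution for autonomous driving, the paper proposes DynFlowDrive, a latent world model with flow-based dynamics that learns a velocity field over scene states under driving actions to predict future states progressively. It also adds a stability-aware trajectory selection strategy and reports consistent improvements on the nuScenes and NavSim benchmarks with no additional inference overhead.

Abstract translation (Chinese)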

近年来，世界模型被引入自动驾驶系统以提升规划可靠性。现有方法通常通过外观生成或确定性回归来预测未来状态，这限制了其捕捉轨迹条件化场景演化的能力，并导致动作规划不可靠。为解决这一问题，我们提出DynFlowDrive，一种基于流动态的潜在世界模型，用于建模不同驾驶动作下世界状态的转移。通过采用修正流（rectified flow）公式，该模型学习了一个速度场，用于描述场景状态在不同驾驶动作下的变化规律，从而实现对未来潜在状态的渐进式预测。在此基础上，我们进一步提出一种稳定性感知的多模态轨迹选择策略，该策略根据场景转移的稳定性评估候选轨迹。在nuScenes和NavSim基准测试上的大量实验表明，该方法能在不引入额外推理开销的情况下，在不同驾驶框架中实现一致的性能提升。源代码将在https://github.com/xiaolul2/DynFlowDrive公开。

摘要 (Abstract)

Recently, world models have been incorporated into autonomous driving systems to improve planning reliability. Existing approaches typically predict future states through appearance generation or deterministic regression, which limits their ability to capture trajectory-conditioned scene evolution and leads to unreliable action planning. To address this, we propose DynFlowDrive, a latent world model that leverages flow-based dynamics to model the transition of world states under different driving actions. By adopting the rectified flow formulation, the model learns a velocity field that describes how the scene state changes under different driving actions, enabling progressive prediction of future latent states. Building upon this, we further introduce a stability-aware multi-mode trajectory selection strategy that evaluates candidate trajectories according to the stability of the induced scene transitions. Extensive experiments on the nuScenes and NavSim benchmarks demonstrate consistent improvements across diverse driving frameworks without introducing additional inference overhead. Source code will be available at https://github.com/xiaolul2/DynFlowDrive.

Keywords: World Models, Autonomous Driving, Flow-based Dynamics, Trajectory Planning, Scene Evolution, Stability-aware Selection, Latent State Prediction, Rectified Flow
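
The rectified flow idea mentioned in the abstract can be sketched generically: states are interpolated along a straight line, the regression target for the velocity field is the displacement between the endpoints, and future states are predicted by integrating the field in small steps. The code below is a toy illustration under those standard assumptions, not DynFlowDrive itself; `toy_velocity` is a hypothetical stand-in for a learned, action-conditioned network.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Straight-line interpolation between the current and future latent state."""
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    """Rectified flow regression target: the constant displacement along the line."""
    return x1 - x0

def euler_rollout(x0, velocity_fn, action, num_steps=4):
    """Progressively predict future states by Euler integration of the velocity field."""
    x = x0.copy()
    dt = 1.0 / num_steps
    states = [x.copy()]
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt, action)
        states.append(x.copy())
    return states

def toy_velocity(x, t, action):
    """Hypothetical velocity field: the state drifts by the action vector."""
    return action

x0 = np.zeros(2)
action = np.array([1.0, -0.5])
states = euler_rollout(x0, toy_velocity, action, num_steps=4)
midpoint = interpolate(x0, x0 + action, 0.5)
```

The intermediate entries of `states` play the role of the progressive latent predictions; a stability score over those transitions would be one way to compare candidate trajectories, though the paper's actual criterion is not given in the abstract.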

177. ❌ Making Video Models Adhere to User Intent with Minor Adjustments

Authors: Daniel Ajisafe, Eric Hedlin, Helge Rhodin, Kwang Moo Yi Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19672v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

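Scoring rationale: The paper studies control of user intent in text-to-video diffusion models, improving generation quality and adherence to control inputs by optimizing bounding boxes. Although it involves AI technology (video diffusion models), all scoring keywords specifically target large language models (LLMs) and related techniques (such as MoE, SFT, RLHF, RAG, and quantization), and the paper does not involve language models or text generation at all. It focuses on video generation and control in computer vision, with no direct link to the large-model principles, training methods, inference optimization, or agent systems named in the keywords.

!!! tip deepseek-chat TL;DR

The paper tackles adherence to user-provided bounding-box controls in text-to-video diffusion models: it optimizes the box positions to align with the model's internal attention maps, which significantly improves both generation quality and adherence to the control inputs.

Abstract translation (Chinese)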

随着文本到视频扩散模型近期取得显著进展，对其生成过程进行控制已引起广泛关注。一种常用的控制方式是通过边界框或布局来实现。然而，如何确保生成内容严格遵循这些控制输入仍是一个待解决的问题。在本研究中，我们发现通过对用户提供的边界框进行微调，既能提升生成质量，也能增强对控制输入的遵循度。这一目标仅需通过优化边界框，使其更好地与视频扩散模型的内部注意力图对齐，同时仔细平衡前景与背景的聚焦程度即可实现。从某种意义上说，我们是在将边界框调整至模型更熟悉的位置。令人惊讶的是，即使仅进行微小调整，生成质量也可能产生显著变化。为此，我们提出了一种可微分的平滑掩码方法，使边界框位置可优化，并设计了一种注意力最大化目标函数来调整边界框。我们进行了全面的实验，包括用户研究以验证方法的有效性。相关代码已在项目网页公开，以促进学术界的后续研究。

摘要 (Abstract)

With the recent drastic advancements in text-to-video diffusion models, controlling their generations has drawn interest. A popular way for control is through bounding boxes or layouts. However, enforcing adherence to these control inputs is still an open problem. In this work, we show that by slightly adjusting user-provided bounding boxes we can improve both the quality of generations and the adherence to the control inputs. This is achieved by simply optimizing the bounding boxes to better align with the internal attention maps of the video diffusion model while carefully balancing the focus on foreground and background. In a sense, we are modifying the bounding boxes to be at places where the model is familiar with. Surprisingly, we find that even with small modifications, the quality of generations can vary significantly. To do so, we propose a smooth mask to make the bounding box position differentiable and an attention-maximization objective that we use to alter the bounding boxes. We conduct thorough experiments, including a user study to validate the effectiveness of our method. Our code is made available on the project webpage to foster future research from the community.

Keywords: text-to-video diffusion models, bounding box control, attention maps, generation quality, user intent adherence, optimization, video generation, control inputs
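
The abstract describes a smooth mask that makes the bounding box position differentiable, paired with an attention-maximization objective. One common way to build such a mask is a product of sigmoids along each axis; the sketch below uses that construction as an assumption, since the abstract does not give the exact formula. All names, the normalized box convention, and the `sharpness` parameter are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth_box_mask(height, width, box, sharpness=20.0):
    """Soft mask for a box (x_min, y_min, x_max, y_max) in normalized [0, 1] coordinates.

    The mask varies smoothly with the box edges, so gradients with respect to
    the box coordinates are well defined (unlike a hard 0/1 mask).
    """
    x_min, y_min, x_max, y_max = box
    xs = (np.arange(width) + 0.5) / width
    ys = (np.arange(height) + 0.5) / height
    mask_x = sigmoid(sharpness * (xs - x_min)) * sigmoid(sharpness * (x_max - xs))
    mask_y = sigmoid(sharpness * (ys - y_min)) * sigmoid(sharpness * (y_max - ys))
    return mask_y[:, None] * mask_x[None, :]

def attention_in_box(attention, box, sharpness=20.0):
    """Fraction of total attention mass that falls under the soft box mask."""
    mask = smooth_box_mask(attention.shape[0], attention.shape[1], box, sharpness)
    return float((attention * mask).sum() / (attention.sum() + 1e-8))

# Hypothetical attention map with its mass on the right side of the frame.
attention = np.zeros((16, 16))
attention[4:12, 10:14] = 1.0
score_left = attention_in_box(attention, (0.1, 0.25, 0.4, 0.75))
score_right = attention_in_box(attention, (0.6, 0.25, 0.9, 0.75))
```

Maximizing a score like `attention_in_box` with respect to the box coordinates would nudge the box toward regions the model already attends to. How the paper balances foreground and background focus is not specified in the abstract and is not modeled here.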

Authors: Yichen Zeng, Hebaixu Wang, Meng Liu, Yu Zhou, Chen Gao, Kehan Chen, Gongping Huang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19660v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

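Scoring rationale: The paper studies audio-visual navigation and proposes the MAGNet model for targets that fall intermittently silent in continuous environments; it belongs to embodied AI and robot navigation. All scoring keywords focus on large language models (LLMs) and related techniques (such as MoE, RLHF, RAG, and quantization) or on specific AI-for-science applications (such as bioinformatics). The paper involves no LLM techniques, model architecture innovations, or AI-for-science applications, so the relevance of every keyword is 0.

!!! tip deepseek-chat TL;DR

The paper introduces the SAVN-CE task and the MAGNet model to handle the loss of goal information when targets fall intermittently silent during audio-visual navigation in continuous 3D environments. Experiments show that MAGNet significantly outperforms existing methods, with an absolute success-rate gain of up to 12.1%.

Abstract translation (Chinese)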

视听导航使具身智能体能够利用听觉与视觉线索，向发声目标移动。然而，现有方法大多依赖预计算的房间脉冲响应进行双耳音频渲染，将智能体限制在离散网格位置，导致空间不连续的观测。为构建更真实的设定，我们提出连续环境中的语义视听导航，使智能体可在三维空间中自由移动，并感知时空连贯的视听流。在此设定下，目标可能间歇性静默或完全停止发声，导致智能体丢失目标信息。为应对这一挑战，我们提出MAGNet——一种基于多模态Transformer的模型，该模型联合编码空间与语义目标表征，并将历史上下文与自运动线索相融合，以实现记忆增强的目标推理。综合实验表明，MAGNet显著优于现有先进方法，在成功率上实现最高12.1%的绝对提升。这些结果同时凸显了其对短时声音和远距离导航场景的鲁棒性。代码发布于https://github.com/yichenzeng24/SAVN-CE。

摘要 (Abstract)

Audio-visual navigation enables embodied agents to navigate toward sound-emitting targets by leveraging both auditory and visual cues. However, most existing approaches rely on precomputed room impulse responses (RIRs) for binaural audio rendering, restricting agents to discrete grid positions and leading to spatially discontinuous observations. To establish a more realistic setting, we introduce Semantic Audio-Visual Navigation in Continuous Environments (SAVN-CE), where agents can move freely in 3D spaces and perceive temporally and spatially coherent audio-visual streams. In this setting, targets may intermittently become silent or stop emitting sound entirely, causing agents to lose goal information. To tackle this challenge, we propose MAGNet, a multimodal transformer-based model that jointly encodes spatial and semantic goal representations and integrates historical context with self-motion cues to enable memory-augmented goal reasoning. Comprehensive experiments demonstrate that MAGNet significantly outperforms state-of-the-art methods, achieving up to a 12.1% absolute improvement in success rate. These results also highlight its robustness to short-duration sounds and long-distance navigation scenarios. The code is available at https://github.com/yichenzeng24/SAVN-CE.

Keywords: Audio-visual navigation, Continuous environments, Embodied agents, Multimodal transformer, Memory-augmented goal reasoning, Semantic goal representations, Self-motion cues, SAVN-CE

179. ❌ CS-MUNet: A Channel-Spatial Dual-Stream Mamba Network for Multi-Organ Segmentation

Authors: Yuyang Zheng, Mingda Zhang, Jianglong Qin, Qi Mo, Jingdan Pan, Haozhe Hu, Hongyi Huang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19659v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

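Scoring rationale: CS-MUNet focuses on medical image segmentation and proposes a Mamba-based channel-spatial dual-stream network for abdominal multi-organ segmentation. Its core is improving the use of Mamba (a state space model) for medical image segmentation, with a Boundary-Aware State Mamba module and a Channel Mamba State Aggregation module that raise segmentation performance. Among all keywords, only "AI for Science OR Bioinformatics OR Cheminformatics" is highly relevant, because the paper applies AI to biomedicine (specifically medical image analysis), which falls within "AI for Science". The other keywords all concern large language models (LLMs) and related techniques (training, alignment, reasoning, agents, and so on), whereas this work applies Mamba to a computer vision task and has no direct link to LLMs, so their relevance is 0.

!!! tip deepseek-chat TL;DR

Existing Mamba-based methods for abdominal organ segmentation neglect cross-channel anatomical semantic collaboration and lack an explicit boundary-aware feature fusion mechanism. The paper proposes CS-MUNet, which uses a Boundary-Aware State Mamba module and a Channel Mamba State Aggregation module to outperform existing methods on public benchmarks and to establish a new state space model (SSM) modeling paradigm for multi-organ segmentation.

Abstract translation (Chinese)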

近期基于Mamba架构的方法在腹部器官分割领域展现出潜力。然而，现有方法忽视了跨通道解剖语义协作，且缺乏显式的边界感知特征融合机制。为应对这些局限，我们提出CS-MUNet模型，其包含两个专门设计的模块。边界感知状态Mamba模块采用贝叶斯注意力框架生成像素级边界后验图，直接注入Mamba核心扫描参数，从而将边界感知嵌入SSM状态转移机制；同时双分支权重分配机制实现了全局与局部结构表征间的互补调制。通道Mamba状态聚合模块将通道维度重新定义为SSM序列维度，以数据驱动的方式显式建模跨通道解剖语义协作。在两个公开基准数据集上的实验表明，CS-MUNet在多项指标上持续优于现有先进方法，确立了一种新的SSM建模范式，可协同解决腹部多器官分割中的通道语义协作与边界感知特征融合问题。

摘要 (Abstract)

Recently Mamba-based methods have shown promise in abdominal organ segmentation. However, existing approaches neglect cross-channel anatomical semantic collaboration and lack explicit boundary-aware feature fusion mechanisms. To address these limitations, we propose CS-MUNet with two purpose-built modules. The Boundary-Aware State Mamba module employs a Bayesian-attention framework to generate pixel-level boundary posterior maps, injected directly into Mamba’s core scan parameters to embed boundary awareness into the SSM state transition mechanism, while dual-branch weight allocation enables complementary modulation between global and local structural representations. The Channel Mamba State Aggregation module redefines the channel dimension as the SSM sequence dimension to explicitly model cross-channel anatomical semantic collaboration in a data-driven manner. Experiments on two public benchmarks demonstrate that CS-MUNet consistently outperforms state-of-the-art methods across multiple metrics, establishing a new SSM modeling paradigm that jointly addresses channel semantic collaboration and boundary-aware feature fusion for abdominal multi-organ segmentation.

Keywords: Mamba, State Space Models, Multi-organ Segmentation, Medical Image Segmentation, Boundary Awareness, Channel-Spatial Dual-Stream, Abdominal Organ Segmentation, SSM Modeling

180. ❌ GravCal: Single-Image Calibration of IMU Gravity Priors with Per-Sample Confidence

Authors: Haichao Zhu, Qian Zhang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19654v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

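Scoring rationale: GravCal addresses single-image gravity calibration in computer vision and robotics, using a feedforward model to correct IMU gravity priors. It is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, or AI-for-science applications), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes GravCal, a feedforward model that corrects a noisy IMU gravity prior from a single RGB image and predicts a per-sample confidence score; in experiments it reduces mean angular error from 22.02° to 14.24°.

Abstract (Chinese translation)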

重力估计是视觉惯性感知、增强现实与机器人技术的基础，然而在存在线性加速度、振动及瞬态运动的情况下，来自惯性测量单元（IMU）的重力先验信息往往不可靠。现有方法通常直接从图像中估计重力，或假设惯性输入具有合理精度，对于如何通过单张图像校正含噪声重力先验这一实际问题，目前尚未得到充分解决。
本文提出GravCal——一种用于单图像重力先验校准的前馈模型。给定一张RGB图像和一个含噪声的重力先验，GravCal能够预测校正后的重力方向及逐样本置信度分数。该模型融合了两种互补的预测结果：包括对输入先验的残差校正，以及独立于先验的图像估计，并通过一个可学习的门控机制实现自适应融合。
大量实验表明，相较于原始惯性先验，GravCal取得了显著提升：它将平均角度误差从22.02°（IMU先验）降低至14.24°，且当先验信息严重失真时改进更为明显。我们还引入了一个新颖的数据集，包含超过14.8万帧图像，涵盖多样化场景和任意相机朝向，每帧均配有通过视觉惯性里程计（VIO）推导的真实重力数据以及经Mahony滤波器处理的IMU先验。可学习的门控机制输出与先验质量具有相关性，使其可作为下游系统有效的置信度参考信号。

Abstract

Gravity estimation is fundamental to visual-inertial perception, augmented reality, and robotics, yet gravity priors from IMUs are often unreliable under linear acceleration, vibration, and transient motion. Existing methods often estimate gravity directly from images or assume reasonably accurate inertial input, leaving the practical problem of correcting a noisy gravity prior from a single image largely unaddressed. We present GravCal, a feedforward model for single-image gravity prior calibration. Given one RGB image and a noisy gravity prior, GravCal predicts a corrected gravity direction and a per-sample confidence score. The model combines two complementary predictions, including a residual correction of the input prior and a prior-independent image estimate, and uses a learned gate to fuse them adaptively. Extensive experiments show strong gains over raw inertial priors: GravCal reduces mean angular error from 22.02° (IMU prior) to 14.24°, with larger improvements when the prior is severely corrupted. We also introduce a novel dataset of over 148K frames with paired VIO-derived ground-truth gravity and Mahony-filter IMU priors across diverse scenes and arbitrary camera orientations. The learned gate also correlates with prior quality, making it a useful confidence signal for downstream systems.

Keywords: gravity estimation, single-image calibration, IMU gravity prior, feedforward model, per-sample confidence, visual-inertial perception, robotics, augmented reality
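The abstract above describes fusing a residual-corrected prior with a prior-independent image estimate through a learned gate, and reports results as mean angular error. The sketch below illustrates only that idea and is not the authors' implementation: the function names, the scalar `gate`, the linear blend, and all numeric inputs are assumptions made for illustration, since the abstract does not specify the form of the fusion.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def angular_error_deg(a, b):
    """Angle in degrees between two gravity directions."""
    a, b = normalize(a), normalize(b)
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding error
    return math.degrees(math.acos(dot))


def fuse_gravity(prior, residual, image_estimate, gate):
    """Blend a residual-corrected prior with an image-only estimate.

    `gate` is a hypothetical scalar in [0, 1]: 1.0 trusts the corrected
    prior, 0.0 trusts the image-only estimate. The linear blend is an
    assumption; the paper's gate may work differently.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    corrected = normalize([p + r for p, r in zip(prior, residual)])
    image_only = normalize(image_estimate)
    blended = [gate * c + (1.0 - gate) * e
               for c, e in zip(corrected, image_only)]
    return normalize(blended)


if __name__ == "__main__":
    truth = [0.0, 0.0, -1.0]
    noisy_prior = [0.4, 0.0, -1.0]   # made-up corrupted IMU prior
    residual = [-0.2, 0.0, 0.0]      # made-up stand-in for a network output
    image_only = [0.05, 0.0, -1.0]   # made-up stand-in for a network output
    fused = fuse_gravity(noisy_prior, residual, image_only, gate=0.3)
    print("prior error (deg):", round(angular_error_deg(noisy_prior, truth), 2))
    print("fused error (deg):", round(angular_error_deg(fused, truth), 2))
```

A low gate value here means the prior is distrusted, which matches the abstract's remark that the gate correlates with prior quality and can serve as a confidence signal.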

181. ❌ UniBioTransfer: A Unified Framework for Multiple Biometrics Transfer

Authors: Caiyi Sun, Yujing Sun, Xiangyu Li, Yuhang Zheng, Yiming Ren, Jiamin Wang, Yuexin Ma, Siu-Ming Yiu Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19637v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	10.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

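Scoring rationale: UniBioTransfer proposes a unified framework for multi-task deepface generation whose core innovation is BioMoE, a mixture-of-experts (MoE) based model that mitigates cross-task interference. It is therefore highly relevant to "Mixture of Experts OR MoE OR Sparse Models" (10). As an application of computer vision and generative models to biometric (face) processing, it has some connection to "AI for Science OR Bioinformatics OR Cheminformatics" (5), though it does not fall within core bioinformatics or cheminformatics. The paper does not touch on large language models (LLMs), model training techniques (such as pre-training, fine-tuning, or alignment), inference optimization, agents, interpretability, or the other keywords, so those are scored 0.

!!! tip deepseek-chat TL;DR

The paper proposes UniBioTransfer, a unified framework that uses a BioMoE (mixture-of-experts) model and a two-stage training strategy to address data scarcity and cross-task conflicts in multi-task deepface generation, and that outperforms existing methods across a range of tasks.

Abstract (Chinese translation)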

深度人脸生成传统上遵循任务驱动的范式，即通过特定任务的模型（如人脸迁移和头发迁移）分别处理不同任务。然而，这种单任务设定严重限制了模型的泛化能力和可扩展性。能够单次处理多种深度人脸生成任务的统一模型代表了一个前景广阔且实用的方向，但由于数据稀缺以及异构属性变换引发的跨任务冲突，实现这一目标仍具挑战性。为此，我们提出了UniBioTransfer，这是首个能够同时处理传统深度人脸任务（如人脸迁移和人脸重演）以及形状变化转换（如头发迁移和头部迁移）的统一框架。此外，UniBioTransfer能够自然地泛化至未见任务（如嘴唇、眼睛和眼镜迁移），仅需极少微调。总体而言，UniBioTransfer通过统一的数据构建策略（包括一个专为头发等空间动态属性设计的基于交换的破坏机制）解决了多任务生成中的数据不足问题。它进一步通过创新的BioMoE（一种基于专家混合的模型，结合新颖的两阶段训练策略）缓解了跨任务干扰，从而有效解耦了任务特定知识。大量实验证明了UniBioTransfer在广泛深度人脸生成任务中的有效性、泛化能力和可扩展性，其性能超越了现有的统一模型和任务特定方法。项目页面位于 https://scy639.github.io/UniBioTransfer.github.io/

Abstract

Deepface generation has traditionally followed a task-driven paradigm, where distinct tasks (e.g., face transfer and hair transfer) are addressed by task-specific models. Nevertheless, this single-task setting severely limits model generalization and scalability. A unified model capable of solving multiple deepface generation tasks in a single pass represents a promising and practical direction, yet remains challenging due to data scarcity and cross-task conflicts arising from heterogeneous attribute transformations. To this end, we propose UniBioTransfer, the first unified framework capable of handling both conventional deepface tasks (e.g., face transfer and face reenactment) and shape-varying transformations (e.g., hair transfer and head transfer). Besides, UniBioTransfer naturally generalizes to unseen tasks, like lip, eye, and glasses transfer, with minimal fine-tuning. Generally, UniBioTransfer addresses data insufficiency in multi-task generation through a unified data construction strategy, including a swapping-based corruption mechanism designed for spatially dynamic attributes like hair. It further mitigates cross-task interference via an innovative BioMoE, a mixture-of-experts based model coupled with a novel two-stage training strategy that effectively disentangles task-specific knowledge. Extensive experiments demonstrate the effectiveness, generalization, and scalability of UniBioTransfer, outperforming both existing unified models and task-specific methods across a wide range of deepface generation tasks. Project page is at https://scy639.github.io/UniBioTransfer.github.io/

Keywords: UniBioTransfer, deepface generation, multi-task learning, Mixture of Experts (MoE), BioMoE, face transfer, hair transfer, unified framework

182. ❌ IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1

Authors: Jun Wang, Xiaoyan Huang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19625v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

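Scoring rationale: The paper "IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment v1" addresses relative pose estimation in computer vision, proposing a geometry-driven decoupled iterative framework for real-time SLAM, visual localization, and 3D reconstruction. Its core techniques (a multi-head bi-directional cross-attention module, a decoupled rotation-translation pipeline, uncertainty propagation, and implicit dense alignment) are unrelated to every scoring keyword, all of which center on large models, deep learning techniques, and their applications in science. The paper does not cover large models, language models, alignment, fine-tuning, inference acceleration, or agent systems, nor is it applied to bioinformatics, cheminformatics, or other scientific domains. Every keyword therefore receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses a trade-off in relative pose estimation: feature-matching pipelines are non-differentiable, while ViT regressors are computationally expensive. It proposes IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment, which reaches 73.3% AUC@20deg on MegaDepth1500 at 70 FPS with only 37M parameters, a favorable accuracy-efficiency trade-off for real-time edge deployment.

Abstract (Chinese translation)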

相对位姿估计是SLAM、视觉定位与三维重建的基础任务。现有相对位姿回归方法面临关键权衡：基于特征匹配的流程虽能实现高精度，但通过不可微的RANSAC阻断了梯度流；而基于ViT的回归器虽可端到端训练，但其计算成本过高，难以满足实时部署需求。我们认为其核心瓶颈在于旋转与平移估计的耦合性以及跨视图特征对齐不足。为此，我们提出IUP-Pose——一种具有隐式密集对齐能力的几何驱动解耦迭代框架。轻量级多头双向交叉注意力模块无需显式匹配监督即可实现跨视图特征对齐。对齐后的特征通过解耦的旋转-平移流程处理：两个共享参数的旋转阶段通过不确定性迭代优化旋转估计，随后特征图通过旋转单应性矩阵H_inf进行重对齐，再预测平移向量。IUP-Pose在MegaDepth1500数据集上达到73.3%的AUC@20deg精度，具备完全端到端可微性，以70 FPS的吞吐量和仅37M参数量，展现了适用于实时边缘部署的优异精度-效率平衡。

Abstract

Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.

Keywords: Relative Pose Regression, Implicit Dense Alignment, Decoupled Iterative Framework, Multi-Head Bi-Cross Attention, Uncertainty Propagation, Real-time Deployment, SLAM, 3D Reconstruction
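The abstract above mentions realigning feature maps via the rotational homography H_inf before translation is predicted. As a rough illustration, the sketch below builds the standard infinite homography H_inf = K R K^{-1} for a pure rotation and applies it to a single pixel. It assumes both views share the same pinhole intrinsics and that R maps view-1 rays to view-2 rays; the helper names and numeric values are hypothetical, and the abstract does not say how IUP-Pose applies the warp to feature maps.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def intrinsics(fx, fy, cx, cy):
    """Pinhole intrinsic matrix K (zero skew assumed)."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def intrinsics_inverse(fx, fy, cx, cy):
    """Closed-form inverse of the zero-skew K above."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def rotation_z(theta):
    """Rotation by theta radians about the optical (z) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rotational_homography(fx, fy, cx, cy, R):
    """H_inf = K R K^-1: maps view-1 pixels to view-2 pixels under pure rotation."""
    K = intrinsics(fx, fy, cx, cy)
    K_inv = intrinsics_inverse(fx, fy, cx, cy)
    return matmul3(matmul3(K, R), K_inv)


def warp_pixel(H, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w


if __name__ == "__main__":
    # Made-up intrinsics and a 10-degree in-plane rotation.
    H = rotational_homography(100.0, 100.0, 50.0, 50.0,
                              rotation_z(math.radians(10.0)))
    print(warp_pixel(H, 150.0, 50.0))
```

Because H_inf depends only on rotation and intrinsics, warping with it removes the rotational part of the apparent motion, which is one plausible reason to estimate rotation first and translation afterwards.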

183. ❌ Disentangle-then-Align: Non-Iterative Hybrid Multimodal Image Registration via Cross-Scale Feature Disentanglement

Authors: Chunlei Zhang, Jiahao Xia, Yun Xiao, Bo Jiang, Jian Zhang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19623v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

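Scoring rationale: The paper addresses multimodal image registration, a computer vision task, and proposes a deep learning method that combines feature disentanglement with hybrid parameter prediction. It is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, and so on), which mainly target large language models in natural language processing. The only potentially relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since medical image registration can be viewed as an application of AI in science, biomedicine in particular. However, the paper does not emphasize a scientific application context and only validates a computer vision method on medical image datasets, so it receives 5 (some relevance).

!!! tip deepseek-chat TL;DR

The paper proposes HRNet, a non-iterative hybrid multimodal image registration network that learns a stable shared feature space through cross-scale feature disentanglement and jointly predicts global rigid parameters and a local deformation field, achieving state-of-the-art registration performance on multiple datasets.

Abstract (Chinese translation)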

多模态图像配准是一项基础任务，也是下游跨模态分析的前提。尽管在共享特征提取与多尺度架构方面已取得进展，当前方法仍存在两个关键局限：其一，部分方法虽利用解耦学习共享特征，但主要对共享部分进行正则化，导致模态私有信息泄露至共享空间；其二，多数多尺度框架仅支持单一变换类型，当全局错位与局部形变并存时适用性受限。为解决这些问题，我们将混合多模态配准形式化为联合学习稳定共享特征空间与统一混合变换的任务。基于此视角，我们提出HRNet——一种将表征解耦与混合参数预测相耦合的混合配准网络。该网络采用配备模态特定批归一化（MSBN）的共享主干提取多尺度特征，同时通过跨尺度解耦与自适应投影（CDAP）模块抑制模态私有信息，并将共享特征投影至稳定子空间以进行匹配。在此共享空间基础上，混合参数预测模块（HPPM）以非迭代式由粗到细的方式同步估计全局刚性参数与形变场，进而融合为连贯的形变场。在四个多模态数据集上的大量实验表明，本方法在刚性及非刚性配准任务中均达到最先进性能。代码已在项目网站公开。

Abstract

Multimodal image registration is a fundamental task and a prerequisite for downstream cross-modal analysis. Despite recent progress in shared feature extraction and multi-scale architectures, two key limitations remain. First, some methods use disentanglement to learn shared features but mainly regularize the shared part, allowing modality-private cues to leak into the shared space. Second, most multi-scale frameworks support only a single transformation type, limiting their applicability when global misalignment and local deformation coexist. To address these issues, we formulate hybrid multimodal registration as jointly learning a stable shared feature space and a unified hybrid transformation. Based on this view, we propose HRNet, a Hybrid Registration Network that couples representation disentanglement with hybrid parameter prediction. A shared backbone with Modality-Specific Batch Normalization (MSBN) extracts multi-scale features, while a Cross-scale Disentanglement and Adaptive Projection (CDAP) module suppresses modality-private cues and projects shared features into a stable subspace for matching. Built on this shared space, a Hybrid Parameter Prediction Module (HPPM) performs non-iterative coarse-to-fine estimation of global rigid parameters and deformation fields, which are fused into a coherent deformation field. Extensive experiments on four multimodal datasets demonstrate state-of-the-art performance on rigid and non-rigid registration tasks. The code is available at the project website.

Keywords: multimodal image registration, feature disentanglement, hybrid transformation, non-iterative registration, cross-scale features, modality-specific normalization, deformation field estimation, medical image analysis

184. ❌ UniPR: Unified Object-level Real-to-Sim Perception and Reconstruction from a Single Stereo Pair

Authors: Chuanrui Zhang, Yingshuang Zou, ZhengXian Wu, Yonggen Ling, Yuxiao Yang, Ziwei Wang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19616v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

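Scoring rationale: UniPR addresses object perception and reconstruction in computer vision and robotics, proposing an end-to-end stereo framework that reconstructs all objects in a scene in parallel from a single stereo image pair. Its core contributions are geometric constraints, a pose-aware shape representation, and a large-scale dataset. The scoring keywords all concern large language models (LLMs) and related techniques (training, alignment, inference optimization, agent systems, and so on) or specific AI-for-science applications (such as bioinformatics). The paper involves no language models, no innovation in deep learning fundamentals, and no application of large models to other domains, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes UniPR, an end-to-end object-level real-to-sim perception and reconstruction framework that uses geometric constraints from a single stereo image pair to resolve scale ambiguity. It also builds a large-scale dataset, LVS6D, and achieves efficient, accurate parallel object reconstruction suited to robotics applications.

Abstract (Chinese translation)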

感知与重建图像中的物体对于虚实迁移任务至关重要，该任务在机器人领域应用广泛。现有方法依赖检测、分割、形状重建和姿态估计等多个子模块来完成流程，但这种模块化流程存在效率低下和误差累积的问题，因为每个阶段仅处理局部或经局部优化的信息，而丢弃了全局上下文。为克服这些局限，我们提出了首个端到端的物体级虚实感知与重建框架UniPR。该框架直接处理单对立体图像，利用几何约束解决尺度模糊性问题。我们引入了姿态感知形状表示方法，以消除对每类物体规范定义的需求，并弥合重建任务与姿态估计任务之间的鸿沟。此外，我们构建了一个大规模立体数据集LVS6D，包含超过6,300个物体，以推动该领域的大规模研究。大量实验表明，UniPR可在单次前向传播中并行重建场景中的所有物体，实现了显著的效率提升，并在不同物体类型间保持了真实的物理比例，凸显了其在机器人实际应用中的潜力。

Abstract

Perceiving and reconstructing objects from images are critical for real-to-sim transfer tasks, which are widely used in the robotics community. Existing methods rely on multiple submodules such as detection, segmentation, shape reconstruction, and pose estimation to complete the pipeline. However, such modular pipelines suffer from inefficiency and cumulative error, as each stage operates on only partial or locally refined information while discarding global context. To address these limitations, we propose UniPR, the first end-to-end object-level real-to-sim perception and reconstruction framework. Operating directly on a single stereo image pair, UniPR leverages geometric constraints to resolve the scale ambiguity. We introduce Pose-Aware Shape Representation to eliminate the need for per-category canonical definitions and to bridge the gap between reconstruction and pose estimation tasks. Furthermore, we construct a large-vocabulary stereo dataset, LVS6D, comprising over 6,300 objects, to facilitate large-scale research in this area. Extensive experiments demonstrate that UniPR reconstructs all objects in a scene in parallel within a single forward pass, achieving significant efficiency gains and preserves true physical proportions across diverse object types, highlighting its potential for practical robotic applications.

Keywords: object-level perception, real-to-sim reconstruction, stereo vision, end-to-end framework, pose-aware shape representation, scale ambiguity, LVS6D dataset, robotic applications

185. ❌ OrbitNVS: Harnessing Video Diffusion Priors for Novel View Synthesis

Authors: Jinglin Liang, Zijian Zhou, Rui Huang, Shuangping Huang, Yichen Gong Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19613v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	5.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

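Scoring rationale: OrbitNVS addresses novel view synthesis in computer vision, using a pre-trained video diffusion model as a prior and improving geometric and appearance consistency through model adaptation and training strategies. It is unrelated to most keywords, which mainly target large language models, alignment, reasoning, agents, and similar techniques. It has some connection only to "Pre-training" and "Post-training" (5 each), because it uses a pre-trained video model and performs task-specific training (similar to fine-tuning), though this is not core large-model technology. The other keywords are scored 0 because the paper does not cover language models, MoE, quantization, inference acceleration, AI-for-science applications, or related topics.

!!! tip deepseek-chat TL;DR

The paper proposes OrbitNVS, which reformulates novel view synthesis as an orbit video generation task and exploits the visual priors of a pre-trained video diffusion model. Combined with camera adapters, a normal map branch, and pixel-space supervision, it markedly improves synthesis quality and geometric consistency under single-view input and outperforms prior methods on benchmarks.

Abstract (Chinese translation)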

新视角合成（Novel View Synthesis, NVS）旨在通过有限数量的已知视角生成三维物体的未知视角。现有方法通常难以对未观测区域合成合理的视角，尤其在单视图输入条件下，并且在保持几何与外观一致性方面仍面临挑战。为解决这些问题，我们提出OrbitNVS，将NVS重新定义为轨道视频生成任务。通过定制化的模型设计与训练策略，我们将预训练的视频生成模型适配于NVS任务，利用其丰富的视觉先验实现高质量的视角合成。具体而言，我们在视频模型中引入相机适配器以实现精确的相机控制。为增强三维物体的两个关键属性——几何与外观，我们设计了法线图生成分支，并利用法线图特征通过注意力机制引导目标视角的合成，从而提升几何一致性。此外，我们采用像素空间监督来缓解潜在空间中因空间压缩导致的模糊外观问题。大量实验表明，OrbitNVS在GSO和OmniObject3D基准测试中显著优于先前方法，尤其在具有挑战性的单视图设置下（例如，PSNR分别提升2.9 dB和2.4 dB）。

Abstract

Novel View Synthesis (NVS) aims to generate unseen views of a 3D object given a limited number of known views. Existing methods often struggle to synthesize plausible views for unobserved regions, particularly under single-view input, and still face challenges in maintaining geometry- and appearance-consistency. To address these issues, we propose OrbitNVS, which reformulates NVS as an orbit video generation task. Through tailored model design and training strategies, we adapt a pre-trained video generation model to the NVS task, leveraging its rich visual priors to achieve high-quality view synthesis. Specifically, we incorporate camera adapters into the video model to enable accurate camera control. To enhance two key properties of 3D objects, geometry and appearance, we design a normal map generation branch and use normal map features to guide the synthesis of the target views via attention mechanism, thereby improving geometric consistency. Moreover, we apply a pixel-space supervision to alleviate blurry appearance caused by spatial compression in the latent space. Extensive experiments show that OrbitNVS significantly outperforms previous methods on the GSO and OmniObject3D benchmarks, especially in the challenging single-view setting (e.g., +2.9 dB and +2.4 dB PSNR).

Keywords: Novel View Synthesis, Video Diffusion Models, Orbit Video Generation, Camera Control, Normal Map Guidance, Geometry Consistency, Appearance Consistency, Single-view Synthesis
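The gains in the abstract above are reported in PSNR, so a short sketch of how PSNR is computed may help in reading them. It uses the standard definition PSNR = 10 · log10(MAX² / MSE) on flat lists of pixel values. The function names and sample values are made up, and the paper's exact evaluation protocol (resolution, color space, averaging) is not given in the abstract.

```python
import math


def mse(reference, rendered):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(reference) != len(rendered) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(reference, rendered)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain on one image."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    reference = [0.0, 0.5, 1.0, 0.25]   # made-up pixel intensities in [0, 1]
    rendered = [0.1, 0.4, 0.9, 0.30]    # made-up reconstruction
    print("PSNR (dB):", round(psnr(reference, rendered), 2))
    print("MSE ratio for +2.9 dB:", round(mse_ratio_for_gain(2.9), 2))
```

For a single image at a fixed peak value, a +2.9 dB gain corresponds to roughly halving the mean squared error. Benchmark PSNR is usually averaged over many images, so this conversion is only indicative there.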

186. ❌ ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding

Authors: Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19610v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	10.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

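Scoring rationale: ParallelVLM addresses inference acceleration for Video-LLMs; its core contribution is a training-free parallel speculative decoding framework that tackles inefficient decoding in long-video settings. It is therefore highly relevant to "Large Language Models OR LLMs OR Foundation Models" (10), since Video-LLMs apply LLMs to video, and to "Speculative Decoding OR Inference Acceleration" (10), since speculative decoding is the paper's core technique. The other keywords (MoE, SLMs, scaling laws, training methods such as pre-training, fine-tuning, and alignment, reasoning techniques such as CoT and MCTS, agents, quantization, hallucination mitigation, interpretability, AI for science, and so on) are neither covered nor mentioned in the paper, so they are scored 0.

!!! tip deepseek-chat TL;DR

To address inefficient autoregressive decoding in Video-LLMs for long-video understanding, the paper proposes ParallelVLM, a training-free parallel speculative decoding framework. By maximizing hardware utilization and applying an unbiased verifier-guided pruning strategy, it achieves decoding speedups of 3.36× on LLaVA-Onevision-72B and 2.42× on Qwen2.5-VL-32B.

Abstract (Chinese translation)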

尽管当前视频大语言模型在视频理解任务中取得了令人瞩目的性能，但其自回归解码效率仍受海量视频令牌数量的制约。视觉令牌剪枝可在一定程度上缓解这一瓶颈，然而现有方法仍存在信息损失问题，且仅能实现有限的解码加速。本文提出ParallelVLM，一种免训练的“草稿-验证”推测解码框架，该框架克服了长视频场景中草稿模型与目标模型之间相互等待及加速比受限的双重问题。ParallelVLM采用两个并行化阶段以最大化硬件利用率，并引入无偏验证器引导剪枝策略，通过消除注意力引导剪枝中的位置偏差，使草稿模型与目标模型更好对齐。大量实验表明，ParallelVLM在保持高接受长度的同时，将草稿窗口有效扩展了$1.6\sim1.8$倍；相比原始自回归解码，在LLaVA-Onevision-72B和Qwen2.5-VL-32B模型上分别实现了3.36倍和2.42倍的多项视频理解基准任务加速。

Abstract

Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this bottleneck, yet existing approaches still suffer from information loss and yield only modest acceleration in decoding. In this paper, we propose ParallelVLM, a training-free draft-then-verify speculative decoding framework that overcomes both mutual waiting and limited speedup-ratio problems between draft and target models in long-video settings. ParallelVLM features two parallelized stages that maximize hardware utilization and incorporate an Unbiased Verifier-Guided Pruning strategy to better align the draft and target models by eliminating the positional bias in attention-guided pruning. Extensive experiments demonstrate that ParallelVLM effectively expands the draft window by $1.6\sim1.8\times$ with high accepted lengths, and accelerates various video understanding benchmarks by 3.36$\times$ on LLaVA-Onevision-72B and 2.42$\times$ on Qwen2.5-VL-32B compared with vanilla autoregressive decoding.

Keywords: Video-LLM, Speculative Decoding, Inference Acceleration, Autoregressive Decoding, Parallel Decoding, Visual Token Pruning, Long-video Understanding, Hardware Utilization
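The abstract above builds on draft-then-verify speculative decoding. The sketch below shows the plain greedy, sequential form of that idea with toy next-token functions, to illustrate why the output matches ordinary autoregressive decoding ("lossless"). It is not ParallelVLM itself: the parallel stages and the verifier-guided pruning are not modeled, and `toy_target`, `toy_draft`, and the `window` parameter are hypothetical stand-ins.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Reference: plain autoregressive greedy decoding with the target model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, window=4):
    """Greedy draft-then-verify decoding.

    The draft model proposes up to `window` tokens; the target model checks
    them in order, keeps the matching prefix, and replaces the first
    mismatch with its own token.
    """
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft phase: the cheap model guesses a short continuation.
        draft = []
        for _ in range(min(window, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # Verify phase: a real system scores all drafted positions in one
        # batched target forward pass; this toy calls the target per position.
        for guess in draft:
            expected = target_next(tokens)
            tokens.append(expected)   # equals `guess` when accepted
            if expected != guess:
                break                 # discard the rest of the draft
    return tokens


def toy_target(tokens):
    """Hypothetical 'large' model: a deterministic next-token rule."""
    return (3 * tokens[-1] + len(tokens)) % 11


def toy_draft(tokens):
    """Hypothetical 'small' model: agrees with the target most of the time."""
    guess = toy_target(tokens)
    return guess if tokens[-1] % 3 else (guess + 1) % 11


if __name__ == "__main__":
    prompt = [1, 2]
    fast = speculative_decode(toy_target, toy_draft, prompt, 20, window=4)
    slow = greedy_decode(toy_target, prompt, 20)
    print(fast == slow)
```

Since every emitted token is the one the target model would have chosen, the draft model affects only speed. The speedup depends on how many drafted tokens are accepted per verification step, which is why the abstract reports draft window size and accepted length.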

187. ❌ Physion-Eval: Evaluating Physical Realism in Generated Video via Human Reasoning

Authors: Qin Zhang, Peiyu Jing, Hong-Xing Yu, Fangqiang Ding, Fan Nie, Weimin Wang, Yilun Du, James Zou, Jiajun Wu, Bing Shuai Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19607v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	10.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

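Scoring rationale: The paper addresses the evaluation of physical realism in video generation models and is unrelated to the vast majority of keywords (large-model techniques, training methods, inference optimization, alignment, applications, and so on). The only relevant keyword is "World Models AND General World Models": the paper explicitly studies video generation models as world simulators and evaluates their consistency with physical laws, which is closely tied to the concept of world models, so it receives 10. None of the other keywords appear in the title or abstract, the work involves no innovation in large-model or deep learning fundamentals, and it is not an AI application in a specific domain such as biomedicine, so they are scored 0.

!!! tip deepseek-chat TL;DR

The paper introduces the Physion-Eval benchmark for evaluating the physical realism of video generation models and finds that, in physics-critical scenarios, videos generated by current models commonly exhibit at least one human-identifiable physical glitch (83.3% of exocentric and 93.5% of egocentric videos).

Abstract (Chinese translation)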

视频生成模型正日益被用作叙事、仿真与具身人工智能的世界模拟器。随着这些模型的进步，一个关键问题随之产生：生成的视频是否遵循真实世界的物理定律？现有评估方法主要依赖自动化指标或粗略的人工判断，如偏好评估或基于量规的检查。这些方法虽对感知质量评估有一定作用，却难以深入揭示生成动态在何时、为何违反真实世界的物理约束。我们推出Physion-Eval，这是一个基于专家推理的大规模基准测试，旨在诊断五种前沿模型在自我中心视角与外部视角下生成视频的物理真实感缺陷。该数据集包含10,990条专家推理轨迹，涵盖22个细粒度物理类别。每个生成视频均源自一段描绘清晰物理过程的真实世界参考视频，并标注了时间定位的异常点、结构化的故障类别以及对所违反物理行为的自然语言解释。利用该数据集，我们揭示了当前视频生成模型的一个显著局限：在物理关键场景中，83.3%的外部视角生成视频和93.5%的自我中心视角生成视频至少存在一处人类可识别的物理异常。我们希望Physion-Eval能为物理真实感评估设立新标准，并推动基于物理原理的视频生成技术的发展。该基准测试已公开于https://huggingface.co/datasets/PhysionLabs/Physion-Eval。

Abstract

Video generation models are increasingly used as world simulators for storytelling, simulation, and embodied AI. As these models advance, a key question arises: do generated videos obey the physical laws of the real world? Existing evaluations largely rely on automated metrics or coarse human judgments such as preferences or rubric-based checks. While useful for assessing perceptual quality, these methods provide limited insight into when and why generated dynamics violate real-world physical constraints. We introduce Physion-Eval, a large-scale benchmark of expert human reasoning for diagnosing physical realism failures in videos generated by five state-of-the-art models across egocentric and exocentric views, containing 10,990 expert reasoning traces spanning 22 fine-grained physical categories. Each generated video is derived from a corresponding real-world reference video depicting a clear physical process, and annotated with temporally localized glitches, structured failure categories, and natural-language explanations of the violated physical behavior. Using this dataset, we reveal a striking limitation of current video generation models: in physics-critical scenarios, 83.3% of exocentric and 93.5% of egocentric generated videos exhibit at least one human-identifiable physical glitch. We hope Physion-Eval will set a new standard for physical realism evaluation and guide the development of physics-grounded video generation. The benchmark is publicly available at https://huggingface.co/datasets/PhysionLabs/Physion-Eval.

Keywords: video generation, physical realism, world simulators, evaluation benchmark, human reasoning, physical glitches, physics-grounded, Physion-Eval

188. ❌ Beyond Quadratic: Linear-Time Change Detection with RWKV

Authors: Zhenyu Yang, Gensheng Pei, Tao Chen, Xia Yuan, Haofeng Zhang, Xiangbo Shu, Yazhou Yao Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19606v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

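Scoring rationale: The paper addresses remote sensing change detection, a computer vision task, and proposes ChangeRWKV, a new model built on the RWKV architecture. Its core contribution is a vision architecture that combines the strengths of Transformers and RNNs, not an advance in large language models or deep-learning fundamentals. The keywords concern large-model topics such as language models, alignment, reasoning, and agents, which have no direct bearing on this vision application. Only 'AI for Science' receives a relevance of 5, because remote sensing is a scientific application area, although the paper is not a study of large models applied to science.

!!! tip deepseek-chat TL;DR

The paper addresses the trade-off in remote sensing change detection between CNNs, which lack global context, and Transformers, which are computationally expensive. It proposes ChangeRWKV, built on the RWKV architecture, which achieves state-of-the-art performance on the LEVIR-CD benchmark (85.46% IoU, 92.16% F1) while substantially reducing parameters and FLOPs.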

Abstract

Existing paradigms for remote sensing change detection are caught in a trade-off: CNNs excel at efficiency but lack global context, while Transformers capture long-range dependencies at a prohibitive computational cost. This paper introduces ChangeRWKV, a new architecture that reconciles this conflict. By building upon the Receptance Weighted Key Value (RWKV) framework, our ChangeRWKV uniquely combines the parallelizable training of Transformers with the linear-time inference of RNNs. Our approach core features two key innovations: a hierarchical RWKV encoder that builds multi-resolution feature representation, and a novel Spatial-Temporal Fusion Module (STFM) engineered to resolve spatial misalignments across scales while distilling fine-grained temporal discrepancies. ChangeRWKV not only achieves state-of-the-art performance on the LEVIR-CD benchmark, with an 85.46% IoU and 92.16% F1 score, but does so while drastically reducing parameters and FLOPs compared to previous leading methods. This work demonstrates a new, efficient, and powerful paradigm for operational-scale change detection. Our code and model are publicly available.

Keywords: Change Detection, RWKV, Remote Sensing, Linear-Time Inference, Spatial-Temporal Fusion, LEVIR-CD, Computer Vision, Efficient Architecture
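
For a single binary change mask, IoU and F1 are linked by F1 = 2·IoU / (1 + IoU), so the two LEVIR-CD figures quoted in the abstract can be cross-checked against each other. The sketch below assumes both metrics are computed for the same change class from the same pixel counts; the abstract does not state the evaluation protocol. The helper name and the pixel counts are illustrative.

```python
def f1_from_iou(iou: float) -> float:
    """Convert IoU to F1 (Dice) for one class: F1 = 2*IoU / (1 + IoU)."""
    return 2.0 * iou / (1.0 + iou)


# Sanity check of the identity from raw pixel counts (hypothetical numbers).
tp, fp, fn = 80, 10, 10
iou = tp / (tp + fp + fn)           # intersection over union
f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of precision and recall
assert abs(f1_from_iou(iou) - f1) < 1e-12

reported_iou = 0.8546  # IoU quoted in the abstract
print(f"F1 implied by IoU {reported_iou:.4f}: {f1_from_iou(reported_iou):.4f}")
```

The implied F1 is about 0.9216, which agrees with the 92.16% F1 reported alongside the 85.46% IoU.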

189. ❌ K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups

Authors: ZhiMing Li Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19601v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

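Scoring rationale: The paper "K-GMRF: Kinetic Gauss-Markov Random Field for First-Principles Covariance Tracking on Lie Groups" addresses covariance-matrix tracking in computer vision and proposes an online, training-free framework based on Lie groups and rigid-body dynamics. Its core content involves geometric methods, kinematic modeling, differential geometry, and visual tracking, and is unrelated to all of the scoring keywords, which center on large models, deep-learning fundamentals, and their applications. The paper does not mention large models, language models, training techniques, alignment methods, inference acceleration, AI agents, or AI-for-science applications, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes K-GMRF, an online covariance-tracking framework based on Lie groups and rigid-body dynamics. Its second-order dynamics achieve zero steady-state error under constant rotation, and it improves tracking accuracy and robustness on synthetic data, SO(3) stabilization, and motion-blur sequences.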

Abstract

Tracking non-stationary covariance matrices is fundamental to vision yet hindered by existing estimators that either neglect manifold constraints or rely on first-order updates, incurring inevitable phase lag during rapid evolution. We propose K-GMRF, an online, training-free framework for covariance tracking that reformulates the problem as forced rigid-body motion on Lie groups. Derived from the Euler-Poincaré equations, our method interprets observations as torques driving a latent angular velocity, propagated via a structure-preserving symplectic integrator. We theoretically prove that this second-order dynamics achieves zero steady-state error under constant rotation, strictly superior to the proportional lag of first-order baselines. Validation across three domains demonstrates robust tracking fidelity: (i) on synthetic ellipses, K-GMRF reduces angular error by 30x compared to Riemannian EMA while maintaining stability at high speeds; (ii) on SO(3) stabilization with 20% dropout, it decreases geodesic error from 29.4° to 9.9°; and (iii) on OTB motion-blur sequences, it improves IoU from 0.55 to 0.74 on BlurCar2 with a 96% success rate. As a fully differentiable symplectic module, K-GMRF provides a plug-and-play geometric prior for data-constrained scenarios and an interpretable layer within modern deep architectures.

Keywords: covariance tracking, Lie groups, rigid-body motion, symplectic integrator, online estimation, geometric prior, kinetic Gauss-Markov random field, Euler-Poincaré equations
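
The abstract's contrast between the proportional lag of first-order updates and the zero steady-state error of second-order dynamics can be illustrated with a scalar analogy. The sketch below is not the authors' method: it works with a plain angle in flat space rather than on a Lie group, and it swaps in a textbook alpha-beta filter for the Euler-Poincaré symplectic integrator. All names and gain values are hypothetical.

```python
def track_constant_rotation(omega=0.1, steps=500, a=0.2, alpha=0.5, beta=0.1):
    """Return final tracking errors (first-order EMA, second-order alpha-beta)."""
    x_ema = 0.0             # first-order state: angle estimate only
    x_ab, v_ab = 0.0, 0.0   # second-order state: angle and angular velocity
    theta = 0.0
    for _ in range(steps):
        theta += omega                    # ground truth rotates at a constant rate
        x_ema += a * (theta - x_ema)      # first-order exponential moving average
        x_pred = x_ab + v_ab              # second-order: predict using the velocity
        residual = theta - x_pred
        x_ab = x_pred + alpha * residual  # correct the angle estimate
        v_ab += beta * residual           # correct the velocity estimate
    return theta - x_ema, theta - x_ab


ema_err, ab_err = track_constant_rotation()
print(f"first-order lag: {ema_err:.4f}, second-order lag: {ab_err:.2e}")
```

With these values the first-order error follows e ← (1 − a)(e + ω) and settles at ω(1 − a)/a = 0.4, while the second-order error decays towards zero because the filter learns the constant velocity.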

190. ❌ FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Authors: Zhifei Yang, Guangyao Zhai, Keyang Lu, YuYang Yin, Chao Zhang, Zhen Xiao, Jieyi Long, Nassir Navab, Yikai Wang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19598v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

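Scoring rationale: The paper studies indoor scene generation, using a rectified flow model conditioned on multimodal graphs to generate scene layouts, object shapes, and textures; it belongs to computer vision and graphics. The keywords all focus on large language models (LLMs) and related techniques (training methods, inference optimization, alignment, applications), whereas this paper involves neither language models nor deep learning applied to science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper presents FlowScene, a tri-branch scene generative model conditioned on multimodal graphs. A tightly coupled rectified flow model enables collaborative reasoning across objects, and FlowScene outperforms existing methods in generation realism, style consistency, and alignment with human preferences.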

Abstract

Scene generation has extensive industrial applications, demanding both high realism and precise control over geometry and appearance. Language-driven retrieval methods compose plausible scenes from a large object database, but overlook object-level control and often fail to enforce scene-level style coherence. Graph-based formulations offer higher controllability over objects and inform holistic consistency by explicitly modeling relations, yet existing methods struggle to produce high-fidelity textured results, thereby limiting their practical utility. We present FlowScene, a tri-branch scene generative model conditioned on multimodal graphs that collaboratively generates scene layouts, object shapes, and object textures. At its core lies a tight-coupled rectified flow model that exchanges object information during generation, enabling collaborative reasoning across the graph. This enables fine-grained control of objects’ shapes, textures, and relations while enforcing scene-level style coherence across structure and appearance. Extensive experiments show that FlowScene outperforms both language-conditioned and graph-conditioned baselines in terms of generation realism, style consistency, and alignment with human preferences.

Keywords: scene generation, multimodal graph, rectified flow, style coherence, object-level control, collaborative reasoning, indoor scenes, generative model

191. ❌ HiFiGaze: Improving Eye Tracking Accuracy Using Screen Content Knowledge

Authors: Taejun Kim, Vimal Mollyn, Riku Arakawa, Chris Harrison Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19588v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

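Scoring rationale: The paper studies a computer vision technique that uses knowledge of on-screen content to improve eye-tracking accuracy, and belongs to human-computer interaction. It does not touch on large models, new deep-learning techniques, or AI for Science, so none of the keywords is relevant.

!!! tip deepseek-chat TL;DR

The paper proposes a new method that uses the device's knowledge of its own screen content to segment the screen reflection in the user's eyes, improving eye-tracking accuracy on consumer devices. In a user study, the best model reduces mean tracking error by about 8% relative to an appearance-based baseline.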

Abstract

We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in e.g., smartphones, laptops, and desktops - 4K or greater in high-end devices - such that it is now possible to capture the 2D reflection of a device’s screen in the user’s eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen - in this work, we show this information allows for robust segmentation of the reflection, the location and size of which encodes the user’s screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best performing model reduces mean tracking error by ~8% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.

Keywords: gaze estimation, eye tracking, screen reflection, computer vision, human-computer interaction, user study, tracking error, consumer devices

192. ❌ MagicSeg: Open-World Segmentation Pretraining via Counterfactural Diffusion-Based Auto-Generation

Authors: Kaixin Cai, Pengzhen Ren, Jianhua Han, Yi Zhu, Hang Xu, Jianzhuang Liu, Xiaodan Liang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19575v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

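Scoring rationale: The paper focuses on open-world semantic segmentation in computer vision, using diffusion models to generate datasets automatically and contrastive learning for pretraining. The scoring keywords all relate directly to large language model (LLM) techniques, training methods, inference optimization, agent systems, and similar topics, whereas this paper studies a vision task (semantic segmentation) and vision models (diffusion, detection, and segmentation models) and does not involve any LLM technique or related concept, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper proposes MagicSeg, a diffusion-model-driven pipeline that automatically generates training datasets for open-world semantic segmentation. It generates paired positive and negative (counterfactual) samples for contrastive training and combines an open-vocabulary detection model with an interactive segmentation model to extract precise masks, reaching state-of-the-art results on PASCAL VOC (62.9%), PASCAL Context (26.7%), and COCO (40.2%).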

Abstract

Open-world semantic segmentation presently relies significantly on extensive image-text pair datasets, which often suffer from a lack of fine-grained pixel annotations on sufficient categories. The acquisition of such data is rendered economically prohibitive due to the substantial investments of both human labor and time. In light of the formidable image generation capabilities of diffusion models, we introduce a novel diffusion model-driven pipeline for automatically generating datasets tailored to the needs of open-world semantic segmentation, named “MagicSeg”. Our MagicSeg initiates from class labels and proceeds to generate high-fidelity textual descriptions, which in turn serve as guidance for the diffusion model to generate images. Rather than only generating positive samples for each label, our process encompasses the simultaneous generation of corresponding negative images, designed to serve as paired counterfactual samples for contrastive training. Then, to provide a self-supervised signal for open-world segmentation pretraining, our MagicSeg integrates an open-vocabulary detection model and an interactive segmentation model to extract object masks as precise segmentation labels from images based on the provided category labels. By applying our dataset to the contrastive language-image pretraining model with the pseudo mask supervision and the auxiliary counterfactual contrastive training, the downstream model obtains strong performance on open-world semantic segmentation. We evaluate our model on PASCAL VOC, PASCAL Context, and COCO, achieving SOTA with performance of 62.9%, 26.7%, and 40.2%, respectively, demonstrating our dataset’s effectiveness in enhancing open-world semantic segmentation capabilities. Project website: https://github.com/ckxhp/magicseg.

Keywords: Open-world semantic segmentation, Diffusion models, Dataset generation, Counterfactual samples, Contrastive training, Open-vocabulary detection, Interactive segmentation, Pseudo mask supervision

193. ❌ CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management

Authors: Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19571v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

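Scoring rationale: The paper focuses on memory management for multimodal large language models (MLLMs) in streaming video understanding. It is highly relevant to 'Large Language Models' (relevance 10), because MLLMs are an extension of LLMs. It is partly relevant to 'Context Window Extension' (relevance 8): the paper handles long input sequences through visual memory management, which does not extend the context window directly but tackles a similar long-sequence challenge. The other keywords, such as MoE, SLMs, training methods, inference acceleration, and AI for Science, have no direct connection to the paper's visual memory management framework and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes CurveStream, a training-free framework that uses curvature-aware hierarchical visual memory management to address out-of-memory errors and catastrophic forgetting when multimodal large language models process streaming video. It reports absolute gains of over 10% over the respective baselines (10.69% on StreamingBench and 13.58% on OVOBench).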

Abstract

Multimodal Large Language Models have achieved significant success in offline video understanding, yet their application to streaming videos is severely limited by the linear explosion of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. However, these strategies often lack intrinsic semantic awareness, potentially disrupting contextual coherence and blurring transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework, CurveStream, consistently yields absolute performance gains of over 10% (e.g., 10.69% on StreamingBench and 13.58% on OVOBench) over respective baselines, establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.

Keywords: Multimodal Large Language Models, Streaming Video Understanding, Visual Memory Management, Curvature-Aware, Hierarchical Memory, Semantic Transitions, Token Budget, State-of-the-Art
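
The routing idea described in the abstract (score each frame by the curvature of the feature trajectory, then compare it with a running mean-plus-k-sigma threshold) can be sketched as follows. This is a rough illustration under stated assumptions, not CurveStream's implementation: the abstract does not define the Curvature Score or the K-Sigma rule, so the turning angle between successive feature displacements stands in for curvature, and Welford's online algorithm stands in for the threshold statistics. The names `turning_angle` and `route_frames`, the parameters `k` and `warmup`, and the toy 2-D features are all hypothetical.

```python
import math


def turning_angle(p0, p1, p2):
    """Discrete curvature proxy: angle between successive displacement vectors."""
    d1 = [b - a for a, b in zip(p0, p1)]
    d2 = [b - a for a, b in zip(p1, p2)]
    n1 = math.sqrt(sum(x * x for x in d1))
    n2 = math.sqrt(sum(x * x for x in d2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    cos = sum(x * y for x, y in zip(d1, d2)) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos)))


def route_frames(features, k=2.0, warmup=3):
    """Label each interior frame 'clear' if its score exceeds mean + k*std of past scores."""
    labels, n, mean, m2 = [], 0, 0.0, 0.0
    for i in range(1, len(features) - 1):
        s = turning_angle(features[i - 1], features[i], features[i + 1])
        std = math.sqrt(m2 / n) if n > 0 else 0.0
        is_clear = n >= warmup and s > mean + k * std
        labels.append("clear" if is_clear else "fuzzy")
        # Welford's online update of the running mean and variance.
        n += 1
        delta = s - mean
        mean += delta / n
        m2 += delta * (s - mean)
    return labels


# Toy trajectory: a straight line with one sharp 90-degree turn.
features = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (5, 1), (5, 2)]
print(route_frames(features))
```

In this toy run only the frame at the turn is routed to the clear state. Note that scoring frame i needs frame i + 1, so the sketch lags the stream by one frame.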

194. ❌ Accelerating Diffusion Decoders via Multi-Scale Sampling and One-Step Distillation

Authors: Chuhan Wang, Hao Chen Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19570v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	5.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

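Scoring rationale: The paper focuses on accelerating diffusion decoders for image tokenization and proposes multi-scale sampling and one-step distillation. The keywords all relate to large language models (LLMs), whereas this paper studies diffusion models in computer vision, a different technical area. The only weakly related keyword is 'Speculative Decoding OR Inference Acceleration': the paper does concern inference acceleration, but its technique (faster diffusion sampling) differs from LLM inference acceleration, so it receives a relevance of 5. The remaining keywords (LLM techniques, alignment, agents, AI-for-science applications, and so on) are unrelated and all receive 0.

!!! tip deepseek-chat TL;DR

The paper targets the slow sampling of diffusion decoders in image tokenization and proposes a two-stage acceleration framework, multi-scale sampling followed by one-step distillation, that cuts decoding time by an order of magnitude with little loss in output quality.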

Abstract

Image tokenization plays a central role in modern generative modeling by mapping visual inputs into compact representations that serve as an intermediate signal between pixels and generative models. Diffusion-based decoders have recently been adopted in image tokenization to reconstruct images from latent representations with high perceptual fidelity. In contrast to diffusion models used for downstream generation, these decoders are dedicated to faithful reconstruction rather than content generation. However, their iterative sampling process introduces significant latency, making them impractical for real-time or large-scale applications. In this work, we introduce a two-stage acceleration framework to address this inefficiency. First, we propose a multi-scale sampling strategy, where decoding begins at a coarse resolution and progressively refines the output by doubling the resolution at each stage, achieving a theoretical speedup of $\mathcal{O}(\log n)$ compared to standard full-resolution sampling. Second, we distill the diffusion decoder at each scale into a single-step denoising model, enabling fast and high-quality reconstructions in a single forward pass per scale. Together, these techniques yield an order-of-magnitude reduction in decoding time with little degradation in output quality. Our approach provides a practical pathway toward efficient yet expressive image tokenizers. We hope it serves as a foundation for future work in efficient visual tokenization and downstream generation.

Keywords: diffusion decoders, image tokenization, multi-scale sampling, one-step distillation, inference acceleration, latent representations, generative modeling, perceptual fidelity
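
The multi-scale strategy described in the abstract starts decoding at a coarse resolution and doubles it at each stage. The sketch below only enumerates such a resolution schedule and counts its stages; it does not run any decoder. The function name `resolution_schedule` and the example sizes are illustrative assumptions, not taken from the paper.

```python
def resolution_schedule(full_res: int, coarse_res: int) -> list[int]:
    """Return the resolutions visited when doubling from coarse_res up to full_res."""
    if coarse_res <= 0 or full_res < coarse_res:
        raise ValueError("need 0 < coarse_res <= full_res")
    schedule = [coarse_res]
    # Double the resolution until the full resolution is reached (clamped at the end).
    while schedule[-1] < full_res:
        schedule.append(min(schedule[-1] * 2, full_res))
    return schedule


stages = resolution_schedule(256, 32)  # hypothetical sizes
print(stages)       # resolutions visited, coarse to fine
print(len(stages))  # number of stages
```

The number of stages grows logarithmically with the ratio of full to coarse resolution. With one-step distillation as described in the abstract, each stage would cost a single forward pass.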

195. ❌ MeanFlow Meets Control: Scaling Sampled-Data Control for Swarms

Authors: Anqi Dong, Yongxin Chen, Karl H. Johansson, Johan Karlsson Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20189v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

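Scoring rationale: The paper studies sampled-data control of swarm systems, a control-theory problem, using a MeanFlow-inspired control-space learning framework for swarm steering under linear time-invariant dynamics. The scoring keywords target large models, deep learning fundamentals, and specific AI application areas. Although the method is learning-based (a coefficient trained with a stop-gradient objective), the paper belongs to control theory and optimization and touches none of the listed topics, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper addresses the steering of large-scale swarms under sampled-data control. It proposes a MeanFlow-based control-space learning framework that learns the coefficient parameterizing the finite-horizon minimum-energy control, giving few-step swarm steering that is consistent with the sampled-data structure of real control systems.

Abstract (Chinese translation)

仅通过少量控制更新来引导大规模集群之所以具有挑战性，是因为实际系统以采样数据形式运行：控制输入间歇性地更新并在有限区间内持续施加。在此模式下，关键对象并非瞬时速度场，而是捕捉每个采样区间内系统响应的有限窗口控制量。受MeanFlow启发，我们针对线性时不变动力学下的集群引导问题，提出了一种控制空间学习框架。该框架的学习对象是参数化每个区间内有限时域最小能量控制的系数。我们证明该系数既具有积分表达形式，又沿桥轨迹满足局部微分恒等式，从而导出了简洁的停梯度训练目标。在实施阶段，学习得到的系数直接用于采样数据更新，因此预设的动力学模型与驱动映射在结构层面得以严格遵循。该框架为符合实际控制系统采样数据结构的少步数集群引导提供了一种可扩展的解决方案。

Abstract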

Steering large-scale swarms in only a few control updates is challenging because real systems operate in sampled-data form: control inputs are updated intermittently and applied over finite intervals. In this regime, the natural object is not an instantaneous velocity field, but a finite-window control quantity that captures the system response over each sampling interval. Inspired by MeanFlow, we introduce a control-space learning framework for swarm steering under linear time-invariant dynamics. The learned object is the coefficient that parameterizes the finite-horizon minimum-energy control over each interval. We show that this coefficient admits both an integral representation and a local differential identity along bridge trajectories, which leads to a simple stop-gradient training objective. At implementation time, the learned coefficient is used directly in sampled-data updates, so the prescribed dynamics and actuation map are respected by construction. The resulting framework provides a scalable approach to few-step swarm steering that is consistent with the sampled-data structure of real control systems.

Keywords: swarm steering, sampled-data control, MeanFlow, control-space learning, finite-horizon control, linear time-invariant dynamics, minimum-energy control, bridge trajectories
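
The abstract refers to a coefficient that parameterizes the finite-horizon minimum-energy control over one sampling interval. The sketch below shows the classical closed form for a single agent, assuming the textbook Gramian-based formula: with W(h) the controllability Gramian, c = W(h)^{-1} (x_target - e^{Ah} x0) and u(t) = B^T e^{A^T (h - t)} c. The double-integrator system, the function names, and the numbers are illustrative assumptions. The paper learns its coefficient rather than computing it this way, and its exact parameterization may differ.

```python
import numpy as np

# Double integrator (position, velocity) driven by acceleration; chosen only
# as a simple LTI example, not taken from the paper.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])


def expm_a(t: float) -> np.ndarray:
    """exp(A t); A is nilpotent (A @ A = 0), so the series stops after A t."""
    return np.eye(2) + A * t


def gramian(h: float, steps: int = 2000) -> np.ndarray:
    """Midpoint-rule estimate of W(h) = integral_0^h e^{As} B B^T e^{A^T s} ds."""
    ds = h / steps
    W = np.zeros((2, 2))
    for k in range(steps):
        E = expm_a((k + 0.5) * ds) @ B
        W += (E @ E.T) * ds
    return W


def min_energy_coefficient(x0, x_target, h):
    """Coefficient c such that u(t) = B^T e^{A^T (h - t)} c steers x0 to x_target."""
    return np.linalg.solve(gramian(h), x_target - expm_a(h) @ x0)


def control(t, c, h):
    """Scalar control input at time t within the interval [0, h]."""
    return (B.T @ expm_a(h - t).T @ c).item()


def simulate(x0, c, h, steps=2000):
    """Apply the control over one interval via the variation-of-constants formula."""
    dt = h / steps
    x = expm_a(h) @ x0
    for k in range(steps):
        t = (k + 0.5) * dt
        x = x + (expm_a(h - t) @ B).ravel() * control(t, c, h) * dt
    return x


x0 = np.array([0.0, 0.0])
x_target = np.array([1.0, 0.0])
h = 0.5  # length of one sampling interval (hypothetical)
c = min_energy_coefficient(x0, x_target, h)
print(simulate(x0, c, h))  # should be close to x_target
```

Because the control is built from the system matrices themselves, the prescribed dynamics and actuation map are respected by construction, which is the property the abstract highlights.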

196. ❌ Revisiting Gene Ontology Knowledge Discovery with Hierarchical Feature Selection and Virtual Study Group of AI Agents

Authors: Cen Wan, Alex A. Freitas Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20132v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	10.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

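Scoring rationale: The paper explicitly uses large language models (LLMs) and agentic AI techniques for knowledge discovery in bioinformatics, so it is highly relevant (10/10) to "Large Language Models", "LLM Agents", "Multi-agent Systems", and "AI for Science". Other keywords such as MoE, quantization, and inference acceleration are not mentioned in the abstract and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes an agentic-AI virtual study group that uses large language models to extract ageing-related biological knowledge from Gene Ontology terms. A literature review finds that most of the AI-generated scientific claims are supported by existing work.

Abstract (Chinese translation)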

大语言模型已在多项挑战性任务中取得显著成功，其能力可通过新兴的智能体人工智能技术得到进一步增强。这一新型计算范式已开始革新传统的科学发现流程。本研究提出一种基于智能体人工智能、面向知识发现的新型虚拟研究小组，旨在通过采用经层次特征选择方法筛选出的高度衰老相关基因本体（Gene Ontology, GO）术语，挖掘有意义的衰老相关生物学知识。我们通过分析四种不同模式生物的衰老相关基因本体术语来评估所提智能体人工智能框架的性能，并借助现有研究文献对生物学发现进行验证。结果表明，智能体生成的大部分科学主张能够得到现有文献支持，且所提出的虚拟研究小组内部机制在该基于智能体人工智能的知识发现框架中也发挥着重要作用。

Abstract

Large language models have achieved great success in multiple challenging tasks, and their capacity can be further boosted by the emerging agentic AI techniques. This new computing paradigm has already started revolutionising the traditional scientific discovery pipelines. In this work, we propose a novel agentic AI-based knowledge discovery-oriented virtual study group that aims to extract meaningful ageing-related biological knowledge considering highly ageing-related Gene Ontology terms that are selected by hierarchical feature selection methods. We investigate the performance of the proposed agentic AI framework by considering four different model organisms’ ageing-related Gene Ontology terms and validate the biological findings by reviewing existing research articles. It is found that the majority of the AI agent-generated scientific claims can be supported by existing literatures and the proposed internal mechanisms of the virtual study group also play an important role in the designed agentic AI-based knowledge discovery framework.

Keywords: Large Language Models, Agentic AI, Virtual Study Group, Gene Ontology, Bioinformatics, Knowledge Discovery, Ageing-related Biology, AI Agents

197. ❌ Kolmogorov-Arnold causal generative models

Authors: Alejandro Almodóvar, Mar Elizo, Patricia A. Apellániz, Santiago Zazo, Juan Parras Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20184v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

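Scoring rationale: The paper proposes KaCGM, a causal generative model built on Kolmogorov-Arnold Networks (KANs) for interpretable causal modeling of tabular data. It is unrelated to most keywords, which cover large-model techniques, training methods, inference optimization, agent systems, and similar topics. It is partially relevant (5/10) to "Mechanistic Interpretability OR Explainable AI" because it emphasizes model transparency and interpretability. It is also partially relevant (5/10) to "AI for Science OR Bioinformatics OR Cheminformatics" because it includes a cardiovascular case study, which is a scientific and biomedical application but not a core large-model technique. In essence the paper combines causal inference with explainable AI and does not involve large models or innovations in deep learning fundamentals.

!!! tip deepseek-chat TL;DR

The paper proposes KaCGM, an interpretable causal generative model for tabular data built on Kolmogorov-Arnold Networks. It achieves competitive performance while supporting transparent analysis of the learned causal mechanisms.

Abstract (Chinese translation)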

因果生成模型为从观测数据中回答观测性、干预性和反事实性查询提供了原则性框架。然而，许多深度因果模型依赖高度表达性但机制不透明的架构，限制了在高风险领域的可审计性。我们提出KaCGM，一种针对混合类型表格数据的因果生成模型，其中每个结构方程均由柯尔莫哥洛夫-阿诺德网络（Kolmogorov–Arnold Network，KAN）参数化。这种分解方式使得能够直接检查学习到的因果机制，包括符号近似和父子关系的可视化，同时保持与查询无关的生成语义。我们引入了一个基于分布匹配和推断外生变量独立性诊断的验证流程，允许仅使用观测数据进行评估。在合成和半合成基准测试上的实验表明，其性能与最先进方法相比具有竞争力。一项真实世界心血管案例研究进一步展示了简化结构方程和可解释因果效应的提取能力。这些结果表明，表达性因果生成建模与功能透明度可以同时实现，支持在表格化决策场景中的可信部署。代码：https://github.com/aalmodovares/kacgm

Abstract

Causal generative models provide a principled framework for answering observational, interventional, and counterfactual queries from observational data. However, many deep causal models rely on highly expressive architectures with opaque mechanisms, limiting auditability in high-stakes domains. We propose KaCGM, a causal generative model for mixed-type tabular data where each structural equation is parameterized by a Kolmogorov–Arnold Network (KAN). This decomposition enables direct inspection of learned causal mechanisms, including symbolic approximations and visualization of parent–child relationships, while preserving query-agnostic generative semantics. We introduce a validation pipeline based on distributional matching and independence diagnostics of inferred exogenous variables, allowing assessment using observational data alone. Experiments on synthetic and semi-synthetic benchmarks show competitive performance against state-of-the-art methods. A real-world cardiovascular case study further demonstrates the extraction of simplified structural equations and interpretable causal effects. These results suggest that expressive causal generative modeling and functional transparency can be achieved jointly, supporting trustworthy deployment in tabular decision-making settings. Code: https://github.com/aalmodovares/kacgm

Keywords: causal generative models, Kolmogorov-Arnold Networks, interpretability, tabular data, structural equations, trustworthy AI, cardiovascular case study, exogenous variables
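
The abstract mentions structural equations, inferred exogenous variables, and counterfactual queries. The sketch below illustrates the standard abduction, action, prediction recipe on a two-variable model x -> y. It assumes additive noise, which the abstract does not state, and the hand-written function `f_y` is a hypothetical stand-in for a learned KAN. It is not the KaCGM implementation.

```python
import math


def f_y(x: float) -> float:
    """Hypothetical structural function for y given its parent x."""
    return math.sin(x) + 0.5 * x


def abduct_noise(x_obs: float, y_obs: float) -> float:
    """Abduction: recover the exogenous term u_y = y - f_y(x) (additive-noise assumption)."""
    return y_obs - f_y(x_obs)


def counterfactual_y(x_obs: float, y_obs: float, x_new: float) -> float:
    """Action and prediction: set x to x_new, keep the inferred u_y, recompute y."""
    u_y = abduct_noise(x_obs, y_obs)
    return f_y(x_new) + u_y


# What would y have been for this unit if x had been 2.0 instead of 0.0?
print(counterfactual_y(x_obs=0.0, y_obs=0.3, x_new=2.0))
```

Keeping each structural function explicit is what makes this kind of inspection possible: the inferred noise term can be checked for independence from the parents, in the spirit of the validation pipeline the abstract describes.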

198. ❌ Conditioning Protein Generation via Hopfield Pattern Multiplicity

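Authors: Jeffrey D. Varner Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20115v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0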
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

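Scoring rationale: The paper focuses on protein sequence generation, a bioinformatics topic, and is highly relevant (10/10) to "AI for Science OR Bioinformatics OR Cheminformatics" because it applies an AI technique (stochastic attention over Hopfield patterns) to protein design. It does not involve large language models (LLMs), innovations in deep learning fundamentals, or the other topics in the keyword list, such as MoE, SLMs, scaling laws, training methods, inference optimization, or agent systems, so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The paper shows that adding a bias parameter to the sampler's attention logits steers protein sequence generation from the whole family toward a user-specified functional subset without retraining. It applies the method to generate a candidate library of calcium-channel-targeting peptides.

Abstract (Chinese translation)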

通过随机注意力机制生成蛋白质序列的方法，无需训练即可从小规模比对中生成合理的家族成员，但其对所有存储序列一视同仁，无法引导生成过程朝向特定功能子集。本研究表明，在采样器的注意力对数中添加单个标量参数作为偏置，即可在不重新训练且不改变模型架构的情况下，将生成序列从完整家族连续转向用户指定的子集。使用者只需提供一小部分序列（例如结合筛选中的命中序列）及一个控制生成过程偏向该子集强度的倍数比参数。该方法不依赖于子集所代表的特定属性：无论是结合性、稳定性、特异性还是其他任何功能特性。我们发现，条件控制在采样器内部表征层面是精确的，但解码后的序列表型可能无法完全实现预期，因为用于编码序列的降维方法并不总能保留定义功能分化的残基水平变异。我们将这种差异称为校准间隙，并证明其可通过一个简单的几何度量来预测——该度量反映了编码过程对功能子集与家族其他部分的分离程度。在五个Pfam家族（Kunitz、SH3、WW、同源异型盒及Forkhead结构域）上的实验证实，在四倍几何变化范围内，分离度与校准间隙之间存在单调关系。将此方法应用于靶向疼痛信号相关钙通道的ω-芋螺毒素肽，从23个经表征的结合剂序列出发，生成了上千个保留主要药效团及所有实验验证结合决定簇的候选序列。这些结果表明，随机注意力机制使研究者能够将少量实验表征序列扩展为多样化的候选库，而无需重新训练生成模型。

Abstract

Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored sequences equally and cannot direct generation toward a functional subset of interest. We show that a single scalar parameter, added as a bias to the sampler’s attention logits, continuously shifts generation from the full family toward a user-specified subset, with no retraining and no change to the model architecture. A practitioner supplies a small set of sequences (for example, hits from a binding screen) and a multiplicity ratio that controls how strongly generation favors them. The method is agnostic to what the subset represents: binding, stability, specificity, or any other property. We find that the conditioning is exact at the level of the sampler’s internal representation, but that the decoded sequence phenotype can fall short because the dimensionality reduction used to encode sequences does not always preserve the residue-level variation that defines the functional split. We term this discrepancy the calibration gap and show that it is predicted by a simple geometric measure of how well the encoding separates the functional subset from the rest of the family. Experiments on five Pfam families (Kunitz, SH3, WW, Homeobox, and Forkhead domains) confirm the monotonic relationship between separation and gap across a fourfold range of geometries. Applied to omega-conotoxin peptides targeting a calcium channel involved in pain signaling, curated seeding from 23 characterized binders produces over a thousand candidates that preserve the primary pharmacophore and all experimentally identified binding determinants. These results show that stochastic attention enables practitioners to expand a handful of experimentally characterized sequences into diverse candidate libraries without retraining a generative model.

Keywords: Protein sequence generation, Stochastic attention, Hopfield pattern, Conditioning, Functional subset, Calibration gap, Bioinformatics, Generative model
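
The abstract describes a scalar bias on the attention logits, controlled by a multiplicity ratio. One way to read this is that adding log(r) to the logits of the subset patterns has the same effect on softmax attention as storing each of those patterns r times. The sketch below checks that identity numerically. Whether the paper uses exactly this parameterization is an assumption, and the random patterns, `beta`, and `r` are placeholders rather than real sequence encodings.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attention_weights(patterns, query, beta, log_multiplicity):
    """Softmax attention over stored patterns with an additive logit bias."""
    logits = beta * patterns @ query + log_multiplicity
    return softmax(logits)


rng = np.random.default_rng(0)
patterns = rng.normal(size=(5, 8))  # 5 stored patterns (placeholders)
query = rng.normal(size=8)
beta = 2.0
r = 3                               # multiplicity ratio for the subset
subset = np.array([True, True, False, False, False])

# Conditioning via a logit bias: log(r) on subset members, 0 elsewhere.
bias = np.where(subset, np.log(r), 0.0)
biased = attention_weights(patterns, query, beta, bias)

# The same weights obtained by literally storing each subset pattern r times.
replicated = np.vstack([patterns] + [patterns[subset]] * (r - 1))
w = attention_weights(replicated, query, beta, 0.0)
folded = w[:5].copy()
folded[subset] += w[5:].reshape(r - 1, -1).sum(axis=0)

print(np.allclose(biased, folded))
```

Unlike literal replication, the bias form also accepts non-integer ratios, which would allow generation to shift continuously between the full family and the subset.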

199. ❌ GO-GenZip: Goal-Oriented Generative Sampling and Hybrid Compression

Authors: Pietro Talli, Qi Liao, Alessandro Lieto, Parijat Bhattacharjee, Federico Chiariotti, Andrea Zanella Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20109v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

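Scoring rationale: The paper studies a generative-AI sampling and compression framework for network telemetry data. Although it involves generative AI (GenAI), it does not specifically discuss large language models (LLMs) or deep learning fundamentals, nor does it address applications in scientific fields such as biomedicine. The keywords all target specific large-model directions (alignment, reasoning, compression, and so on) or AI applications in science, whereas the GenAI here is applied only to network data compression and has no direct connection to them.

!!! tip deepseek-chat TL;DR

The paper proposes a goal-oriented generative-AI sampling and hybrid compression framework for collecting and transmitting network telemetry data. Experiments show that it cuts sampling and data transfer costs by more than 50% while maintaining reconstruction accuracy and analytical fidelity on downstream tasks.

Abstract (Chinese translation)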

当前网络数据遥测管道由海量细粒度关键性能指标流构成，这些数据从多个分布式源流向中央聚合器，导致数据存储、传输和实时分析日益难以持续。本研究提出一种生成式人工智能驱动的采样与混合压缩框架，从目标导向的视角重构网络遥测体系。与传统方法被动压缩全量观测数据不同，我们的方法基于信息与下游任务的相关性，联合优化观测内容与编码方式。该框架集成自适应采样策略（采用自适应掩码技术）与生成式建模，以识别时空维度的数据模式并保留关键特征。通过选择性采集的数据进一步经由混合压缩方案处理，该方案结合传统无损编码与生成式人工智能驱动的有损压缩。在真实网络数据集上的实验表明，该方法在保持可比拟的重构精度与目标导向分析保真度的同时，能降低超过50%的采样与数据传输成本。

Abstract

Current network data telemetry pipelines consist of massive streams of fine-grained Key Performance Indicators (KPIs) from multiple distributed sources towards central aggregators, making data storage, transmission, and real-time analysis increasingly unsustainable. This work presents a generative AI (GenAI)-driven sampling and hybrid compression framework that redesigns network telemetry from a goal-oriented perspective. Unlike conventional approaches that passively compress fully observed data, our approach jointly optimizes what to observe and how to encode it, guided by the relevance of information to downstream tasks. The framework integrates adaptive sampling policies, using adaptive masking techniques, with generative modeling to identify patterns and preserve critical features across temporal and spatial dimensions. The selectively acquired data are further processed through a hybrid compression scheme that combines traditional lossless coding with GenAI-driven, lossy compression. Experimental results on real network datasets demonstrate over 50% reductions in sampling and data transfer costs, while maintaining comparable reconstruction accuracy and goal-oriented analytical fidelity in downstream tasks.

Keywords: Generative AI, Network Telemetry, Goal-oriented Sampling, Hybrid Compression, Adaptive Masking, Data Transfer Reduction, KPIs, Reconstruction Accuracy

200. ❌ Trojan horse hunt in deep forecasting models: Insights from the European Space Agency competition

Authors: Krzysztof Kotowski, Ramez Shendy, Jakub Nalepa, Agata Kaczmarek, Dawid Płudowski, Piotr Wilczyński, Artur Janicki, Przemysław Biecek, Ambros Marzetta, Atul Pande, Lalit Chandra Routhu, Swapnil Srivastava, Evridiki Ntagiou Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20108v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

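Scoring rationale: The paper studies the detection of trojan horse attacks in deep forecasting models, a deep learning security topic. It focuses on finding backdoors in time series forecasting models and covers the organization of a data science competition, task design, evaluation, and analysis of the solutions. The scoring keywords concern large language model (LLM) topics such as training methods, inference optimization, alignment, and agent systems, none of which the paper addresses. It deals with the security of conventional deep forecasting models rather than large-model techniques or their applications in science, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

Through a European Space Agency data science competition, the paper studies how to detect trojan horse attacks in deep time series forecasting models. It summarizes the task design, evaluation protocol, and best solutions, and presents key insights and research directions for identifying triggers.

Abstract (Chinese translation)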

预测在现代安全关键应用中扮演着至关重要的角色，例如航天任务。然而，深度学习预测模型的日益广泛应用也带来了新的安全风险——特洛伊木马攻击，这种攻击通过在训练数据中或直接在模型权重中隐藏后门来实现。一旦后门被植入，在测试阶段通过特定的触发模式即可将其激活，导致模型产生被操纵的预测结果。我们在“特洛伊木马猎手”（Trojan Horse Hunt）数据科学竞赛中聚焦此问题，超过200支参赛队伍面临的任务是从航天器遥测数据的深度预测模型中识别隐藏的触发模式。本文阐述了该竞赛新颖的任务设定、基准数据集、评估方案以及最优解决方案。我们进一步总结了在时间序列预测模型中有效识别触发模式的关键见解与研究方向。所有材料已在竞赛官方网站（https://www.kaggle.com/competitions/trojan-horse-hunt-in-space）上公开。

Abstract

Forecasting plays a crucial role in modern safety-critical applications, such as space operations. However, the increasing use of deep forecasting models introduces a new security risk of trojan horse attacks, carried out by hiding a backdoor in the training data or directly in the model weights. Once implanted, the backdoor is activated by a specific trigger pattern at test time, causing the model to produce manipulated predictions. We focus on this issue in our Trojan Horse Hunt data science competition, where more than 200 teams faced the task of identifying triggers hidden in deep forecasting models for spacecraft telemetry. We describe the novel task formulation, benchmark set, evaluation protocol, and best solutions from the competition. We further summarize key insights and research directions for effective identification of triggers in time series forecasting models. All materials are publicly available on the official competition webpage https://www.kaggle.com/competitions/trojan-horse-hunt-in-space.

Keywords: Trojan horse attacks, deep forecasting models, time series forecasting, backdoor detection, spacecraft telemetry, security risk, data science competition, trigger identification

201. ❌ Antenna Array Beamforming Based on a Hybrid Quantum Optimization Framework

Authors: Shuai Zeng Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20072v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

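Scoring rationale: The paper focuses on antenna-array beamforming optimization using a quantum-inspired optimization framework, which belongs to wireless communication engineering. It does not involve large language models, deep learning, AI for Science, or any technical concept among the scoring keywords. No keyword is relevant to the topic, so all scores are 0.

!!! tip deepseek-chat TL;DR

The paper proposes a hybrid quantum optimization framework for large-scale antenna-array beamforming that combines quantum-inspired search with classical gradient refinement. In simulations on a 32-element array it scores nearly double the baseline.

Abstract (Chinese translation)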

本文提出一种用于大规模天线阵列波束赋形的混合量子优化框架，能够联合优化离散相位与连续幅度。该方法结合量子启发式搜索与经典梯度优化，以高效处理混合离散-连续变量。在相位优化方面，引入格雷码与奇组合编码方案，以提升鲁棒性并避免高阶伊辛模型复杂度爆炸。在幅度优化方面，提出几何自旋组合编码与两阶段策略，采用量子启发式优化进行粗搜索，并利用梯度优化进行精细调优。为增强解多样性与质量，彩虹量子启发算法集成多个优化器进行并行探索，随后采用基于层次聚类的候选解精炼。此外，提出双外积方法及其增强版本，以高效构建耦合矩阵与偏置向量，从而提升数值精度与实现效率。在第七届全国量子计算黑客松评分规则下，对32单元天线阵列的仿真表明，所提方法在近主瓣旁瓣、广角旁瓣、波束宽度及优化时间约束下获得461.58分，较基准分数提升近一倍。该框架为未来无线通信系统中的波束赋形优化提供了有效参考。

Abstract

This paper proposes a hybrid quantum optimization framework for large-scale antenna-array beamforming with jointly optimized discrete phases and continuous amplitudes. The method combines quantum-inspired search with classical gradient refinement to handle mixed discrete-continuous variables efficiently. For phase optimization, a Gray-code and odd-combination encoding scheme is introduced to improve robustness and avoid the complexity explosion of higher-order Ising models. For amplitude optimization, a geometric spin-combination encoding and a two-stage strategy are developed, using quantum-inspired optimization for coarse search and gradient optimization for fine refinement. To enhance solution diversity and quality, a rainbow quantum-inspired algorithm integrates multiple optimizers for parallel exploration, followed by hierarchical-clustering-based candidate refinement. In addition, a double outer-product method and an augmented version are proposed to construct the coupling matrix and bias vector efficiently, improving numerical precision and implementation efficiency. Under the scoring rules of the 7th National Quantum Computing Hackathon, simulations on a 32-element antenna array show that the proposed method achieves a score of 461.58 under constraints on near-main-lobe sidelobes, wide-angle sidelobes, beamwidth, and optimization time, nearly doubling the baseline score. The proposed framework provides an effective reference for beamforming optimization in future wireless communication systems.

Keywords: antenna array beamforming, quantum optimization, hybrid framework, phase optimization, amplitude optimization, wireless communication, quantum-inspired algorithm, sidelobe suppression
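
The abstract mentions a Gray-code encoding for the discrete phases. The sketch below shows only the standard binary-reflected Gray code, in which neighbouring phase levels differ in exactly one bit. The paper's "odd-combination" scheme and its Ising formulation are not reproduced here, and the 3-bit phase shifter is a hypothetical example.

```python
import math


def gray_encode(n: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def gray_decode(g: int) -> int:
    """Invert gray_encode by XOR-ing together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


bits = 3           # hypothetical 3-bit phase shifter
levels = 1 << bits
for k in range(levels):
    code = gray_encode(k)
    phase = 2 * math.pi * k / levels  # quantized phase in radians
    print(f"level {k}: code {code:0{bits}b}, phase {phase:.3f} rad")
```

With this encoding a single bit flip moves the phase to an adjacent level, including the wrap-around from the last level to the first. This is one plausible reason such an encoding could make a bit-level search more robust than plain binary.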

202. ❌ How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models

Authors: Luca Ambrogioni Venue/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20092v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

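Scoring rationale: The paper develops a theoretical framework for diffusion models, interpreting their generation process as an out-of-equilibrium phase transition and examining the mechanism of pattern formation. The scoring keywords are centered on large language models (LLMs), whereas this paper concerns diffusion models, a different branch of generative modeling. It does not involve LLMs, MoE, SLMs, alignment, fine-tuning, inference acceleration, agents, or any other keyword topic. Although it is an AI paper, its subject is unrelated to the keyword set, so all keywords score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a theoretical framework that interprets generation in trained diffusion models as an out-of-equilibrium phase transition. It shows that coherent patterns beyond the training data form as reverse diffusion passes through a critical regime, and demonstrates that this regime can be exploited to improve generation control.

Abstract (Chinese translation)

在本研究中，我们提出一个理论框架，将训练好的扩散模型中的生成过程解释为非平衡相变的一个实例。我们认为，反向扩散并非从噪声到数据平滑演化，而是会经过一个临界区域，其中微小的空间涨落被放大并孕育出大规模结构的雏形。我们的核心见解是：局部性、稀疏性和平移等变性等架构约束，将记忆驱动的不稳定性转化为集体空间模式，从而能够形成超出训练数据的连贯图案。通过使用可解析处理的补丁分数模型，我们展示了经典的对称破缺分岔如何推广为由软化傅里叶模式和增长关联长度所描述的空间扩展临界现象。我们进一步将这些动力学与金兹堡-朗道类型的有效场论以及非平衡物理中的图案形成机制联系起来。在训练好的卷积扩散模型上获得的实证结果验证了该理论，揭示了临界性的特征，包括模式软化和空间关联的快速增长。最后，我们证明这一临界区域具有实际意义：在估计的临界时间施加针对性扰动（例如无分类器引导脉冲）能显著提升生成控制能力。综上，这些发现将非平衡临界现象确立为一个统一原理，用于理解并可能改进现代扩散模型的行为。

Abstract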

In this work, we propose a theoretical framework that interprets the generation process in trained diffusion models as an instance of out-of-equilibrium phase transitions. We argue that, rather than evolving smoothly from noise to data, reverse diffusion passes through a critical regime in which small spatial fluctuations are amplified and seed the emergence of large-scale structure. Our central insight is that architectural constraints, such as locality, sparsity, and translation equivariance, transform memorization-driven instabilities into collective spatial modes, enabling the formation of coherent patterns beyond the training data. Using analytically tractable patch score models, we show how classical symmetry-breaking bifurcations generalize into spatially extended critical phenomena described by softening Fourier modes and growing correlation lengths. We further connect these dynamics to effective field theories of the Ginzburg-Landau type and to mechanisms of pattern formation in non-equilibrium physics. Empirical results on trained convolutional diffusion models corroborate the theory, revealing signatures of criticality including mode softening and rapid growth of spatial correlations. Finally, we demonstrate that this critical regime has practical relevance: targeted perturbations, such as classifier-free guidance pulses applied at the estimated critical time, significantly improve generation control. Together, these findings position non-equilibrium critical phenomena as a unifying principle for understanding, and potentially improving, the behavior of modern diffusion models.

Keywords: diffusion models, out-of-equilibrium phase transitions, pattern formation, critical phenomena, spatial correlations, generation control, Ginzburg-Landau theory
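
The classifier-free guidance pulse mentioned in the abstract can be pictured with a small schedule. This is only a sketch under assumptions: it uses the common guidance mix `eps_uncond + w * (eps_cond - eps_uncond)`, and the function names, pulse height, and window width are hypothetical illustrations rather than values from the paper.

```python
def guidance_scale(t, t_crit, base=1.0, pulse=4.0, half_width=0.05):
    """Guidance weight at normalized time t in [0, 1].

    The weight is raised by `pulse` only inside a short window around the
    (externally estimated) critical time t_crit. All defaults are hypothetical.
    """
    return base + pulse if abs(t - t_crit) <= half_width else base


def guided_prediction(eps_uncond, eps_cond, w):
    """Common classifier-free guidance mix, applied elementwise to two lists."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Inside the window the weight is boosted; outside it stays at the base value.
w_at_critical = guidance_scale(0.50, t_crit=0.50)
w_elsewhere = guidance_scale(0.90, t_crit=0.50)
mixed = guided_prediction([0.0, 1.0], [1.0, 3.0], w_at_critical)
```

With `w = 1.0` the mix reduces to the conditional prediction, so the pulse only changes behavior inside the chosen window.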

203. ❌ Federated Hyperdimensional Computing for Resource-Constrained Industrial IoT

Authors: Nikita Zeulin, Olga Galinina, Nageen Himayat, Sergey Andreev Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20037v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

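Scoring rationale: The paper studies federated learning and hyperdimensional computing (HDC) in the Industrial Internet of Things (IIoT); HDC is a lightweight learning paradigm for resource-constrained edge devices. The given keywords all target large language models (LLMs), deep learning techniques, and specific applications of them (such as AI for Science). The core of this paper is HDC and federated learning, with no direct connection to LLMs, deep model training, alignment, inference optimization, agent systems, or scientific AI applications. All keywords are therefore assigned a relevance of 0.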

!!! tip deepseek-chat TL;DR

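The paper proposes a lightweight framework based on federated hyperdimensional computing (federated HDC) to address the challenges of distributed intelligence in the resource-constrained Industrial IoT (IIoT), achieving fast-converging and communication-efficient collaborative learning.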

Abstract (Chinese translation)

在工业物联网（IIoT）系统中，边缘设备通常在内存、计算能力和无线带宽方面面临严格限制。这些限制对部署高级数据分析任务（如预测性维护和规范性维护）构成了挑战。本研究探索了超维计算（HDC）作为一种适用于资源受限IIoT的轻量级学习范式。传统的集中式HDC利用高维向量空间的特性，实现了高能效的训练与推理。我们将此范式集成到联邦学习（FL）框架中，在该框架下设备仅交换原型表示，从而显著降低了通信开销。数值结果凸显了联邦HDC在IIoT中支持协作学习的潜力，其具备快速收敛速度和通信高效性。这些结果表明，HDC为大规模、资源受限的IIoT环境中的分布式智能提供了一个轻量且鲁棒的框架。

摘要 (Abstract)

In the Industrial Internet of Things (IIoT) systems, edge devices often operate under strict constraints in memory, compute capability, and wireless bandwidth. These limitations challenge the deployment of advanced data analytics tasks, such as predictive and prescriptive maintenance. In this work, we explore hyperdimensional computing (HDC) as a lightweight learning paradigm for resource-constrained IIoT. Conventional centralized HDC leverages the properties of high-dimensional vector spaces to enable energy-efficient training and inference. We integrate this paradigm into a federated learning (FL) framework where devices exchange only prototype representations, which significantly reduces communication overhead. Our numerical results highlight the potential of federated HDC to support collaborative learning in IIoT with fast convergence speed and communication efficiency. These results indicate that HDC represents a lightweight and resilient framework for distributed intelligence in large-scale and resource-constrained IIoT environments.

Keywords: Federated Learning, Hyperdimensional Computing, Industrial IoT, Resource-Constrained, Edge Devices, Communication Efficiency, Lightweight Learning, Distributed Intelligence
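
To make the idea of devices exchanging only prototype representations concrete, here is a minimal sketch. It assumes bipolar hypervectors, per-class bundling by elementwise summation, and server-side aggregation by summing prototypes; the encoding, the aggregation rule, the dimensionality, and all names are illustrative assumptions, not the paper's protocol.

```python
import random

DIM = 2000  # hypervector dimensionality (illustrative choice)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def noisy_copy(hv, rng, flip_prob=0.2):
    """Simulate an encoded sample: the class seed with some entries flipped."""
    return [-x if rng.random() < flip_prob else x for x in hv]


def local_prototypes(samples):
    """Device side: bundle (elementwise-sum) encoded samples per class."""
    protos = {}
    for label, hv in samples:
        acc = protos.setdefault(label, [0] * DIM)
        for i, x in enumerate(hv):
            acc[i] += x
    return protos


def aggregate(all_protos):
    """Server side: sum the per-class prototypes received from devices."""
    merged = {}
    for protos in all_protos:
        for label, proto in protos.items():
            acc = merged.setdefault(label, [0] * DIM)
            for i, x in enumerate(proto):
                acc[i] += x
    return merged


def classify(hv, protos):
    """Pick the class whose prototype has the largest dot product with hv."""
    return max(protos, key=lambda lbl: sum(a * b for a, b in zip(hv, protos[lbl])))


rng = random.Random(0)
class_seeds = {"normal": random_hv(rng), "fault": random_hv(rng)}
# Three devices, each holding five noisy samples per class; raw samples stay local.
devices = [
    [(lbl, noisy_copy(seed, rng)) for lbl, seed in class_seeds.items() for _ in range(5)]
    for _ in range(3)
]
global_protos = aggregate(local_prototypes(d) for d in devices)
prediction = classify(noisy_copy(class_seeds["fault"], rng), global_protos)
```

Each device uploads one vector per class regardless of how many samples it holds, which is where the communication saving in this toy setup comes from.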

204. ❌ Continual Learning as Shared-Manifold Continuation Under Compatible Shift

Authors: Henry J. Kobs Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20036v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

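Scoring rationale: The paper studies a geometric approach to continual learning, specifically reducing catastrophic forgetting by preserving a shared manifold structure. Although continual learning is a relevant concept in large-model training, the paper concentrates on computer vision tasks (CIFAR10, Tiny-ImageNet) and synthetic benchmarks, and does not involve large language models, innovations in deep learning principles, or scientific applications. The keywords all target specific aspects of LLM technology, such as training methods, inference optimization, alignment, and applications, while the core of this paper is a geometric view of continual learning and representation preservation, which has no direct connection to those keywords.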

!!! tip deepseek-chat TL;DR

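The paper proposes SPMA-OG, a continual learning method based on shared-manifold continuation that uses geometry-aware anchor regularization to preserve old-task representations under compatible data shift, improving old-task retention on CIFAR10 and Tiny-ImageNet while staying competitive on new-task accuracy.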

Abstract (Chinese translation)

持续学习方法通常通过正则化参数、匹配旧输出或回放先前样本来保持旧有行为。这些策略可以减少遗忘，但并未直接规定潜在表征应如何演化。我们研究了一种更精确的几何替代方案，适用于新旧数据应保持相同潜在支撑的情形：将持续学习视为共享流形的延续。我们在"支撑保持流形同化"框架中实例化了这一观点，并评估了其几何保持变体SPMA-OG。该方法结合了稀疏回放、输出蒸馏、关系几何保持、局部平滑及基于旧锚点的图表分配正则化。在具有兼容性偏移的CIFAR10和Tiny-ImageNet典型实验中，SPMA-OG在旧任务保持度和表征保持指标上优于稀疏回放基线，同时在新任务准确率上保持竞争力。在受控的合成图册流形基准测试中，该方法实现了近乎完美的锚点几何保持，同时通过回放机制提升了新任务准确率。这些结果表明，当持续学习需要保持共享潜在支撑而非创建新支撑时，几何感知的锚点正则化是一种有效的归纳偏置。

摘要 (Abstract)

Continual learning methods usually preserve old behavior by regularizing parameters, matching old outputs, or replaying previous examples. These strategies can reduce forgetting, but they do not directly specify how the latent representation should evolve. We study a narrower geometric alternative for the regime where old and new data should remain on the same latent support: continual learning as continuation of a shared manifold. We instantiate this view within Support-Preserving Manifold Assimilation (SPMA) and evaluate a geometry-preserving variant, SPMA-OG, that combines sparse replay, output distillation, relational geometry preservation, local smoothing, and chart-assignment regularization on old anchors. On representative compatible-shift CIFAR10 and Tiny-ImageNet runs, SPMA-OG improves over sparse replay baselines in old-task retention and representation-preservation metrics while remaining competitive on new-task accuracy. On a controlled synthetic atlas-manifold benchmark, it achieves near-perfect anchor-geometry preservation while also improving new-task accuracy over replay. These results provide evidence that geometry-aware anchor regularization is a useful inductive bias when continual learning should preserve a shared latent support rather than create a new one.

Keywords: continual learning, shared manifold, geometry preservation, representation learning, catastrophic forgetting, SPMA-OG, anchor regularization, compatible shift
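
One plausible reading of "relational geometry preservation" on old anchors is a penalty on changes in pairwise distances between anchor representations. The sketch below illustrates that reading only; the loss form and the function names are assumptions and are not taken from the paper.

```python
import math


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    dists = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            dists.append(math.sqrt(sq))
    return dists


def relational_geometry_loss(old_feats, new_feats):
    """Mean squared change in pairwise anchor distances (0 if geometry is kept)."""
    old_d = pairwise_distances(old_feats)
    new_d = pairwise_distances(new_feats)
    return sum((o - n) ** 2 for o, n in zip(old_d, new_d)) / len(old_d)


anchors_old = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
shifted = [[x + 1.0, y + 1.0] for x, y in anchors_old]  # rigid shift keeps distances
scaled = [[2.0 * x, 2.0 * y] for x, y in anchors_old]   # scaling distorts distances

loss_shifted = relational_geometry_loss(anchors_old, shifted)
loss_scaled = relational_geometry_loss(anchors_old, scaled)
```

A penalty of this kind ignores rigid motions of the representation and reacts only when the relative arrangement of anchors changes.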

205. ❌ Graph-Informed Adversarial Modeling: Infimal Subadditivity of Interpolative Divergences

Authors: Panagiota Birmpa, Eric Joseph Hall Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20025v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

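Scoring rationale: The paper is a theoretical analysis of graph-informed generative adversarial networks (GANs). Its core contribution is a proof of an infimal subadditivity principle for interpolative divergences under a Bayesian network structure, together with a graph-informed GAN framework that replaces the global discriminator with localized discriminators. The scoring keywords all focus on large language models (LLMs) and related techniques (training methods, inference optimization, applications, and so on), whereas this paper does not involve LLMs, deep learning model architectures, training techniques, or scientific AI applications. It is theoretical machine learning research on conventional GANs, so the relevance of every keyword is 0.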

!!! tip deepseek-chat TL;DR

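The paper studies the theory of adversarial learning when the target distribution factorizes according to a known Bayesian network, proves an infimal subadditivity principle for interpolative divergences, and thereby provides a theoretical justification for replacing a graph-agnostic GAN with a global discriminator by a graph-informed GAN with localized discriminators.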

Abstract (Chinese translation)

本研究探讨了当目标分布依据已知贝叶斯网络因子化时的对抗学习问题。针对插值型散度（包括$(f,Γ)$-散度），我们证明了一种新的下确界次可加性原理：在适当条件下，全局变分差异可由与图结构对齐的族级差异的平均值控制。在加性机制下，该替代度量是精确的。这为将使用单一判别器的图无关生成对抗网络（GAN），替换为使用局部族级判别器的图结构感知GAN提供了变分理论依据。该结果不要求优化器本身依据图结构因子化。我们同样获得了积分概率度量与近端最优传输散度的平行结论，识别了理论适用的自然判别器类别，并通过实验证明，相较于图无关基线方法，所提方法在稳定性和结构恢复方面均有提升。

摘要 (Abstract)

We study adversarial learning when the target distribution factorizes according to a known Bayesian network. For interpolative divergences, including $(f,Γ)$-divergences, we prove a new infimal subadditivity principle showing that, under suitable conditions, a global variational discrepancy is controlled by an average of family-level discrepancies aligned with the graph. In an additive regime, this surrogate is exact. This provides a variational justification for replacing a graph-agnostic GAN with a monolithic discriminator by a graph-informed GAN with localized family-level discriminators. The result does not require the optimizer itself to factorize according to the graph. We also obtain parallel results for integral probability metrics and proximal optimal transport divergences, identify natural discriminator classes for which the theory applies, and present experiments showing improved stability and structural recovery relative to graph-agnostic baselines.

Keywords: adversarial learning, Bayesian network, interpolative divergences, infimal subadditivity, variational discrepancy, graph-informed GAN, family-level discriminators, proximal optimal transport

206. ❌ ODySSeI: An Open-Source End-to-End Framework for Automated Detection, Segmentation, and Severity Estimation of Lesions in Invasive Coronary Angiography Images

Authors: Anand Choudhary, Xiaowu Sun, Thabo Mahendiran, Ortal Senouf, Denise Auberson, Bernard De Bruyne, Stephane Fournier, Olivier Muller, Emmanuel Abbé, Pascal Frossard, Dorina Thanou Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20021v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

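Scoring rationale: The paper focuses on medical image analysis (coronary angiography), using deep learning for lesion detection, segmentation, and severity estimation, which is an application of AI in biomedicine. The keywords all relate directly to large-model (LLM) techniques, training methods, inference optimization, agent systems, and similar topics, whereas this paper involves no large-model technique and applies conventional deep learning models (such as CNNs) to computer vision tasks. The only related keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper is an application of AI in biomedicine (adjacent to bioinformatics), but that is not its core contribution, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.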

!!! tip deepseek-chat TL;DR

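The paper presents ODySSeI, an open-source end-to-end framework for automated detection, segmentation, and severity estimation of lesions in invasive coronary angiography images. It targets the subjectivity and variability of clinical interpretation and is validated on multi-center datasets, where it shows strong generalization and real-time performance.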

Abstract (Chinese translation)

有创冠状动脉造影（Invasive Coronary Angiography, ICA）是评估冠状动脉疾病的临床金标准。然而，其判读仍具有主观性，且易受操作者内部及操作者间差异的影响。本研究介绍了ODySSeI：一个用于ICA图像中病变自动检测、分割和严重程度评估的开源端到端框架。ODySSeI集成了基于深度学习的病变检测和病变分割模型，这些模型采用一种新颖的金字塔式增强方案（Pyramidal Augmentation Scheme, PAS）进行训练，以提升在不同患者队列（来自欧洲、北美和亚洲的2149名患者）中的鲁棒性和实时性能。此外，我们提出了一种无需定量冠状动脉造影的病变严重程度评估（Lesion Severity Estimation, LSE）技术，可直接根据预测的病变几何形状计算最小管腔直径（Minimum Lumen Diameter, MLD）和直径狭窄程度。在分布内和分布外临床数据集上的广泛评估表明，ODySSeI具有很强的泛化能力。与相对简单的任务相比，我们的PAS在高度复杂的任务中带来了显著的性能提升，具体而言，病变检测性能相比其基线提高了2.5倍，而病变分割性能则提升了1-3%。我们的LSE技术实现了高精度，预测的MLD值与对应的真实值仅相差±2-3个像素。平均而言，ODySSeI在CPU上仅需数秒、在GPU上仅需几分之一秒即可处理一张原始ICA图像，并可通过swisscardia.epfl.ch上的即插即用网络界面进行访问。总体而言，本研究确立了ODySSeI作为一个全面且开源的框架，支持自动化、可重复和可扩展的ICA分析，以用于实时临床决策。

摘要 (Abstract)

Invasive Coronary Angiography (ICA) is the clinical gold standard for the assessment of coronary artery disease. However, its interpretation remains subjective and prone to intra- and inter-operator variability. In this work, we introduce ODySSeI: an Open-source end-to-end framework for automated Detection, Segmentation, and Severity estimation of lesions in ICA images. ODySSeI integrates deep learning-based lesion detection and lesion segmentation models trained using a novel Pyramidal Augmentation Scheme (PAS) to enhance robustness and real-time performance across diverse patient cohorts (2149 patients from Europe, North America, and Asia). Furthermore, we propose a quantitative coronary angiography-free Lesion Severity Estimation (LSE) technique that directly computes the Minimum Lumen Diameter (MLD) and diameter stenosis from the predicted lesion geometry. Extensive evaluation on both in-distribution and out-of-distribution clinical datasets demonstrates ODySSeI’s strong generalizability. Our PAS yields large performance gains in highly complex tasks as compared to relatively simpler ones, notably, a 2.5-fold increase in lesion detection performance versus a 1-3% increase in lesion segmentation performance over their respective baselines. Our LSE technique achieves high accuracy, with predicted MLD values differing by only $\pm$ 2-3 pixels from the corresponding ground truths. On average, ODySSeI processes a raw ICA image within only a few seconds on a CPU and in a fraction of a second on a GPU and is available as a plug-and-play web interface at swisscardia.epfl.ch. Overall, this work establishes ODySSeI as a comprehensive and open-source framework which supports automated, reproducible, and scalable ICA analysis for real-time clinical decision-making.

Keywords: Invasive Coronary Angiography, lesion detection, lesion segmentation, deep learning, Pyramidal Augmentation Scheme, Lesion Severity Estimation, real-time clinical decision-making, open-source framework
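
The abstract states that MLD and diameter stenosis are computed from the predicted lesion geometry. As a rough illustration, the sketch below applies the conventional percent-diameter-stenosis definition, (1 - MLD / reference diameter) × 100. How ODySSeI obtains the diameter profile and the reference diameter is not described here, so the inputs, the averaging of proximal and distal references, and the function name are assumptions.

```python
def lesion_severity(diameters_px, ref_proximal_px, ref_distal_px):
    """Return (MLD, percent diameter stenosis) from a lumen diameter profile.

    diameters_px: lumen diameters sampled along the lesion, in pixels.
    The reference diameter is taken as the mean of the proximal and distal
    reference diameters, which is an assumption of this sketch.
    """
    if not diameters_px:
        raise ValueError("need at least one diameter measurement")
    mld = min(diameters_px)
    reference = (ref_proximal_px + ref_distal_px) / 2.0
    stenosis_pct = (1.0 - mld / reference) * 100.0
    return mld, stenosis_pct


# Hypothetical diameter profile across a lesion, in pixels.
mld, stenosis = lesion_severity([30, 22, 15, 18, 28], ref_proximal_px=32, ref_distal_px=28)
```

Because the MLD enters the ratio directly, a pixel-level error in the MLD has a larger effect on the stenosis percentage for small vessels than for large ones.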

207. ❌ A Super Fast K-means for Indexing Vector Embeddings

Authors: Leonardo Kuffo, Sven Hepkema, Peter Boncz Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20009v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

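Scoring rationale: The paper focuses on optimizing an efficient k-means algorithm for clustering vector embeddings and has no direct connection to any scoring keyword (all of which revolve around large-model principles, training methods, inference optimization, alignment, applications, and so on). It does not involve language models, deep learning architectures, training techniques, inference acceleration, alignment methods, or scientific AI applications. It is low-level research on vector indexing and clustering algorithms, a different area from the one the keywords cover.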

!!! tip deepseek-chat TL;DR

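The paper proposes SuperKMeans, an efficient k-means variant that uses dimension pruning and recall-based early termination to substantially speed up clustering of high-dimensional vector embeddings while preserving quality for vector similarity search.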

Abstract (Chinese translation)

本文提出SuperKMeans：一种专为高维向量嵌入集合聚类而设计的k-means变体。在现代CPU上，SuperKMeans的聚类速度比FAISS和Scikit-Learn快达7倍；在GPU上比cuVS快达4倍（图1），同时能在向量相似性搜索任务中保持所得质心的质量。SuperKMeans的加速源于通过可靠高效地剪枝在向量分配至质心过程中不必要的维度，从而减少数据访问和计算开销。此外，我们提出了基于召回率的提前终止机制，这是一种新颖的方法，当检索任务中质心质量在迭代过程中不再提升时，提前终止k-means过程。在实际应用中，该机制进一步降低了运行时间，且不影响检索质量。我们在https://github.com/cwida/SuperKMeans开源了实现代码。

摘要 (Abstract)

We present SuperKMeans: a k-means variant designed for clustering collections of high-dimensional vector embeddings. SuperKMeans’ clustering is up to 7x faster than FAISS and Scikit-Learn on modern CPUs and up to 4x faster than cuVS on GPUs (Figure 1), while maintaining the quality of the resulting centroids for vector similarity search tasks. SuperKMeans acceleration comes from reducing data-access and compute overhead by reliably and efficiently pruning dimensions that are not needed to assign a vector to a centroid. Furthermore, we present Early Termination by Recall, a novel mechanism that early-terminates k-means when the quality of the centroids for retrieval tasks stops improving across iterations. In practice, this further reduces runtimes without compromising retrieval quality. We open-source our implementation at https://github.com/cwida/SuperKMeans

Keywords: SuperKMeans, k-means clustering, vector embeddings, high-dimensional vectors, retrieval quality, early termination, dimension pruning, similarity search
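
The general idea of skipping dimensions that cannot change a vector's assignment can be illustrated with the classic partial-distance technique: accumulate the squared distance block by block and abandon a centroid once the partial sum already exceeds the best distance found so far. This is a generic sketch, not the SuperKMeans algorithm; the block size and function name are hypothetical.

```python
def assign_with_pruning(vector, centroids, block=4):
    """Nearest-centroid assignment with partial-distance early abandoning.

    Returns (best_index, best_squared_distance, dimensions_scanned).
    """
    best_idx, best_dist = -1, float("inf")
    dims_scanned = 0
    for idx, centroid in enumerate(centroids):
        partial = 0.0
        pruned = False
        for start in range(0, len(vector), block):
            stop = start + block
            for v, c in zip(vector[start:stop], centroid[start:stop]):
                partial += (v - c) ** 2
            dims_scanned += len(vector[start:stop])
            if partial >= best_dist:
                # The remaining dimensions can only increase the distance.
                pruned = True
                break
        if not pruned:
            best_idx, best_dist = idx, partial
    return best_idx, best_dist, dims_scanned


query = [0.0] * 8
centroids = [[1.0] * 8, [0.1] * 8, [5.0] * 8]
best_idx, best_dist, scanned = assign_with_pruning(query, centroids)
```

In this toy input the third centroid is abandoned after its first block, so fewer than the full 24 dimension comparisons are performed while the assignment matches a brute-force scan.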

208. ❌ AgenticRS-EnsNAS: Ensemble-Decoupled Self-Evolving Architecture Search

Authors: Yun Chen, Moyu Zhang, Jinxin Hu, Yu Zhang, Xiaoyi Zeng Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20014v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	5.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

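Scoring rationale: The paper addresses the efficiency of ensemble validation in neural architecture search (NAS) and proposes a decoupled architecture search framework. It is unrelated to most keywords because it focuses on NAS and ensemble learning rather than on large-model techniques themselves. Only two keywords are weakly related: (1) 'Large Language Models' (weight 1.0): the paper mentions LLM-driven search as one method for discrete architecture search, but this is not its core content, so it receives 5 points; (2) 'LLM Agents' (weight 1.0): the title contains 'Agentic', but the content does not discuss agents in depth and the term serves only as part of the framework's name, so it receives 5 points. Other keywords such as MoE, SFT, and RAG are not covered. Weighted total: 5 × 1.0 + 5 × 1.0 = 10.0, far below the dynamic passing score of 26.6, indicating low relevance to large-model techniques.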
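
The arithmetic in this rationale can be reproduced in a few lines. The sketch uses the passing-score formula from the report configuration (5.0 + 0.8 × total weight, with 27 keywords of weight 1.0) and assumes, as the rationale states, a weight of 1.0 for the two matched keywords; the function names are hypothetical.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Passing score = base + factor x total weight."""
    return base + factor * total_weight


def weighted_total(entries):
    """Sum weight x relevance over (weight, relevance) pairs."""
    return sum(weight * relevance for weight, relevance in entries)


threshold = passing_score(27.0)                          # 27 keywords, weight 1.0 each
paper_total = weighted_total([(1.0, 5.0), (1.0, 5.0)])   # the two weakly related keywords
passed = paper_total >= threshold
```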

!!! tip deepseek-chat TL;DR

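The paper tackles the prohibitive cost of ensemble validation in industrial-scale neural architecture search. It proposes a decoupled architecture search framework that uses theoretical analysis and lightweight evaluation to reduce the per-candidate search cost from O(M) to O(1).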

Abstract (Chinese translation)

神经网络架构搜索在工业生产系统中的部署面临一个根本性的验证瓶颈：验证单个候选架构π需要评估由M个模型组成的部署集成，导致每个候选架构产生难以承受的O(M)计算成本。这一成本壁垒严重限制了实际应用中的架构迭代频率，而在这些应用中，集成方法（M=50-200）是保证鲁棒性的标准做法。本文提出了集成解耦架构搜索框架，该框架利用集成理论，通过单学习器评估来预测系统级性能。我们建立了集成解耦理论，并在同质性假设下给出了集成性能单调提升的充分条件：若候选架构π满足ρ(π) < ρ(π_旧) - (M / (M - 1)) * (ΔE(π) / σ²(π))，则其产生的集成误差低于当前基线架构，其中ΔE、ρ和σ²可通过轻量级双学习器训练进行估计。该方法将架构搜索与完整集成训练解耦，将每个候选架构的搜索成本从O(M)降至O(1)，同时仅对验证通过的优胜者保留O(M)的部署成本。我们统一了全流程连续性中的解决策略：（1）对可处理的连续π（以CTR预测中的特征装袋为例）采用闭式优化；（2）对难处理的连续π采用约束可微优化；（3）对离散π采用基于大语言模型驱动搜索与迭代单调接受准则。该框架揭示了两种正交的改进机制——基学习器多样性增益与准确性增益——为工业级神经网络架构搜索提供了可操作的设计原则。所有理论推导均严谨，详细证明见附录。全面的实证验证将包含于本工作的期刊扩展版本中。

摘要 (Abstract)

Neural Architecture Search (NAS) deployment in industrial production systems faces a fundamental validation bottleneck: verifying a single candidate architecture pi requires evaluating the deployed ensemble of M models, incurring prohibitive O(M) computational cost per candidate. This cost barrier severely limits architecture iteration frequency in real-world applications where ensembles (M=50-200) are standard for robustness. This work introduces Ensemble-Decoupled Architecture Search, a framework that leverages ensemble theory to predict system-level performance from single-learner evaluation. We establish the Ensemble-Decoupled Theory with a sufficient condition for monotonic ensemble improvement under homogeneity assumptions: a candidate architecture pi yields lower ensemble error than the current baseline if rho(pi) < rho(pi_old) - (M / (M - 1)) * (Delta E(pi) / sigma^2(pi)), where Delta E, rho, and sigma^2 are estimable from lightweight dual-learner training. This decouples architecture search from full ensemble training, reducing per-candidate search cost from O(M) to O(1) while maintaining O(M) deployment cost only for validated winners. We unify solution strategies across pipeline continuity: (1) closed-form optimization for tractable continuous pi (exemplified by feature bagging in CTR prediction), (2) constrained differentiable optimization for intractable continuous pi, and (3) LLM-driven search with iterative monotonic acceptance for discrete pi. The framework reveals two orthogonal improvement mechanisms – base diversity gain and accuracy gain – providing actionable design principles for industrial-scale NAS. All theoretical derivations are rigorous with detailed proofs deferred to the appendix. Comprehensive empirical validation will be included in the journal extension of this work.

Keywords: Neural Architecture Search, Ensemble Learning, Architecture Search, Ensemble-Decoupled Theory, Computational Cost Reduction, Industrial Deployment, Lightweight Evaluation, Monotonic Improvement
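
The sufficient condition quoted in the abstract, rho(pi) < rho(pi_old) - (M / (M - 1)) * (Delta E(pi) / sigma^2(pi)), translates directly into an acceptance check. The sketch below only evaluates that inequality for given estimates; how Delta E, rho, and sigma^2 are estimated is outside its scope, and the function name and example numbers are hypothetical.

```python
def ensemble_improves(rho_new, rho_old, delta_e, sigma2, m):
    """Evaluate rho_new < rho_old - (m / (m - 1)) * (delta_e / sigma2).

    rho_new, rho_old: correlation estimates for the candidate and baseline.
    delta_e, sigma2:  the Delta E and sigma^2 estimates for the candidate.
    m:                ensemble size (M in the abstract).
    """
    if m < 2:
        raise ValueError("the condition needs an ensemble of at least two models")
    if sigma2 <= 0:
        raise ValueError("sigma2 must be positive")
    threshold = rho_old - (m / (m - 1)) * (delta_e / sigma2)
    return rho_new < threshold


# Hypothetical estimates for two candidates against the same baseline (M = 100).
accept_diverse = ensemble_improves(0.30, 0.50, delta_e=0.01, sigma2=0.1, m=100)
accept_similar = ensemble_improves(0.45, 0.50, delta_e=0.01, sigma2=0.1, m=100)
```

The check costs the same regardless of M, which matches the abstract's point that candidates can be screened without training the full ensemble.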

209. ❌ Channel Prediction-Based Physical Layer Authentication under Consecutive Spoofing Attacks

Authors: Yijia Guo, Junqing Zhang, Yao-Win Peter Hong Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19962v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

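Scoring rationale: The paper studies physical layer authentication in wireless network security, using a Transformer for channel prediction to counter consecutive spoofing attacks. The scoring keywords all relate directly to large models, deep learning principles, or scientific AI applications, whereas this paper uses a Transformer only as a prediction module. It does not address large-model techniques, training methods, inference optimization, alignment, agent systems, or other core topics, and it is not applied to scientific fields such as bioinformatics, so the relevance of every keyword is 0.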

!!! tip deepseek-chat TL;DR

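The paper proposes a channel prediction-based physical layer authentication framework that uses a Transformer to predict legitimate channel state information, significantly improving authentication accuracy and remaining robust under consecutive spoofing attacks.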

Abstract (Chinese translation)

无线网络极易遭受欺骗攻击，尤其在攻击者连续发送欺骗数据包的情况下。传统的物理层认证方法主要关注单数据包欺骗攻击，但在连续欺骗攻击下，由于设备移动性和信道衰落引起的信道演化，这些方法会失效。为应对这一挑战，我们提出了一种基于信道预测的物理层认证框架。具体而言，该框架采用基于Transformer的信道预测模块来预测欺骗间隔内的合法信道状态信息测量值，并根据认证决策结果，自适应地使用预测或观测到的CSI测量值更新信道预测模块的输入，以确保对持续欺骗攻击的鲁棒性。在瑞利衰落信道下的仿真结果表明，所提方法实现了较低的预测误差，且认证准确率显著高于传统基准方案，即使在长时间欺骗攻击下仍能保持鲁棒性。

摘要 (Abstract)

Wireless networks are highly vulnerable to spoofing attacks, especially when attackers transmit consecutive spoofing packets. Conventional physical layer authentication (PLA) methods have mostly focused on single-packet spoofing attack. However, under consecutive spoofing attacks, they become ineffective due to channel evolution caused by device mobility and channel fading. To address this challenge, we propose a channel prediction-based PLA framework. Specifically, a Transformer-based channel prediction module is employed to predict legitimate CSI measurements during spoofing interval, and the input of channel prediction module is adaptively updated with predicted or observed CSI measurements based on the authentication decision to ensure robustness against sustained spoofing. Simulation results under Rayleigh fading channels demonstrate that the proposed approach achieves low prediction error and significantly higher authentication accuracy than conventional benchmark, maintaining robustness even under extended spoofing attacks.

Keywords: Physical Layer Authentication, Spoofing Attacks, Channel Prediction, Transformer, CSI Measurements, Wireless Security, Rayleigh Fading
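
The adaptive input update described in the abstract (feed the predictor an observed CSI value when the packet is accepted, and the predicted value when it is rejected) can be sketched as a simple loop. This is a toy illustration under assumptions: CSI is reduced to a scalar, a linear extrapolation stands in for the Transformer predictor, and the threshold test and all names are hypothetical.

```python
from collections import deque


def predict_next(history):
    """Placeholder predictor: linear extrapolation from the last two values."""
    values = list(history)
    return values[-1] + (values[-1] - values[-2])


def authenticate_stream(initial_csi, observations, threshold):
    """Return one accept/reject decision per observed CSI value."""
    history = deque(initial_csi, maxlen=len(initial_csi))
    decisions = []
    for observed in observations:
        predicted = predict_next(history)
        legit = abs(observed - predicted) <= threshold
        decisions.append(legit)
        # Accepted packets update the window with the observation;
        # rejected packets are replaced by the prediction.
        history.append(observed if legit else predicted)
    return decisions


# Two consecutive spoofed packets (5.0, 5.1) arrive between legitimate ones.
decisions = authenticate_stream([1.0, 1.1, 1.2], [1.3, 5.0, 5.1, 1.6], threshold=0.2)
```

Because rejected observations never enter the history window, the predictor keeps tracking the legitimate channel through the spoofing interval, and the legitimate packet that follows is still accepted.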

210. ❌ Model-Driven Learning-Based Physical Layer Authentication for Mobile Wi-Fi Devices

Authors: Yijia Guo, Junqing Zhang, Yao-Win Peter Hong, Stefano Tomasin Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19972v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

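Scoring rationale: This paper studies a physical layer authentication (PLA) scheme and proposes LiteNP-Net, a lightweight neural network driven by hypothesis testing, for wireless channel authentication of Wi-Fi devices. Its core topics are wireless communication security, physical layer authentication, and applied deep learning. All of the keywords, however, concern large language models (LLMs), innovations in deep learning methodology, or AI applications in science, whereas the deep learning here is confined to a specific wireless authentication task and does not touch on large models, their underlying techniques, or novel AI-for-science applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper addresses physical layer authentication for wireless IoT devices. It proposes LiteNP-Net, a lightweight neural network driven by hypothesis testing, and shows in simulations and real Wi-Fi environments that it outperforms both conventional methods and state-of-the-art Siamese-based methods.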

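Abstract translation

The rise of wireless technologies has made the Internet of Things (IoT) ubiquitous, but the broadcast nature of wireless communications exposes IoT to authentication risks. Physical layer authentication (PLA) offers a promising solution by exploiting the unique characteristics of wireless channels. As a common approach in PLA, hypothesis testing yields the theoretically optimal Neyman-Pearson (NP) detector, but its dependence on channel statistics limits its practicality in real-world scenarios. Deep learning-based PLA methods, by contrast, are practical but often suboptimal. To address these challenges, we propose a learning-based PLA scheme driven by hypothesis testing and evaluate it through extensive simulations and experiments using Wi-Fi. Specifically, we incorporate conditional statistical models into the hypothesis testing framework to derive a theoretically optimal NP detector. Building on this, we develop LiteNP-Net, a lightweight neural network driven by the NP detector. Simulation results show that LiteNP-Net approaches the performance of the NP detector even without prior knowledge of the channel statistics. To further assess its effectiveness in practice, we deployed an experimental testbed using Wi-Fi IoT development kits in a variety of real-world scenarios. Experimental results show that LiteNP-Net outperforms the conventional correlation-based method as well as state-of-the-art Siamese network-based methods.

Abstract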

The rise of wireless technologies has made the Internet of Things (IoT) ubiquitous, but the broadcast nature of wireless communications exposes IoT to authentication risks. Physical layer authentication (PLA) offers a promising solution by leveraging unique characteristics of wireless channels. As a common approach in PLA, hypothesis testing yields a theoretically optimal Neyman-Pearson (NP) detector, but its reliance on channel statistics limits its practicality in real-world scenarios. In contrast, deep learning-based PLA approaches are practical but tend to be not optimal. To address these challenges, we proposed a learning-based PLA scheme driven by hypothesis testing and conducted extensive simulations and experimental evaluations using Wi-Fi. Specifically, we incorporated conditional statistical models into the hypothesis testing framework to derive a theoretically optimal NP detector. Building on this, we developed LiteNP-Net, a lightweight neural network driven by the NP detector. Simulation results demonstrated that LiteNP-Net could approach the performance of the NP detector even without prior knowledge of the channel statistics. To further assess its effectiveness in practical environments, we deployed an experimental testbed using Wi-Fi IoT development kits in various real-world scenarios. Experimental results demonstrated that the LiteNP-Net outperformed the conventional correlation-based method as well as state-of-the-art Siamese-based methods.

Keywords: Physical Layer Authentication, Wireless IoT, Hypothesis Testing, Neyman-Pearson Detector, Deep Learning, Wi-Fi, LiteNP-Net, Channel Statistics
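
The abstract explains that the Neyman-Pearson detector needs the channel statistics. A textbook toy case shows why; it is not the paper's conditional model or LiteNP-Net. Assume a scalar test statistic distributed as N(0, sigma^2) for the legitimate device and N(mu, sigma^2) with mu > 0 for a spoofer. The function names and the Gaussian model are illustrative assumptions.

```python
from statistics import NormalDist


def np_threshold(sigma, alpha):
    """Threshold that fixes the false-alarm probability at alpha under H0."""
    return NormalDist(0.0, sigma).inv_cdf(1.0 - alpha)


def np_detect(x, sigma, alpha):
    """Declare 'spoofer' (H1) when the statistic exceeds the NP threshold.

    For a Gaussian mean shift with mu > 0 the likelihood ratio is monotone
    in x, so thresholding x is equivalent to thresholding the ratio.
    """
    return x > np_threshold(sigma, alpha)


def detection_probability(mu, sigma, alpha):
    """Probability of detecting a spoofer whose statistic has mean mu."""
    tau = np_threshold(sigma, alpha)
    return 1.0 - NormalDist(mu, sigma).cdf(tau)
```

The threshold depends on `sigma`, so the detector cannot be set without knowing it. That dependence on channel statistics is the practical limitation the abstract attributes to the NP approach.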

211. ❌ Structural Controllability of Large-Scale Hypergraphs

Authors: Joshua Pickard, Xin Mao, Can Chen Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19955v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

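Scoring rationale: The paper studies the structural controllability of hypergraphs, which belongs to network science and complex systems and has no direct connection to deep learning or large-model techniques. All keywords concern the large-model technology stack (training, inference, alignment, applications, and so on), none of which the paper covers. Only the 'AI for Science' keyword received a relevance of 5 (some association), because the paper falls under network science, which can be viewed as a scientific application; even so, the paper solves its problem with mathematical control theory rather than AI methods.

!!! tip deepseek-chat TL;DR

For large-scale hypergraphs with higher-order interactions, this paper establishes a theoretical framework for structural controllability and designs a scalable driver node selection algorithm.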

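Abstract translation

Controlling real-world networked systems with higher-order interactions, including ecological, biomedical, and engineered networks, remains challenging because of their inherent nonlinearity and large system scale. Although graph controllability has been studied extensively, the controllability properties of hypergraphs remain largely underdeveloped. Existing results focus mainly on exact controllability, which is often impractical for large-scale hypergraphs. This article establishes a structural controllability framework for hypergraphs by modeling hypergraph dynamics as polynomial dynamical systems. Specifically, we extend the classical notions of accessibility and dilation from linear graph-based systems to polynomial hypergraph dynamics and establish a hypergraph-based criterion under which the topology guarantees that the classical Lie-algebraic and Kalman-type rank conditions hold for almost all parameter choices. We further derive a topology-based lower bound on the minimum number of driver nodes required for structural controllability and use this bound to design a scalable driver node selection algorithm that combines dilation-aware initialization via maximum matching with greedy accessibility expansion. Numerical experiments on hypergraphs with tens to thousands of nodes and higher-order interactions demonstrate the effectiveness and scalability of the proposed framework.

Abstract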

Controlling real-world networked systems, including ecological, biomedical, and engineered networks that exhibit higher-order interactions, remains challenging due to inherent nonlinearities and large system scales. Despite extensive studies on graph controllability, the controllability properties of hypergraphs remain largely underdeveloped. Existing results focus primarily on exact controllability, which is often impractical for large-scale hypergraphs. In this article, we develop a structural controllability framework for hypergraphs by modeling hypergraph dynamics as polynomial dynamical systems. In particular, we extend classical notions of accessibility and dilation from linear graph-based systems to polynomial hypergraph dynamics and establish a hypergraph-based criterion under which the topology guarantees satisfaction of classical Lie-algebraic and Kalman-type rank conditions for almost all parameter choices. We further derive a topology-based lower bound on the minimum number of driver nodes required for structural controllability and leverage this bound to design a scalable driver node selection algorithm combining dilation-aware initialization via maximum matching with greedy accessibility expansion. We demonstrate the effectiveness and scalability of the proposed framework through numerical experiments on hypergraphs with tens to thousands of nodes and higher-order interactions.

Keywords: hypergraph controllability, structural controllability, higher-order interactions, polynomial dynamical systems, driver node selection, large-scale networks, networked systems
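
The abstract mentions initialization via maximum matching. The sketch below shows that idea for the classical case of an ordinary directed graph with linear dynamics, where nodes left unmatched by a maximum matching serve as driver nodes. It is not the paper's hypergraph algorithm. The function name `driver_nodes` and the use of a simple augmenting-path matching are choices made for this illustration.

```python
def driver_nodes(n, edges):
    """Driver nodes of a directed graph on nodes 0..n-1 via maximum matching.

    edges is a list of (u, v) pairs meaning u -> v. A node is matched when
    some matching edge points into it; unmatched nodes need their own input.
    """
    successors = [[] for _ in range(n)]
    for u, v in edges:
        successors[u].append(v)

    # matched_by[v] = u when edge u -> v belongs to the current matching.
    matched_by = [None] * n

    def try_augment(u, seen):
        # Standard augmenting-path search (Kuhn's algorithm).
        for v in successors[u]:
            if v in seen:
                continue
            seen.add(v)
            if matched_by[v] is None or try_augment(matched_by[v], seen):
                matched_by[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    unmatched = [v for v in range(n) if matched_by[v] is None]
    # A perfectly matched graph still needs one input; any node will do.
    return unmatched if unmatched else [0]
```

For a directed path `0 -> 1 -> 2 -> 3` a single driver node suffices, while a star with one hub and three leaves needs three, because only one outgoing hub edge can be in the matching.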

212. ❌ Memori: A Persistent Memory Layer for Efficient, Context-Aware LLM Agents

Authors: Luiz C. Borro, Luiz A. B. Macarini, Gordon Tindall, Michael Montero, Adam B. Struck Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19935v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	8.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	8.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

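Scoring rationale: The paper's core topic is a persistent memory layer for LLM agents, which is highly relevant to 'Large Language Models' and 'LLM Agents' (10 each). It achieves precise retrieval through structured conversational memory, which relates to 'Retrieval-Augmented Generation' (8). It addresses existing methods' reliance on large context windows, which relates to 'Context Window Extension' (8). Other keywords such as MoE, SFT, and RLHF are not covered and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes Memori, an LLM-agnostic persistent memory layer that converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. On the LoCoMo benchmark it reaches 81.95% accuracy while cutting tokens per query to about 5% of the full context, substantially reducing cost.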

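Abstract translation

As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer becomes essential for supporting context-aware behavior across LLMs and multi-session interactions. Existing approaches typically cause vendor lock-in and rely on injecting large amounts of raw conversation into prompts, which incurs high token costs and degrades performance. We present Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. On the LoCoMo benchmark, Memori reaches 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (about 5% of the full context). This yields substantial cost reductions: 67% fewer tokens than competing approaches and more than 20x savings compared with full-context methods. These results indicate that effective memory in LLM agents depends on structured representations rather than larger context windows, enabling scalable and cost-efficient deployment.

Abstract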

As large language models (LLMs) evolve into autonomous agents, persistent memory at the API layer is essential for enabling context-aware behavior across LLMs and multi-session interactions. Existing approaches force vendor lock-in and rely on injecting large volumes of raw conversation into prompts, leading to high token costs and degraded performance. We introduce Memori, an LLM-agnostic persistent memory layer that treats memory as a data structuring problem. Its Advanced Augmentation pipeline converts unstructured dialogue into compact semantic triples and conversation summaries, enabling precise retrieval and coherent reasoning. Evaluated on the LoCoMo benchmark, Memori achieves 81.95% accuracy, outperforming existing memory systems while using only 1,294 tokens per query (~5% of full context). This results in substantial cost reductions, including 67% fewer tokens than competing approaches and over 20x savings compared to full-context methods. These results show that effective memory in LLM agents depends on structured representations instead of larger context windows, enabling scalable and cost-efficient deployment.

Keywords: LLM agents, persistent memory, context-aware, semantic triples, conversation summaries, retrieval, token efficiency, cost reduction
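
The token figures quoted in the abstract can be cross-checked with simple arithmetic. The sketch below uses only the three numbers stated there; the derived values are rough implications, not results reported by the paper, and the function names are hypothetical. It also assumes cost is proportional to token count.

```python
# Figures quoted in the abstract.
TOKENS_PER_QUERY = 1294          # Memori tokens per query
FULL_CONTEXT_FRACTION = 0.05     # "~5% of full context" (rounded in the source)
REDUCTION_VS_COMPETING = 0.67    # "67% fewer tokens than competing approaches"


def implied_full_context_tokens():
    """Full-context size implied by 1,294 tokens being about 5% of it."""
    return TOKENS_PER_QUERY / FULL_CONTEXT_FRACTION


def implied_savings_factor():
    """Token ratio between full context and Memori (cost assumed ~ tokens)."""
    return implied_full_context_tokens() / TOKENS_PER_QUERY


def implied_competing_tokens():
    """Token budget of a competing approach implied by the 67% reduction."""
    return TOKENS_PER_QUERY / (1.0 - REDUCTION_VS_COMPETING)
```

With the rounded 5% figure the implied full context is about 25,880 tokens and the ratio is about 20, consistent with the abstract's "over 20x" if the true fraction is slightly below 5%. The implied competing budget is roughly 3,900 tokens per query.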

213. ❌ TAPAS: Efficient Two-Server Asymmetric Private Aggregation Beyond Prio(+)

Authors: Harish Karthikeyan, Antigoni Polychroniadou Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19949v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

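Scoring rationale: The TAPAS paper focuses on privacy-preserving aggregation protocols, which belong to cryptography and distributed systems and are entirely unrelated to the scoring keywords (all of which concern large models, deep learning techniques, and their applications). The paper does not address any large-model architecture, training, inference, alignment, application, or related technique.

!!! tip deepseek-chat TL;DR

The paper proposes TAPAS, an efficient two-server asymmetric private aggregation scheme that addresses the limitations of existing protocols in server communication cost, scalability, and security. Its core contributions include lattice-based zero-knowledge proofs and an identifiable-abort mechanism.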

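Abstract translation

Privacy-preserving aggregation is a cornerstone for AI systems that learn from distributed data without exposing individual records, especially in federated learning and telemetry. Existing two-server protocols (such as Prio and its successors) set a practical baseline by validating inputs while preventing any single party from learning users' data, but they impose symmetric costs on both servers, and their communication scales with the per-client input dimension $L$. In modern learning tasks, $L$ routinely reaches tens to hundreds of millions of model parameters.
We present TAPAS, a two-server asymmetric private aggregation scheme that addresses these limitations along four dimensions: (i) no trusted setup or preprocessing, (ii) server-side communication that is independent of $L$, (iii) post-quantum security based on standard lattice assumptions (LWE, SIS), and (iv) stronger robustness, with identifiable abort and full malicious security for the servers. A key design choice is deliberate asymmetry: one server carries the $O(L)$ aggregation and verification work, while the other acts as a lightweight facilitator whose computation is independent of $L$. This lowers total cost, lets the secondary server run on commodity hardware, and strengthens the non-collusion assumption between the servers. One of our main contributions is a suite of new and efficient lattice-based zero-knowledge proofs; to our knowledge, we are the first to establish privacy and correctness with identifiable abort in the two-server setting.

Abstract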

Privacy-preserving aggregation is a cornerstone for AI systems that learn from distributed data without exposing individual records, especially in federated learning and telemetry. Existing two-server protocols (e.g., Prio and successors) set a practical baseline by validating inputs while preventing any single party from learning users’ values, but they impose symmetric costs on both servers and communication that scales with the per-client input dimension $L$. Modern learning tasks routinely involve dimensionalities $L$ in the tens to hundreds of millions of model parameters. We present TAPAS, a two-server asymmetric private aggregation scheme that addresses these limitations along four dimensions: (i) no trusted setup or preprocessing, (ii) server-side communication that is independent of $L$ (iii) post-quantum security based solely on standard lattice assumptions (LWE, SIS), and (iv) stronger robustness with identifiable abort and full malicious security for the servers. A key design choice is intentional asymmetry: one server bears the $O(L)$ aggregation and verification work, while the other operates as a lightweight facilitator with computation independent of $L$. This reduces total cost, enables the secondary server to run on commodity hardware, and strengthens the non-collusion assumption of the servers. One of our main contributions is a suite of new and efficient lattice-based zero-knowledge proofs; to our knowledge, we are the first to establish privacy and correctness with identifiable abort in the two-server setting.

Keywords: private aggregation, two-server protocol, lattice-based cryptography, zero-knowledge proofs, identifiable abort, post-quantum security, federated learning, asymmetric computation
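
For readers unfamiliar with the two-server setting the abstract starts from, the sketch below shows plain additive secret sharing, the basic idea behind protocols in the Prio family. It is not TAPAS: it has no input validation, no zero-knowledge proofs, and no lattice-based components, and both servers do symmetric $O(L)$ work. The modulus and all function names are assumptions made for illustration.

```python
import secrets

P = 2**61 - 1  # prime modulus chosen for this sketch


def share(vector):
    """Split a client's vector into two additive shares modulo P."""
    share_a = [secrets.randbelow(P) for _ in vector]
    share_b = [(x - r) % P for x, r in zip(vector, share_a)]
    return share_a, share_b


def add_vectors(acc, vec):
    """Element-wise addition modulo P."""
    return [(a + v) % P for a, v in zip(acc, vec)]


def aggregate(client_vectors):
    """Simulate both servers: each sums its shares, then the sums are combined.

    Each share alone is uniformly random, so neither server learns an
    individual client's vector; only the combined total is revealed.
    """
    dim = len(client_vectors[0])
    sum_a, sum_b = [0] * dim, [0] * dim
    for vec in client_vectors:
        a, b = share(vec)
        sum_a = add_vectors(sum_a, a)  # work done by server A
        sum_b = add_vectors(sum_b, b)  # work done by server B
    return add_vectors(sum_a, sum_b)
```

In this baseline both servers handle a length-$L$ vector per client, which is the symmetric cost the abstract says TAPAS avoids by making one server a lightweight facilitator.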

214. ❌ Infinite-dimensional spherical-radial decomposition for probabilistic functions, with application to constrained optimal control and Gaussian process regression

Authors: Kewei Wang, Georg Stadler Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19907v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

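Scoring rationale: The paper focuses on estimating probabilistic functions, stochastic optimization, and numerical methods. It studies the extension of the spherical-radial decomposition (SRD) to infinite dimensions and its application to stochastic PDE optimal control and Gaussian process regression. All keywords concern large models, deep learning, AI techniques, or specific AI application areas, whereas this paper belongs to mathematics, optimization, and computational science and has no direct connection to AI or large-model techniques. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

The paper extends the spherical-radial decomposition to infinite stochastic dimensions and proposes a hybrid infinite-dimensional SRD method for unbiased, low-variance estimation of probabilistic functions and their gradients, with applications to chance-constrained stochastic PDE optimal control and kernel parameter optimization in Gaussian process regression.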

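Abstract translation

The spherical-radial decomposition (SRD) is an efficient method for estimating probabilistic functions, and their gradients, defined over finite-dimensional elliptical distributions. This work generalizes the SRD to infinite stochastic dimensions by combining subspace SRD with standard Monte Carlo methods. The resulting method, which we call hybrid infinite-dimensional SRD (hiSRD), provides an unbiased, low-variance estimator for convex sets arising in problems such as chance-constrained optimization. We analyze theoretically the variance of finite-dimensional SRD as the dimension increases, and show that the proposed hybrid method eliminates truncation-induced bias, reduces variance, and supports the computation of derivatives of probabilistic functions. We present comprehensive numerical studies: one for a risk-neutral stochastic PDE optimal control problem with joint chance state constraints, and one for optimizing kernel parameters in Gaussian process regression under the constraint that the posterior process satisfies joint chance constraints.

Abstract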

The spherical-radial decomposition (SRD) is an efficient method for estimating probabilistic functions and their gradients defined over finite-dimensional elliptical distributions. In this work, we generalize the SRD to infinite stochastic dimensions by combining subspace SRD with standard Monte Carlo methods. The resulting method, which we call hybrid infinite-dimensional SRD (hiSRD) provides an unbiased, low-variance estimator for convex sets arising, for instance, in chance-constrained optimization. We provide a theoretical analysis of the variance of finite-dimensional SRD as the dimension increases, and show that the proposed hybrid method eliminates truncation-induced bias, reduces variance, and allows the computation of derivatives of probabilistic functions. We present comprehensive numerical studies for a risk-neutral stochastic PDE optimal control problem with joint chance state constraints, and for optimizing kernel parameters in Gaussian process regression under the constraint that the posterior process satisfies joint chance constraints.

Keywords: spherical-radial decomposition, infinite-dimensional, probabilistic functions, chance-constrained optimization, stochastic PDE optimal control, Gaussian process regression, Monte Carlo methods, variance reduction
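
The finite-dimensional SRD that the paper builds on can be illustrated in two dimensions. It writes a standard Gaussian vector as a radius times a uniform direction, samples only the direction, and integrates the radius analytically. This is a toy sketch of the classical technique, not the hiSRD estimator. The half-space example, the function name, and the restriction to $b \ge 0$ (so the origin lies in the set) are assumptions of this illustration.

```python
import math
import random


def srd_halfspace_probability(a, b, n_samples, seed=0):
    """Estimate P(a . xi <= b) for xi ~ N(0, I_2) with b >= 0.

    For each uniform direction v, the ray {r * v : r >= 0} stays in the set
    up to an exit radius; the radial part is integrated exactly with the
    chi distribution with 2 degrees of freedom, F(r) = 1 - exp(-r^2 / 2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        v = (math.cos(theta), math.sin(theta))
        slope = a[0] * v[0] + a[1] * v[1]
        if slope <= 0.0:
            total += 1.0  # the ray never leaves the half-space
        else:
            r_exit = b / slope
            total += 1.0 - math.exp(-0.5 * r_exit * r_exit)
    return total / n_samples
```

For `a = (1, 0)` and `b = 1` the exact value is the standard normal CDF at 1, about 0.8413. Every sample contributes a value between roughly 0.39 and 1 rather than 0 or 1, which is the source of the variance reduction over plain Monte Carlo.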

215. ❌ Discovery of Decision Synchronization Patterns from Event Logs

Authors: Tijmen Kuijpers, Karolin Winter, Remco Dijkman Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19879v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

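Scoring rationale: The paper studies the discovery of decision synchronization patterns in business processes, which belongs to traditional process mining and involves no large models, deep learning, AI for Science, or any technique named in the scoring keywords. It focuses on event log analysis, process models, and constraint discovery, none of which falls within the technical scope of the keywords.

!!! tip deepseek-chat TL;DR

The paper proposes a method for discovering decision synchronization patterns in business processes from event logs, and shows through formalization and evaluation that the method reliably identifies four specific patterns.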

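Abstract translation

Synchronizing decisions between running cases in business processes supports fair and efficient use of resources, helps prioritize the most valuable cases, and avoids unnecessary waiting. Decision synchronization patterns are therefore regularly built into processes in the form of mechanisms that temporarily delay one case in favor of another. Such decision mechanisms must consider the properties of multiple cases at once rather than those of a single case, an aspect that current process discovery techniques rarely address. To fill this gap, this paper proposes an approach, inspired by supply chain processes, for discovering decision synchronization patterns. These patterns take the form of specific process constructs combined with a constraint that determines which particular case to execute. We describe, formalize, and demonstrate how the constraints of four such patterns can be discovered. We evaluate the approach in two simulated scenarios. First, using four separate process models that each contain a single decision synchronization pattern, we verify that the approach discovers every pattern type when only that type is present. Second, using a process model that contains all four patterns, we show that the approach generalizes to more complex problems. In both scenarios, we reliably retrieve the expected patterns.

Abstract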

Synchronizing decisions between running cases in business processes facilitates fair and efficient use of resources, helps prioritize the most valuable cases, and prevents unnecessary waiting. Consequently, decision synchronization patterns are regularly built into processes, in the form of mechanisms that temporarily delay one case to favor another. These decision mechanisms therefore consider properties of multiple cases at once, rather than just the properties of a single case; an aspect that is rarely addressed by current process discovery techniques. To address this gap, this paper proposes an approach for discovering decision synchronization patterns inspired by supply chain processes. These decision synchronization patterns take the form of specific process constructs combined with a constraint that determines which particular case to execute. We describe, formalize and demonstrate how the constraint for four such patterns can be discovered. We evaluate our approach in two artificial scenarios. First, with four separate process models each containing a single decision synchronization pattern, i.e., we demonstrate that our approach can discover every type of pattern when only this one type is present. Second, we consider a process model containing all four decision synchronization patterns to show generalizability of the approach to more complex problems. For both scenarios, we could reliably retrieve the expected patterns.

Keywords: decision synchronization patterns, event logs, process discovery, business processes, supply chain processes, process constructs, constraint discovery, process models

216. ❌ Deep Autocorrelation Modeling for Time-Series Forecasting: Progress and Prospects

Authors: Hao Wang, Licheng Pan, Qingsong Wen, Jialin Yu, Zhichao Chen, Chunyuan Zheng, Xiaoxi Li, Zhixuan Chu, Chao Xu, Mingming Gong, Haoxuan Li, Yuan Lu, Zhouchen Lin, Philip Torr, Yan Liu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19899v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

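Scoring rationale: The paper is a systematic survey of autocorrelation modeling in deep time-series forecasting. It focuses on conventional deep learning architectures (such as RNNs, CNNs, and Transformers) applied to time-series analysis and does not involve large language models (LLMs), large-model techniques, or concrete AI for Science applications. All scoring keywords relate to large-model techniques, training methods, inference optimization, alignment, agent systems, or scientific AI applications, whereas this paper's subject is a survey of deep learning models for time-series forecasting, which has no direct connection to these keywords.

!!! tip deepseek-chat TL;DR

The paper systematically surveys research progress on autocorrelation modeling in deep time-series forecasting, proposes a new taxonomy covering model architectures and learning objectives, and analyzes the field's evolution from an autocorrelation perspective.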

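Abstract translation

Autocorrelation is a defining characteristic of time-series data, in which each observation is statistically dependent on its predecessors. In deep time-series forecasting, autocorrelation is present in both the input history sequence and the label sequence, which gives rise to two central research challenges: (1) designing neural architectures that model autocorrelation in history sequences, and (2) designing learning objectives that model autocorrelation in label sequences. Recent studies have made progress on these challenges, but a systematic survey that examines both aspects is still lacking. To fill this gap, this paper comprehensively reviews deep time-series forecasting from the perspective of autocorrelation modeling. Compared with existing surveys, it makes two distinctive contributions. First, it proposes a novel taxonomy that covers recent literature on both model architectures and learning objectives, the latter of which prior surveys neglect or discuss inadequately. Second, it offers a thorough analysis of the motivations, insights, and progression of the surveyed literature from a unified, autocorrelation-centric perspective, giving a holistic overview of the evolution of deep time-series forecasting. The full list of papers and related resources is available at https://github.com/Master-PLC/Awesome-TSF-Papers.

Abstract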

Autocorrelation is a defining characteristic of time-series data, where each observation is statistically dependent on its predecessors. In the context of deep time-series forecasting, autocorrelation arises in both the input history and the label sequences, presenting two central research challenges: (1) designing neural architectures that model autocorrelation in history sequences, and (2) devising learning objectives that model autocorrelation in label sequences. Recent studies have made strides in tackling these challenges, but a systematic survey examining both aspects remains lacking. To bridge this gap, this paper provides a comprehensive review of deep time-series forecasting from the perspective of autocorrelation modeling. In contrast to existing surveys, this work makes two distinctive contributions. First, it proposes a novel taxonomy that encompasses recent literature on both model architectures and learning objectives – whereas prior surveys neglect or inadequately discuss the latter aspect. Second, it offers a thorough analysis of the motivations, insights, and progression of the surveyed literature from a unified, autocorrelation-centric perspective, providing a holistic overview of the evolution of deep time-series forecasting. The full list of papers and resources is available at https://github.com/Master-PLC/Awesome-TSF-Papers.

Keywords: time-series forecasting, autocorrelation modeling, deep learning, neural architectures, learning objectives, survey, taxonomy, temporal dependencies
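
The statistical dependence on earlier observations that the abstract calls autocorrelation is usually measured with the sample autocorrelation function. The sketch below computes the standard biased estimator; the function name is a choice made for this illustration and nothing here comes from the surveyed models.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag.

    Uses the common estimator that normalizes every lag by the lag-0 sum of
    squares, so the value at lag 0 is exactly 1.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
        for lag in range(max_lag + 1)
    ]
```

For a series that alternates between 1 and -1, the result is close to -1 at lag 1 and close to +1 at lag 2, showing how strongly each value is determined by its predecessors.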

217. ❌ Minimax Generalized Cross-Entropy

Authors: Kartheek Bondugula, Santiago Mazuelas, Aritz Pérez, Anqi Liu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19874v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

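Scoring rationale: The paper focuses on loss function optimization in supervised classification. It proposes a new Minimax Generalized Cross-Entropy (MGCE) loss that aims to resolve the non-convex optimization problem of existing generalized cross-entropy and to improve robustness and convergence speed. All scoring keywords relate directly to large models, deep learning methodology, or AI applications in science, whereas this paper studies loss function theory in traditional machine learning and touches none of the keyword areas, such as large-model architectures, training methods, inference optimization, alignment, agent systems, or scientific AI applications. The relevance of every keyword is therefore 0.

!!! tip deepseek-chat TL;DR

This paper proposes a new Minimax Generalized Cross-Entropy (MGCE) loss that resolves the non-convexity of existing generalized cross-entropy through convex optimization. On benchmark datasets it achieves higher accuracy, faster convergence, and better calibration, particularly in the presence of label noise.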

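Abstract translation

Loss functions play a central role in supervised classification. The cross-entropy loss is widely used, whereas the mean absolute error loss offers robustness but is difficult to optimize. Generalized cross-entropy, which interpolates between the cross-entropy and mean absolute error losses, was recently proposed to balance optimization difficulty against robustness. Existing formulations of generalized cross-entropy lead to a non-convex optimization problem over classification margins that is prone to underfitting and performs poorly on complex datasets. This paper proposes a minimax formulation of generalized cross-entropy that results in a convex optimization problem over classification margins. We further show that this minimax formulation provides an upper bound on the classification error. The proposed bilevel convex optimization can be implemented efficiently using stochastic gradients computed via implicit differentiation. Experiments on benchmark datasets show that the method achieves strong accuracy together with faster convergence and better calibration, especially in the presence of label noise.

Abstract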

Loss functions play a central role in supervised classification. Cross-entropy (CE) is widely used, whereas the mean absolute error (MAE) loss can offer robustness but is difficult to optimize. Interpolating between the CE and MAE losses, generalized cross-entropy (GCE) has recently been introduced to provide a trade-off between optimization difficulty and robustness. Existing formulations of GCE result in a non-convex optimization over classification margins that is prone to underfitting, leading to poor performances with complex datasets. In this paper, we propose a minimax formulation of generalized cross-entropy (MGCE) that results in a convex optimization over classification margins. Moreover, we show that MGCEs can provide an upper bound on the classification error. The proposed bilevel convex optimization can be efficiently implemented using stochastic gradient computed via implicit differentiation. Using benchmark datasets, we show that MGCE achieves strong accuracy, faster convergence, and better calibration, especially in the presence of label noise.

Keywords: loss functions, generalized cross-entropy, minimax formulation, convex optimization, robustness, label noise, classification margins, stochastic gradient
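
The interpolation between CE and MAE mentioned in the abstract can be illustrated with the commonly used GCE form, L_q(p) = (1 - p^q) / q, where p is the probability assigned to the true class. This is a minimal sketch under the assumption that this is the GCE variant meant; the paper's minimax formulation (MGCE) is not reproduced here, and the function name is hypothetical.

```python
import math


def generalized_cross_entropy(p_true: float, q: float) -> float:
    """Commonly used GCE loss (1 - p^q) / q for the true-class probability.

    q -> 0 recovers cross-entropy, -log(p); q = 1 gives 1 - p, which is
    proportional to the MAE between a one-hot label and the prediction.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    if q == 0.0:
        # Limiting case: cross-entropy.
        return -math.log(p_true)
    return (1.0 - p_true ** q) / q


# Loss for a poorly predicted sample at three interpolation settings.
p = 0.3
losses = {q: generalized_cross_entropy(p, q) for q in (0.0, 0.5, 1.0)}
```

Smaller `q` penalizes low-confidence samples more sharply (closer to CE), while larger `q` bounds the loss and so tends to be more tolerant of mislabeled samples.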

218. ❌ On the Dynamics & Transferability of Latent Generalization during Memorization

Authors: Simran Ketha, Venkatakrishnan Ramaswamy Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19865v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

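Scoring rationale: The paper studies memorization in deep networks trained on data with shuffled labels, and the latent generalization ability such networks retain. This is foundational deep learning theory and has no direct connection to any of the keywords, all of which focus on large-model techniques, applications, and optimization methods.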

!!! tip deepseek-chat TL;DR

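The paper studies the training dynamics by which deep networks that memorize shuffled-label data still retain latent generalization ability in their internal representations, and develops a linear probe to extract that ability and transfer it to the model.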

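Abstract translation

Deep networks are known for their remarkable generalization abilities, but the underlying mechanisms are not yet well understood. It is also known that, even when labels in the training data are shuffled to varying degrees, deep networks trained with standard methods can still reach perfect or high accuracy on the corrupted training data. This phenomenon is called memorization, and it typically comes at the cost of poorer generalization to true labels. Our recent work showed that the internal representations of such models retain latent generalization abilities far better than what the model itself exhibits. Specifically, prior work demonstrated that this latent generalization can be recovered by applying simple probes (called MASC probes) to the model's layer-wise representations. However, the origin of this latent generalization during memorization, and its dynamics over training, remain unclear. Here we track the training dynamics empirically and find that latent generalization, like model generalization, largely peaks early in training. We then investigate to what extent the specific nature of the MASC probe determines our ability to extract latent generalization from the model's layer-wise outputs. To this end, we first analyze the mathematical structure of the MASC probe and show that it is a quadratic classifier, i.e. a non-linear one. This raises a key question: to what extent can latent generalization be linearly decoded from layer-wise outputs? To investigate, we design a new linear probe for this setting. Finally, we ask whether latent generalization can be transferred to model generalization by directly editing model weights. We propose a method that uses the new linear probe to transfer the latent generalization present in last-layer representations to the model itself.

Abstract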

Deep networks have been known to have extraordinary generalization abilities, via mechanisms that aren’t yet well understood. It is also known that upon shuffling labels in the training data to varying degrees, deep networks, trained with standard methods, can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and typically comes at the cost of poorer generalization to true labels. Our recent work has demonstrated, that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, the origin and dynamics over training of this latent generalization during memorization is not well understood. Here, we track the training dynamics, empirically, and find that latent generalization abilities largely peak early in training, with model generalization. Next, we investigate to what extent the specific nature of the MASC probe is critical for our ability to extract latent generalization from the model’s layerwise outputs. To this end, we first examine the mathematical structure of the MASC probe and show that it is a quadratic classifier, i.e. is non-linear. This brings up the question of the extent to which this latent generalization might be linearly decodable from layerwise outputs. To investigate this, we designed a new linear probe for this setting. Next, we consider the question of whether it is possible to transfer latent generalization to model generalization by directly editing model weights. To this end, we devise a way to transfer the latent generalization present in last-layer representations to the model using the new linear probe.

Keywords: deep networks, memorization, latent generalization, training dynamics, linear probe, model generalization, internal representations, MASC probe

219. ❌ NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing

Authors: Raphael Simon, José Carrasquel, Wim Mees, Pieter Libin Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19864v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

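Scoring rationale: The paper focuses on applying reinforcement learning to penetration testing in network security and develops the NASimJax framework to accelerate training. All keywords relate to large models, the technical principles of deep learning, or AI for Science, but the paper does not involve large models or deep learning techniques and has only a weak connection to AI for Science (network security counts as an AI application in the broad sense). 'AI for Science' therefore receives 5 points and all other keywords receive 0.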

!!! tip deepseek-chat TL;DR

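The paper presents NASimJax, a GPU-accelerated JAX rewrite of the Network Attack Simulator that raises environment throughput by up to 100x. It targets the slow training and poor generalization of reinforcement learning policies for penetration testing, and studies action-space scaling and cross-network generalization.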

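Abstract translation

Penetration testing, the practice of identifying vulnerabilities by simulating cyberattacks, is a complex sequential decision-making task that is inherently partially observable and has large action spaces. Training reinforcement learning policies in this domain faces a fundamental bottleneck: existing simulators are too slow to train at scale on realistic network scenarios, so the resulting policies generalize poorly. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim) whose environment throughput is up to 100x higher than that of the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax makes it possible to experiment on larger networks under a fixed compute budget, which was previously infeasible. We formulate automated penetration testing as a Contextual POMDP (contextual partially observable Markov decision process) and introduce a network generation pipeline that produces structurally diverse, guaranteed-solvable scenarios. Together these provide a principled basis for studying zero-shot policy generalization. We use the framework to study action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay handles dense training distributions better than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen in training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay's episode-reset behaviour and the credit assignment structure of 2SAS. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing research.

Abstract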

Penetration testing, the practice of simulating cyberattacks to identify vulnerabilities, is a complex sequential decision-making task that is inherently partially observable and features large action spaces. Training reinforcement learning (RL) policies for this domain faces a fundamental bottleneck: existing simulators are too slow to train on realistic network scenarios at scale, resulting in policies that fail to generalize. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim), achieving up to 100x higher environment throughput than the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax enables experimentation on larger networks under fixed compute budgets that were previously infeasible. We formulate automated penetration testing as a Contextual POMDP and introduce a network generation pipeline that produces structurally diverse and guaranteed-solvable scenarios. Together, these provide a principled basis for studying zero-shot policy generalization. We use the framework to investigate action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay better handles dense training distributions than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen during training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay’s episode-reset behaviour and 2SAS’s credit assignment structure. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing.

Keywords: penetration testing, reinforcement learning, GPU acceleration, JAX, network simulation, policy generalization, action space scaling, Contextual POMDP
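
The contrast between flat action masking and a two-stage decomposition can be sketched as follows. This is only one plausible reading of "two-stage" (pick a host first, then an action on that host); the abstract does not specify how 2SAS is built, and every name below is hypothetical. Greedy selection over fixed scores stands in for sampling from a learned policy.

```python
from typing import List, Tuple

Mask = List[List[bool]]        # mask[host][action] is True if the action is valid
Scores = List[List[float]]     # scores[host][action], e.g. policy logits


def flat_select(scores: Scores, mask: Mask) -> Tuple[int, int]:
    """Flat masking: one head scores every (host, action) pair (H * A outputs)."""
    valid = [(scores[h][a], h, a)
             for h, row in enumerate(mask)
             for a, ok in enumerate(row) if ok]
    if not valid:
        raise ValueError("no valid action")
    _, host, action = max(valid)
    return host, action


def two_stage_select(host_scores: List[float], action_scores: Scores,
                     mask: Mask) -> Tuple[int, int]:
    """Two-stage selection: choose a host, then an action valid on that host.

    The two heads need H + A outputs instead of H * A (an assumption about
    the decomposition, not the paper's stated architecture).
    """
    valid_hosts = [h for h, row in enumerate(mask) if any(row)]
    if not valid_hosts:
        raise ValueError("no valid action")
    host = max(valid_hosts, key=lambda h: host_scores[h])
    valid_actions = [a for a, ok in enumerate(mask[host]) if ok]
    action = max(valid_actions, key=lambda a: action_scores[host][a])
    return host, action


mask = [[True, False], [False, False], [True, True]]
action_scores = [[0.2, 0.8], [0.5, 0.5], [0.3, 0.7]]
host_scores = [0.1, 0.9, 0.5]   # host 1 scores highest but has no valid action
choice = two_stage_select(host_scores, action_scores, mask)
```

In the two-stage version the host with the highest score is skipped when none of its actions are valid, which is the role the mask plays in the first stage.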

220. ❌ Modeling subgrid scale production rates on complex meshes using graph neural networks

Authors: Priyabrat Dash, Mathis Bode, Konduri Aditya Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19841v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

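Scoring rationale: The paper uses a graph neural network (GNN) to address the closure problem in large-eddy simulation (LES), which places it in computational fluid dynamics and scientific computing. Although it applies deep learning (a GNN), the keywords all relate directly to large language models (LLMs) and associated techniques (MoE, SFT, RLHF, RAG, agents, and so on) or to specific large-model techniques (quantization, attention mechanisms). The paper does not involve LLMs, foundation models, or instruction tuning in any form. The only somewhat relevant keyword is 'AI for Science OR Bioinformatics OR Cheminformatics': the paper applies AI to scientific computing (computational fluid dynamics), but it is neither bioinformatics nor cheminformatics, so it receives 5 points (some relevance). All other keywords are unrelated and receive 0.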

!!! tip deepseek-chat TL;DR

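The study develops a graph neural network (GNN) that predicts filtered species production rates for turbulent premixed flames on non-uniform meshes, addressing the closure problem in large-eddy simulation (LES). Across hydrogen blend ratios and filter widths, it shows lower errors than the compared baselines and better generalization.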

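Abstract translation

Large-eddy simulations (LES) require closure models for filtered production rates, because the resolved fields do not contain all the correlations that govern chemical source terms. This study develops a graph neural network (GNN) that predicts filtered species production rates on non-uniform meshes from filtered mass fractions and temperature. The data come from direct numerical simulations of turbulent premixed hydrogen-methane jet flames with hydrogen fractions of 10%, 50%, and 80%. All fields are Favre filtered with a filter width matched to the operating mesh, and learning is performed on subdomain graphs built from mesh-point connectivity. A compact set of reactant, intermediate, and product species is selected, and their filtered production rates serve as the learning targets. The model is trained on the 10% and 80% blends and evaluated on the unseen 50% blend to test cross-composition generalization. The GNN is compared with two approaches: an unclosed reference that evaluates rates directly at the filtered state, and a convolutional neural network baseline that requires remeshing. In both in-distribution and out-of-distribution cases, the GNN yields lower errors and statistics that agree more closely with the reference data. The model also generalizes robustly across filter widths, keeping errors bounded at coarser spatial resolutions without retraining. Tests on a backward-facing step configuration further confirm its predictive performance on a practically relevant geometry. These results highlight the potential of GNNs as robust data-driven closure models for LES on complex meshes.

Abstract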

Large-eddy simulations (LES) require closures for filtered production rates because the resolved fields do not contain all correlations that govern chemical source terms. We develop a graph neural network (GNN) that predicts filtered species production rates on non-uniform meshes from inputs of filtered mass fractions and temperature. Direct numerical simulations of turbulent premixed hydrogen-methane jet flames with hydrogen fractions of 10%, 50%, and 80% provide the dataset. All fields are Favre filtered with the filter width matched to the operating mesh, and learning is performed on subdomain graphs constructed from mesh-point connectivity. A compact set of reactants, intermediates, and products is used, and their filtered production rates form the targets. The model is trained on 10% and 80% blends and evaluated on the unseen 50% blend to test cross-composition generalization. The GNN is compared against an unclosed reference that evaluates rates at the filtered state, and a convolutional neural network baseline that requires remeshing. Across in-distribution and out-of-distribution cases, the GNN yields lower errors and closer statistical agreement with the reference data. Furthermore, the model demonstrates robust generalization across varying filter widths without retraining, maintaining bounded errors at coarser spatial resolutions. A backward facing step configuration further confirms prediction efficacy on a practically relevant geometry. These results highlight the capability of GNNs as robust data-driven closure models for LES on complex meshes.

Keywords: graph neural networks, large-eddy simulations, filtered production rates, turbulent premixed flames, non-uniform meshes, data-driven closure models, generalization, computational fluid dynamics
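
Favre (density-weighted) filtering, which the abstract applies to all fields, is defined as the filtered product of density and the field divided by the filtered density. A minimal sketch on a uniform 1D grid with a box filter follows; the box filter, the uniform grid, and the function names are assumptions for illustration, whereas the paper works on non-uniform meshes with a filter width matched to the mesh.

```python
from typing import List


def box_filter(values: List[float], width: int) -> List[float]:
    """Moving average with an odd window, truncated at the domain edges."""
    half = width // 2
    filtered = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        filtered.append(sum(window) / len(window))
    return filtered


def favre_filter(rho: List[float], phi: List[float], width: int) -> List[float]:
    """Favre filter: filtered(rho * phi) / filtered(rho)."""
    rho_bar = box_filter(rho, width)
    rho_phi_bar = box_filter([r * p for r, p in zip(rho, phi)], width)
    return [rp / r for rp, r in zip(rho_phi_bar, rho_bar)]


# With variable density the Favre average is pulled toward the denser cell.
rho = [1.0, 2.0]
phi = [0.0, 1.0]
plain = box_filter(phi, 3)
favre = favre_filter(rho, phi, 3)
```

With constant density the two filters coincide, so the density weighting only matters where density varies across the filter window, as it does across a flame front.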

221. ❌ Explainable cluster analysis: a bagging approach

Authors: Federico Maria Quetti, Elena Ballante, Silvia Figini, Paolo Giudici Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19840v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	8.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

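Scoring rationale: The paper focuses on explainability for cluster analysis and proposes an ensemble clustering framework based on bagging and feature dropout. It is unrelated to most keywords, which concern large models, deep learning, training techniques, inference optimization, and similar topics. The only relevant keyword is 'Mechanistic Interpretability OR Explainable AI': the core of the paper is the explainability of clustering methods, which it addresses by explaining clustering results through feature importance scores, in line with the goals of explainable AI. The paper does not address the interpretability of deep learning or large models, so the relevance is 8 points (some relevance, but not central).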

!!! tip deepseek-chat TL;DR

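To address the lack of explainability in clustering methods, the paper proposes an ensemble clustering framework based on bagging and feature dropout. It generates feature importance scores from the mutual information between features and cluster labels, improving clustering stability while making the cluster structure explainable.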

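Abstract translation

A major limitation of clustering methods is their lack of explainability: existing methods rarely reveal which features drive the grouping of similar observations. To address this limitation, we propose an ensemble-based clustering framework that combines bagging with feature dropout to generate feature importance scores, in analogy with feature importance mechanisms in supervised random forests. By using multiple bootstrap resampling schemes and aggregating the resulting partitions, the method improves the stability and robustness of the cluster definition, particularly in small-sample or noisy settings. Feature importance is assessed with an information-theoretic approach: at each step, the mutual information between each feature and the estimated cluster labels is computed and weighted by a clustering validity measure to emphasize well-formed partitions, and the results are aggregated into a final score. The method outputs both a consensus partition and a corresponding feature importance measure, giving a unified interpretation of clustering structure and variable relevance. Its effectiveness is demonstrated on multiple simulated and real-world datasets.

Abstract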

A major limitation of clustering approaches is their lack of explainability: methods rarely provide insight into which features drive the grouping of similar observations. To address this limitation, we propose an ensemble-based clustering framework that integrates bagging and feature dropout to generate feature importance scores, in analogy with feature importance mechanisms in supervised random forests. By leveraging multiple bootstrap resampling schemes and aggregating the resulting partitions, the method improves stability and robustness of the cluster definition, particularly in small-sample or noisy settings. Feature importance is assessed through an information-theoretic approach: at each step, the mutual information between each feature and the estimated cluster labels is computed and weighted by a measure of clustering validity to emphasize well-formed partitions, before being aggregated into a final score. The method outputs both a consensus partition and a corresponding measure of feature importance, enabling a unified interpretation of clustering structure and variable relevance. Its effectiveness is demonstrated on multiple simulated and real-world datasets.

Keywords: clustering, explainability, bagging, feature importance, ensemble methods, mutual information, cluster analysis, interpretability
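
The scoring step described in the abstract (mutual information between each feature and the cluster labels, weighted by a validity measure and aggregated) can be sketched as below. This is an illustration only: it assumes discrete features, takes the partitions and validity weights as given, and uses a weighted mean as the aggregation rule. The bootstrap resampling, feature dropout, and the paper's actual validity measure are not reproduced, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def mutual_information(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (count_x[x] * count_y[y]))
               for (x, y), c in count_xy.items())


def feature_importance(features: Dict[str, List[int]],
                       partitions: List[List[int]],
                       validity: List[float]) -> Dict[str, float]:
    """Validity-weighted mean of MI(feature, labels) over several partitions."""
    total = sum(validity)
    return {
        name: sum(w * mutual_information(values, labels)
                  for labels, w in zip(partitions, validity)) / total
        for name, values in features.items()
    }


features = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
partitions = [[0, 0, 1, 1], [1, 1, 0, 0]]   # two runs, same split, labels swapped
scores = feature_importance(features, partitions, validity=[1.0, 0.5])
```

Feature `a` determines the clusters and scores log 2, while feature `b` is independent of them and scores 0; mutual information is unaffected by the label swap between the two runs.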

222. ❌ GDEGAN: Gaussian Dynamic Equivariant Graph Attention Network for Ligand Binding Site Prediction

Authors: Animesh, Plaban Kumar Bhowmick, Pralay Mitra Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19817v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

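Scoring rationale: The paper uses graph neural networks (GNNs) to predict protein binding sites, which places it in bioinformatics. The keywords all relate directly to large language model (LLM) techniques, training methods, inference optimization, alignment, agent systems, and so on, whereas the core of this paper is applying GNNs to scientific computing. It is highly relevant only to the last keyword, 'AI for Science OR Bioinformatics OR Cheminformatics', and touches none of the others, so every keyword except the last receives 0.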

!!! tip deepseek-chat TL;DR

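The paper proposes GDEGAN, a graph attention network for predicting ligand binding sites on proteins, and reports relative improvements over existing methods of 37-66% in DCC and 7-19% in DCA success rates across the COACH420, HOLO4k, and PDBBind2020 datasets.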

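Abstract translation

Accurately predicting the binding sites on a given protein to which ligands can bind is a critical step in structure-based computational drug discovery. Recently, as protein databases and AlphaFold predictions have made 3D protein structures available at large scale, equivariant graph neural networks (GNNs) have emerged as a powerful paradigm for binding site identification. State-of-the-art equivariant GNN methods use dot-product attention, which disregards the variation in the chemical and geometric properties of neighboring residues. To capture this variation, we propose GDEGAN (Gaussian Dynamic Equivariant Graph Attention Network), which replaces dot-product attention with adaptive kernels for recognizing binding sites. The proposed attention mechanism captures variation among neighboring residues through statistics of their local feature distributions. It dynamically computes neighborhood statistics at each layer, using local variance as an adaptive bandwidth parameter together with learnable per-head temperatures, so that each protein region can determine its own context-specific importance. On the COACH420, HOLO4k, and PDBBind2020 datasets, GDEGAN achieves relative improvements over existing methods of 37-66% in DCC and 7-19% in DCA success rates. These advances apply directly to accelerating protein-ligand docking, by identifying potential binding sites for therapeutic target identification.

Abstract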

Accurate prediction of binding sites of a given protein, to which ligands can bind, is a critical step in structure-based computational drug discovery. Recently, Equivariant Graph Neural Networks (GNNs) have emerged as a powerful paradigm for binding site identification methods due to the large-scale availability of 3D structures of proteins via protein databases and AlphaFold predictions. The state-of-the-art equivariant GNN methods implement dot product attention, disregarding the variation in the chemical and geometric properties of the neighboring residues. To capture this variation, we propose GDEGAN (Gaussian Dynamic Equivariant Graph Attention Network), which replaces dot-product attention with adaptive kernels that recognize binding sites. The proposed attention mechanism captures variation in neighboring residues using statistics of their characteristic local feature distributions. Our mechanism dynamically computes neighborhood statistics at each layer, using local variance as an adaptive bandwidth parameter with learnable per-head temperatures, enabling each protein region to determine its own context-specific importance. GDEGAN outperforms existing methods with relative improvements of 37-66% in DCC and 7-19% DCA success rates across COACH420, HOLO4k, and PDBBind2020 datasets. These advances have direct application in accelerating protein-ligand docking by identifying potential binding sites for therapeutic target identification.

Keywords: Binding site prediction, Equivariant Graph Neural Networks, Gaussian Dynamic Attention, Protein-ligand docking, Drug discovery, Bioinformatics, Computational biology, AlphaFold
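
The idea of replacing dot-product attention with a kernel whose bandwidth follows the local variance can be sketched for a single residue and its neighbors. This is a loose illustration, not GDEGAN's layer: the exact kernel, how variance and temperature enter it, and the equivariant parts of the network are not given in the abstract, so the formula and all names below are assumptions.

```python
import math
from typing import List


def gaussian_attention(center: List[float], neighbors: List[List[float]],
                       temperature: float = 1.0, eps: float = 1e-8) -> List[float]:
    """Attention weights from a Gaussian kernel with variance-based bandwidth.

    weight_j is proportional to exp(-||center - x_j||^2 / (2 * var * temperature)),
    where var is the variance of the neighbor features around their mean
    (hypothetical choice of neighborhood statistic).
    """
    dim = len(center)
    count = len(neighbors)
    mean = [sum(x[k] for x in neighbors) / count for k in range(dim)]
    variance = sum((x[k] - mean[k]) ** 2 for x in neighbors for k in range(dim)) / count
    bandwidth = 2.0 * (variance + eps) * temperature
    logits = [-sum((center[k] - x[k]) ** 2 for k in range(dim)) / bandwidth
              for x in neighbors]
    # Softmax with max-subtraction for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


neighbors = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
weights = gaussian_attention([0.0, 0.0], neighbors)
flatter = gaussian_attention([0.0, 0.0], neighbors, temperature=10.0)
```

A spread-out neighborhood widens the kernel and a tight one narrows it, so the same feature distance can count as near or far depending on local context; a higher temperature flattens the weights further.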

223. ❌ FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19835v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	10.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	10.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

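Scoring rationale: The paper proposes FIPO, a reinforcement learning algorithm designed specifically to address bottlenecks of large language models (LLMs) on reasoning tasks. The algorithm improves policy optimization by introducing a future-KL divergence term, giving finer-grained credit assignment and thereby improving performance on complex reasoning tasks such as chain-of-thought and multi-step reasoning. It is therefore highly relevant (10 points) to 'Large Language Models', 'RLHF', 'Chain of Thought', and 'System 2 Thinking', since these keywords correspond directly to the paper's core object of study (LLMs), method (reinforcement learning optimization), and task (deep reasoning). Other keywords such as MoE, SLMs, RAG, and quantization are unrelated to the paper and receive 0.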

!!! tip deepseek-chat TL;DR

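The paper proposes FIPO, a reinforcement learning algorithm that improves the reasoning ability of large language models by incorporating a discounted future-KL divergence term. On Qwen2.5-32B it extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and raises AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0%.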

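Abstract translation

This paper presents Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. Although GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM), which distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling because it cannot distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, building a dense advantage formulation that re-weights tokens according to their influence on subsequent trajectory behavior. Experiments show that FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a key path for advancing ORM-based algorithms and unlocking the full reasoning potential of base models. We open-source our training system, built on the verl framework.

Abstract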

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.

Keywords: FIPO, reinforcement learning, large language models, reasoning bottlenecks, chain-of-thought, policy optimization, future-KL divergence, credit assignment
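
The step from a uniform outcome advantage to a dense, per-token one can be sketched with a discounted sum of future per-token KL values. This is an illustration under stated assumptions rather than FIPO's update rule: the abstract does not give the formula, so the mean-normalized reweighting, the non-negative per-token KL inputs, and the function names are all hypothetical.

```python
from typing import List


def discounted_future_sum(values: List[float], gamma: float) -> List[float]:
    """out[t] = sum over k >= t of gamma^(k - t) * values[k], via a reverse scan."""
    out = [0.0] * len(values)
    running = 0.0
    for t in range(len(values) - 1, -1, -1):
        running = values[t] + gamma * running
        out[t] = running
    return out


def dense_advantages(outcome_advantage: float, per_token_kl: List[float],
                     gamma: float = 0.9) -> List[float]:
    """Re-weight one trajectory-level advantage by discounted future KL.

    Weights are normalized to mean 1, so the average advantage over the
    trajectory is unchanged (a hypothetical normalization choice).
    """
    future = discounted_future_sum(per_token_kl, gamma)
    mean_future = sum(future) / len(future)
    if mean_future == 0.0:
        # No signal: fall back to the uniform assignment.
        return [outcome_advantage] * len(future)
    return [outcome_advantage * f / mean_future for f in future]


# Tokens followed by larger policy shifts receive more of the credit.
future = discounted_future_sum([1.0, 0.0, 2.0], gamma=0.5)
advantages = dense_advantages(2.0, [1.0, 0.0, 2.0], gamma=0.5)
```

With all per-token KL values at zero the function returns the uniform assignment that the abstract attributes to outcome-based rewards, which makes the dense version a strict generalization in this sketch.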

224. ❌ Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study

Authors: Danya Li, Yan Feng, Rico Krueger Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19812v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

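Scoring rationale: The paper studies a virtual-reality-based pedestrian trajectory prediction model (GazeX-LSTM) that combines eye tracking with contextual information, applied to safe interaction with automated shuttles in shared spaces. Its core is computer vision, behavior modeling, and automated-driving applications; it does not involve large language models (LLMs), innovations in the technical principles of deep learning, or any specific AI-for-Science subfield. All scoring keywords relate directly to large models, deep learning techniques, or scientific AI applications, none of which this paper touches, so every keyword receives a relevance of 0.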

!!! tip deepseek-chat TL;DR

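Through a virtual reality study, the paper proposes GazeX-LSTM, a pedestrian trajectory prediction model that combines eye gaze data with contextual information, to improve the safety and prediction accuracy of interactions between automated shuttles and pedestrians in shared spaces.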

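Abstract translation

Integrating automated shuttles into shared urban spaces poses unique challenges because of the absence of traffic rules and the complexity of pedestrian interactions. Accurately anticipating pedestrian behavior in such unstructured environments is therefore critical for safety and efficiency. This paper presents a virtual reality (VR) study that captures how pedestrians interact with automated shuttles across diverse scenarios, including varying approach angles and navigating continuous traffic. We identify key behavior patterns in pedestrians' decision-making in shared spaces, including hesitation, evasive maneuvers, gaze allocation, and proxemic adjustments. To model pedestrian behavior, we propose GazeX-LSTM, a multimodal, eye gaze-informed and context-aware prediction model that integrates pedestrian trajectories, fine-grained eye gaze dynamics, and contextual factors. By using eye-tracking data to capture pedestrian attention, we shift prediction from a vehicle-centered to a human-centered perspective. We systematically validate the unique and irreplaceable predictive power of eye gaze over head orientation alone, and further improve performance by integrating contextual variables. Notably, combining eye gaze data with contextual information yields super-additive improvements in pedestrian behavior prediction accuracy, revealing a complementary relationship between visual attention and situational context. Together, our findings provide the first evidence that eye gaze-informed modeling fundamentally advances pedestrian behavior prediction, and they highlight the critical role of situational context in shared-space interactions. This paves the way for safer and more adaptive automated vehicle technologies that account for how people perceive and act in complex shared spaces.

Abstract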

The integration of Automated Shuttles into shared urban spaces presents unique challenges due to the absence of traffic rules and the complex pedestrian interactions. Accurately anticipating pedestrian behavior in such unstructured environments is therefore critical for ensuring both safety and efficiency. This paper presents a Virtual Reality (VR) study that captures how pedestrians interact with automated shuttles across diverse scenarios, including varying approach angles and navigating in continuous traffic. We identify critical behavior patterns present in pedestrians’ decision-making in shared spaces, including hesitation, evasive maneuvers, gaze allocation, and proxemic adjustments. To model pedestrian behavior, we propose GazeX-LSTM, a multimodal eye gaze-informed and context-aware prediction model that integrates pedestrians’ trajectories, fine-grained eye gaze dynamics, and contextual factors. We shift prediction from a vehicle- to a human-centered perspective by leveraging eye-tracking data to capture pedestrian attention. We systematically validate the unique and irreplaceable predictive power of eye gaze over head orientation alone, further enhancing performance by integrating contextual variables. Notably, the combination of eye gaze data and contextual information produces super-additive improvements on pedestrian behavior prediction accuracy, revealing the complementary relationship between visual attention and situational contexts. Together, our findings provide the first evidence that eye gaze-informed modeling fundamentally advances pedestrian behavior prediction and highlight the critical role of situational contexts in shared-space interactions. This paves the way for safer and more adaptive automated vehicle technologies that account for how people perceive and act in complex shared spaces.

Keywords: pedestrian trajectory prediction, eye gaze tracking, automated shuttles, shared spaces, virtual reality study, context-aware modeling, GazeX-LSTM, human-vehicle interaction

225. ❌ Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training

Authors: Giacomo Borghi, Hyesung Im, Lorenzo Pareschi Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19808v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	5.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

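Scoring rationale: The paper develops a theoretical framework for the population dynamics of neural network training, specifically two-time-scale learning dynamics, and analyses population-based learning paradigms such as Population-Based Training (PBT) and model-merging methods. It is therefore moderately related (5/10) only to the keyword 'Model Merging OR Model Soups OR Weight Averaging': the paper mentions model-merging methods as one population-based learning paradigm, but they are not its core theoretical contribution. All other keywords are unrelated to the paper (0/10), because it does not involve large models, innovations in the technical principles of deep learning, AI-for-science applications, or any specific large-model technique (e.g. LLMs, MoE, SFT, RLHF, RAG).

!!! tip deepseek-chat TL;DR

The paper proposes a theoretical framework for neural network training based on two-time-scale population dynamics. It models the evolution of parameters and hyperparameters, proves the large-population limit of their joint distribution, derives a selection-mutation equation for the hyperparameter density under strong time-scale separation, and clarifies the role of noise and diversity in balancing optimisation and exploration.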

摘要翻译

基于种群的优化范式，包括进化策略、种群训练法以及近期的模型融合方法，将快速的模型内优化与缓慢的种群层面适应相结合。尽管这些方法在实践中取得了成功，但对其集体训练动态的通用数学描述仍不完善。我们提出了一个基于双时间尺度种群动态的神经网络训练理论框架。我们将神经网络种群建模为一个交互智能体系统，其中网络参数通过SGD/朗之万类型的快速噪声梯度更新而演化，而超参数则通过更缓慢的选择-突变动态演化。我们证明了参数与超参数联合分布的大种群极限，并在强时间尺度分离条件下，推导出超参数密度的选择-突变方程。对于每个固定的超参数，快速参数动态会弛豫至一个玻尔兹曼-吉布斯测度，从而为慢速演化诱导出一个有效适应度。该平均化动态将基于种群的学习与双层优化及经典复制子-突变子模型联系起来，给出了种群均值向最适超参数移动的条件，并阐明了噪声和多样性在平衡优化与探索中的作用。数值实验同时说明了大规模种群体系与简化后的双时间尺度动态，并表明获取有效适应度（无论是闭式形式还是通过种群层面估计）能够改进种群层面的更新。

摘要 (Abstract)

Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection–mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection–mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann–Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator–mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.

Keywords: two-time-scale learning dynamics, population-based learning, neural network training, model-merging methods, selection-mutation dynamics, large-population limit, hyperparameter evolution, effective fitness

226. ❌ Quantifying Gate Contribution in Quantum Feature Maps for Scalable Circuit Optimization

Authors: F. Rodríguez-Díaz, D. Gutiérrez-Avilés, A. Troncoso, F. Martínez-Álvarez Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19805v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

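Scoring rationale: The paper focuses on circuit optimization in quantum machine learning (QML), specifically methods for optimizing quantum feature maps. All of the given keywords concern large language models (LLMs), the technical principles of deep learning, or applications of AI in science (such as bioinformatics and cheminformatics), whereas this paper studies machine learning for quantum computing, a different technical paradigm. It does not touch on large models, deep learning, or AI for Science (e.g. bio/cheminformatics), so every keyword scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes GATE, a quantum circuit optimization method that shrinks quantum feature maps by quantifying each gate's contribution. Across three execution scenarios it reduces circuit size and runtime while, in many cases, preserving or improving predictive accuracy.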

摘要翻译

量子机器学习为分类任务提供了前景广阔的优势，但当前设备中的噪声、退相干和连接性限制仍制约着基于特征映射线路的高效执行。本文提出门评估与阈值判定（GATE）作为一种线路优化方法，通过一种新颖的门显著性指标来简化量子特征映射。该指标结合保真度、纠缠度和灵敏度来量化每个门的重要性，其构建适用于两种环境：一是可访问量子态的模拟器/仿真器环境，二是真实硬件环境——在后一环境中，这些量值需通过测量结果和辅助线路进行估计。该方法迭代扫描阈值范围，剔除贡献度较低的门，生成优化的量子机器学习模型，并依据准确率、运行时间及平衡性能指标对其进行排序，最终进行测试验证。研究使用两种代表性量子机器学习模型（PegasosQSVM与量子神经网络），在三种执行场景（无噪声模拟、基于IBM后端生成的含噪声仿真、真实IBM量子硬件）中对现实世界分类数据集进行了评估。分析了特征映射中门移除的结构性影响，探讨了该方法与噪声缓解技术的兼容性，并借助基于密度矩阵、矩阵乘积态、张量网络及真实设备的多种方法评估了指标计算的可扩展性。结果表明，该方法能持续减小线路规模和运行时间，且在多数情况下保持或提升了预测准确率；最佳平衡点通常出现在中等阈值水平，而非基线线路或过度压缩的线路中。

摘要 (Abstract)

Quantum machine learning offers promising advantages for classification tasks, but noise, decoherence, and connectivity constraints in current devices continue to limit the efficient execution of feature map-based circuits. Gate Assessment and Threshold Evaluation (GATE) is presented as a circuit optimization methodology that reduces quantum feature maps using a novel gate significance index. This index quantifies the relevance of each gate by combining fidelity, entanglement, and sensitivity. It is formulated for both simulator/emulator environments, where quantum states are accessible, and for real hardware, where these quantities are estimated from measurement results and auxiliary circuits. The approach iteratively scans a threshold range, eliminates low-contribution gates, generates optimized quantum machine learning models, and ranks them based on accuracy, runtime, and a balanced performance criterion before final testing. The methodology is evaluated on real-world classification datasets using two representative quantum machine learning models, PegasosQSVM and Quantum Neural Network, in three execution scenarios: noise-free simulation, noisy emulation derived from an IBM backend, and real IBM quantum hardware. The structural impact of gate removal in feature maps is examined, compatibility with noise-mitigation techniques is studied, and the scalability of index computation is evaluated using approaches based on density matrices, matrix product states, tensor networks, and real-world devices. The results show consistent reductions in circuit size and runtime and, in many cases, preserved or improved predictive accuracy, with the best trade-offs typically occurring at intermediate thresholds rather than in the baseline circuits or in those compressed more aggressively.

Keywords: Quantum Machine Learning, Circuit Optimization, Gate Significance Index, Feature Maps, Quantum Neural Network, PegasosQSVM, Noise Mitigation, Scalability

227. ❌ Scalable Learning of Multivariate Distributions via Coresets

Authors: Zeyu Ding, Katja Ickstadt, Nadja Klein, Alexander Munteanu, Simon Omlor Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19792v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

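Scoring rationale: The paper focuses on coreset methods in statistics and machine learning, used to improve the scalability and training efficiency of multivariate conditional transformation models. It covers non-parametric/semi-parametric regression, density estimation, data reduction, and importance sampling, but mentions none of the keyword topics: large models, deep learning, language models, alignment, reasoning, agents, compression, or AI for science. All keywords concern large-model techniques or their scientific applications, whereas this is purely a statistical machine learning methods paper, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes a novel coreset construction for multivariate conditional transformation models. It reduces data via importance sampling and markedly improves computational efficiency on large, complex datasets while preserving statistical accuracy.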

摘要翻译

高效且可扩展的非参数或半参数回归分析与密度估计对统计学和机器学习领域至关重要。然而，现有方法在处理大规模数据时能力有限。我们通过为多元条件转换模型（Multivariate Conditional Transformation Models, MCTMs）开发一种新颖的核心集（coreset）构建方法来解决这一问题，以提升其可扩展性与训练效率。据我们所知，这是首次为半参数分布模型构建核心集。我们的方法通过重要性采样实现了显著的数据缩减，并以高概率确保对数似然保持在$(1\pm\varepsilon)$的乘法误差界内，从而维持了统计模型的准确性。与先前已引入核心集的全参数模型相比，我们的半参数方法展现出更强的适应性，尤其在存在复杂分布和非线性关系但尚未被完全理解的场景中。为解决与对数项归一化相关的数值问题，我们采用了一种基于输入数据凸包（convex hull）的几何近似方法。这确保了在涉及大量数据的场景中进行可行、稳定且准确的推断。数值实验表明，在处理大规模复杂数据集时，计算效率得到显著提升，从而为统计学和机器学习领域的广泛应用奠定了基础。

摘要 (Abstract)

Efficient and scalable non-parametric or semi-parametric regression analysis and density estimation are of crucial importance to the fields of statistics and machine learning. However, available methods are limited in their ability to handle large-scale data. We address this issue by developing a novel coreset construction for multivariate conditional transformation models (MCTMs) to enhance their scalability and training efficiency. To the best of our knowledge, these are the first coresets for semi-parametric distributional models. Our approach yields substantial data reduction via importance sampling. It ensures with high probability that the log-likelihood remains within multiplicative error bounds of $(1\pm\varepsilon)$ and thereby maintains statistical model accuracy. Compared to conventional full-parametric models, where coresets have been incorporated before, our semi-parametric approach exhibits enhanced adaptability, particularly in scenarios where complex distributions and non-linear relationships are present, but not fully understood. To address numerical problems associated with normalizing logarithmic terms, we follow a geometric approximation based on the convex hull of input data. This ensures feasible, stable, and accurate inference in scenarios involving large amounts of data. Numerical experiments demonstrate substantially improved computational efficiency when handling large and complex datasets, thus laying the foundation for a broad range of applications within the statistics and machine learning communities.

Keywords: coresets, multivariate conditional transformation models, scalability, importance sampling, log-likelihood, semi-parametric models, data reduction, computational efficiency
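The data-reduction idea in the abstract above can be illustrated with a generic importance-sampling sketch: draw points with probability proportional to an importance score and reweight each draw by the inverse of its sampling probability, so that a weighted sum over the sample is an unbiased estimate of the sum over all data. This is only a sketch under stated assumptions. The function names, the toy per-point loss, and the importance scores below are hypothetical and are not the sensitivity scores derived in the paper, and the convex-hull approximation the abstract mentions is not shown.

```python
import random


def importance_sample_coreset(scores, m, rng):
    """Draw m indices with probability proportional to `scores`.

    Returns the sampled indices and weights 1 / (m * p_i), which make a
    weighted sum over the sample an unbiased estimate of the full sum.
    """
    total = sum(scores)
    probs = [s / total for s in scores]
    idx = rng.choices(range(len(scores)), weights=probs, k=m)
    weights = [1.0 / (m * probs[i]) for i in idx]
    return idx, weights


def weighted_sum(values, idx, weights):
    """Weighted sum of `values` over the sampled indices."""
    return sum(w * values[i] for i, w in zip(idx, weights))


rng = random.Random(0)
# Toy data (hypothetical): a non-negative per-point loss standing in for
# the negative log-likelihood contribution of each observation.
xs = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
losses = [0.5 * x * x + 1.0 for x in xs]

# Hypothetical importance scores: each point's loss plus the mean loss,
# which mixes loss-proportional and uniform sampling.
mean_loss = sum(losses) / len(losses)
scores = [loss + mean_loss for loss in losses]

idx, weights = importance_sample_coreset(scores, 500, rng)
full_sum = sum(losses)
coreset_sum = weighted_sum(losses, idx, weights)
print(f"full: {full_sum:.1f}  coreset estimate: {coreset_sum:.1f}")
```

Here 500 weighted points stand in for 10,000. A guarantee of the $(1\pm\varepsilon)$ kind stated in the abstract depends on how the scores are chosen, which this sketch does not attempt to reproduce.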

228. ❌ Learning from Similarity/Dissimilarity and Pairwise Comparison

Authors: Tomoya Tate, Kosuke Sugiyama, Masato Uchida Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19713v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

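Scoring rationale: The paper studies binary classification under weak supervision and proposes SD-Pcomp, a new framework based on similarity/dissimilarity and pairwise comparison labels. It is entirely about weakly supervised learning, risk estimation, and classification algorithms in classical machine learning, and does not touch any keyword area: large language models, deep learning architectures, training methods, inference optimization, alignment techniques, AI agents, or AI-for-science applications. All keywords relate to large models and deep learning, whereas this is classical machine learning research, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes SD-Pcomp, a weakly supervised binary classification framework based on similarity/dissimilarity and pairwise comparison labels. By developing unbiased risk estimators, it improves classification performance over methods that use a single weak label and is robust to label noise and to uncertainty in class prior estimation.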

摘要翻译

本文针对难以获取显式实例级标签的场景，通过利用定义在实例对上的多重弱标签来解决二分类问题。现有的SconfConfDiff分类框架依赖于连续取值的概率监督信息，包括相似度置信度（即类别一致概率）和置信度差异（即正类概率之差）。然而，概率标注需要主观的不确定性量化，常导致监督信号不稳定。我们提出SD-Pcomp分类方法，这是一种基于二元判断的弱监督学习框架，仅依赖于两种相对判断：两个实例间的类别一致性以及面向正类的成对偏好关系。该方法采用相似性/相异性（SD）标签和成对比较（Pcomp）标签，并构建了两个无偏风险估计量：（i）SD与Pcomp的凸组合形式；（ii）通过建模二者关系将两种标签统一整合的估计量。理论分析与实验结果表明，所提方法相较于使用单一弱标签的分类方法提升了性能，并且对标签噪声及类先验估计的不确定性具有鲁棒性。

摘要 (Abstract)

This paper addresses binary classification in scenarios where obtaining explicit instance level labels is impractical, by exploiting multiple weak labels defined on instance pairs. The existing SconfConfDiff classification framework relies on continuous valued probabilistic supervision, including similarity-confidence, the probability of class agreement, and confidence-difference, the difference in positive class probabilities. However, probabilistic labeling requires subjective uncertainty quantification, often leading to unstable supervision. We propose SD-Pcomp classification, a binary judgment based weakly supervised learning framework that relies only on relative judgments, namely class agreement between two instances and pairwise preference toward the positive class. The method employs Similarity/Dissimilarity (SD) labels and Pairwise Comparison (Pcomp) labels, and develops two unbiased risk estimators, (i) a convex combination of SD and Pcomp and (ii) a unified estimator that integrates both labels by modeling their relationship. Theoretical analysis and experimental results show that the proposed approach improves classification performance over methods using a single weak label, and is robust to label noise and uncertainty in class prior estimation.

Keywords: weakly supervised learning, binary classification, similarity/dissimilarity labels, pairwise comparison, unbiased risk estimator, label noise robustness, class prior estimation, SD-Pcomp framework

229. ❌ A two-step sequential approach for hyperparameter selection in finite context models

Authors: José Contente, Ana Martins, Armando J. Pinho, Sónia Gouveia Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19736v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

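Scoring rationale: The paper studies hyperparameter selection for finite-context models (FCMs), a classical sequence modelling and compression technique applied mainly to symbolic data such as DNA sequences. It does not involve large language models (LLMs), deep learning, the technical principles of large models, or novel applications of them in science. All scoring keywords concern modern AI techniques such as large models, deep learning, AI alignment, reasoning, agents, and optimization, whereas this paper focuses on classical statistical modelling and compression algorithms, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper proposes a two-step sequential approach to hyperparameter selection in finite-context models (FCMs). It decomposes the joint optimization problem and uses statistical dependence measures, substantially reducing computational cost while keeping compression performance comparable to exhaustive grid search.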

摘要翻译

有限上下文模型（FCMs）被广泛用于压缩DNA等符号序列，其预测性能关键取决于上下文长度k和平滑参数α。在实际应用中，这些超参数通常通过穷举搜索来选择，这种方法计算成本高昂，且随着模型复杂度的增加，其扩展性较差。本文提出一种基于统计原理的两步序贯方法，用于在FCMs中高效选择超参数。其核心思想是将联合优化问题分解为两个独立的阶段。首先，利用分类序列依赖性度量（包括Cramér’s ν、Cohen’s κ和偏互信息（pami））估计上下文长度k。其次，在选定上下文长度k的条件下，通过最大似然估计平滑参数α。研究在多种（k, α）配置下，针对FCMs生成的人工合成符号序列（考虑四字母表及不同样本量）进行了模拟实验。结果表明，依赖性度量对k的变化比对α的变化敏感得多，这支持了序贯估计策略。正如预期，超参数估计的准确性随样本量的增加而提高。此外，所提方法在平均比特率（每符号比特数）方面达到了与穷举网格搜索相当的压缩性能，同时显著降低了计算成本。总体而言，模拟数据上的结果表明，所提出的序贯方法是FCMs中穷举超参数调优的一种实用且计算高效的替代方案。

摘要 (Abstract)

Finite-context models (FCMs) are widely used for compressing symbolic sequences such as DNA, where predictive performance depends critically on the context length k and smoothing parameter α. In practice, these hyperparameters are typically selected through exhaustive search, which is computationally expensive and scales poorly with model complexity. This paper proposes a statistically grounded two-step sequential approach for efficient hyperparameter selection in FCMs. The key idea is to decompose the joint optimization problem into two independent stages. First, the context length k is estimated using categorical serial dependence measures, including Cramér’s ν, Cohen’s κ and partial mutual information (pami). Second, the smoothing parameter α is estimated via maximum likelihood conditional on the selected context length k. Simulation experiments were conducted on synthetic symbolic sequences generated by FCMs across multiple (k, α) configurations, considering a four-letter alphabet and different sample sizes. Results show that the dependence measures are substantially more sensitive to variations in k than in α, supporting the sequential estimation strategy. As expected, the accuracy of the hyperparameter estimation improves with increasing sample size. Furthermore, the proposed method achieves compression performance comparable to exhaustive grid search in terms of average bitrate (bits per symbol), while substantially reducing computational cost. Overall, the results on simulated data show that the proposed sequential approach is a practical and computationally efficient alternative to exhaustive hyperparameter tuning in FCMs.

Keywords: finite-context models, hyperparameter selection, sequential approach, context length, smoothing parameter, compression, computational efficiency, symbolic sequences
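To make the roles of the context length k and the smoothing parameter α concrete, here is a minimal sketch of an adaptive order-k finite-context model with additive smoothing, reporting its average bitrate in bits per symbol. It assumes the common estimator P(s | c) = (n(c, s) + α) / (n(c) + α·|A|), where n(c, s) counts how often symbol s has followed context c. The abstract does not spell out the estimator, and the function name and toy sequence are hypothetical. The paper's two-step selection procedure (dependence measures for k, then maximum likelihood for α) is not implemented here.

```python
import math
from collections import defaultdict


def fcm_bits_per_symbol(seq, k, alpha, alphabet):
    """Average code length (bits per symbol) of an adaptive order-k FCM."""
    # counts[context][symbol] = times `symbol` has followed `context` so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i in range(k, len(seq)):
        context, symbol = seq[i - k:i], seq[i]
        seen = counts[context]
        n_context = sum(seen.values())
        # Additive smoothing: alpha pseudo-counts for each alphabet symbol
        p = (seen[symbol] + alpha) / (n_context + alpha * len(alphabet))
        total_bits -= math.log2(p)
        seen[symbol] += 1  # update only after coding, as a decoder could
    return total_bits / (len(seq) - k)


# Toy sequence (hypothetical data): a strictly periodic four-letter string
toy = "ACGT" * 200
bits_k0 = fcm_bits_per_symbol(toy, k=0, alpha=1.0, alphabet="ACGT")
bits_k1 = fcm_bits_per_symbol(toy, k=1, alpha=1.0, alphabet="ACGT")
print(f"k=0: {bits_k0:.3f} bits/symbol, k=1: {bits_k1:.3f} bits/symbol")
```

On this toy input the order-0 model stays near 2 bits per symbol, while one symbol of context makes the next symbol almost deterministic. An exhaustive search would evaluate this function over a full (k, α) grid, which is the cost the two-step approach aims to avoid.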

230. ❌ Minimax and Adaptive Covariance Matrix Estimation under Differential Privacy

Authors: T. Tony Cai, Yicheng Li Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19703v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

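Scoring rationale: The paper studies minimax and adaptive estimation of high-dimensional bandable covariance matrices under differential privacy constraints, which belongs to statistics and privacy protection. It does not involve large models, deep learning, the technical principles of AI, or AI-for-science applications. None of the keywords (large-model techniques, training methods, inference optimization, AI applications) is relevant, so every keyword has relevance 0.

!!! tip deepseek-chat TL;DR

The paper studies minimax and adaptive estimation of high-dimensional bandable covariance matrices under differential privacy constraints, proposes a novel differentially private blockwise tridiagonal estimator, and establishes optimal convergence rates.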

摘要翻译

协方差矩阵在高维数据分析中起着基础性作用。本文研究了差分隐私约束下高维带状协方差矩阵的极小极大估计与自适应估计。我们提出了一种新颖的差分隐私块状三对角估计器，该估计器在算子范数和Frobenius范数下均达到了极小极大最优收敛率。与非隐私设定相比，隐私引入的误差表现出对环境维度的多项式依赖，揭示了隐私保护的显著额外代价。
为证明最优性，我们发展了一种新的差分隐私van Trees不等式，并通过精心设计的先验分布构造了匹配的极小极大下界。所提出的隐私van Trees不等式可更广泛地应用于一般隐私估计问题，具有独立的理论价值。基于一种新颖的层次化三对角方法，我们进一步提出了一种自适应估计器，该估计器在无需预先知晓衰减参数的情况下，能以对数因子达到最优收敛率。数值实验验证了理论结果，并阐明了隐私与精度之间根本性的权衡关系。

摘要 (Abstract)

The covariance matrix plays a fundamental role in the analysis of high-dimensional data. This paper studies minimax and adaptive estimation of high-dimensional bandable covariance matrices under differential privacy constraints. We propose a novel differentially private blockwise tridiagonal estimator that achieves minimax-optimal convergence rates under both the operator norm and the Frobenius norm. In contrast to the non-private setting, the privacy-induced error exhibits a polynomial dependence on the ambient dimension, revealing a substantial additional cost of privacy. To establish optimality, we develop a new differentially private van Trees inequality and construct carefully designed prior distributions to obtain matching minimax lower bounds. The proposed private van Trees inequality applies more broadly to general private estimation problems and is of independent interest. We further introduce an adaptive estimator that attains the optimal rate up to a logarithmic factor without prior knowledge of the decay parameter, based on a novel hierarchical tridiagonal approach. Numerical experiments corroborate the theoretical results and illustrate the fundamental privacy-accuracy trade-off.

Keywords: covariance matrix estimation, differential privacy, minimax optimality, high-dimensional data, adaptive estimation, privacy-accuracy trade-off, bandable covariance matrices

231. ❌ Regret Analysis of Sleeping Competing Bandits

Authors: Shinnosuke Uba, Yutaro Yamaguchi Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19700v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

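Scoring rationale: The paper studies an online learning framework that combines multi-armed bandits with stable matching, at the intersection of classical machine learning and game theory. It does not involve the techniques or applications covered by the keywords, such as large models, deep learning, or AI for Science. The paper focuses on theoretical analysis of algorithms (proofs of regret bounds) and is unrelated to every scoring keyword.

!!! tip deepseek-chat TL;DR

The paper formulates the Sleeping Competing Bandits model, analyses online learning when the availability of players and arms varies over time, and gives an algorithm whose regret bound is asymptotically optimal when the number of arms is relatively larger than the number of players.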

摘要翻译

竞争性赌博机框架是近期兴起的研究领域，它将在线学习中的多臂赌博机问题与博弈论中的稳定匹配理论相结合。传统模型通常假设所有参与者和选项始终可用，但在现实问题中，参与者和选项的可用性可能随时间任意变化。本文将此情境建模为休眠竞争性赌博机问题。为分析该问题，我们自然地扩展了现有竞争性赌博机研究中使用的遗憾定义，并推导出所提出模型的遗憾界。我们提出一种算法，在合理假设下同时达到$\mathrm{O}\left(NK\log T_{i}/\Delta^2\right)$的渐近遗憾界，其中$N$为参与者数量，$K$为选项数量，$T_{i}$为每个参与者$p_i$的回合数，$\Delta$为最小奖励间隔。在相同假设下，我们还给出了$\Omega\left( N(K-N+1)\log T_{i}/\Delta^2 \right)$的遗憾下界。这表明当选项数量$K$相对大于参与者数量$N$时，我们的算法具有渐近最优性。

摘要 (Abstract)

The Competing Bandits framework is a recently emerging area that integrates multi-armed bandits in online learning with stable matching in game theory. While conventional models assume that all players and arms are constantly available, in real-world problems, their availability can vary arbitrarily over time. In this paper, we formulate this setting as Sleeping Competing Bandits. To analyze this problem, we naturally extend the regret definition used in existing competing bandits and derive regret bounds for the proposed model. We propose an algorithm that simultaneously achieves an asymptotic regret bound of $\mathrm{O}\left(NK\log T_{i}/\Delta^2\right)$ under reasonable assumptions, where $N$ is the number of players, $K$ is the number of arms, $T_{i}$ is the number of rounds of each player $p_i$, and $\Delta$ is the minimum reward gap. We also provide a regret lower bound of $\Omega\left( N(K-N+1)\log T_{i}/\Delta^2 \right)$ under the same assumptions. This implies that our algorithm is asymptotically optimal in the regime where the number of arms $K$ is relatively larger than the number of players $N$.

Keywords: Competing Bandits, Sleeping Bandits, Multi-armed Bandits, Stable Matching, Regret Analysis, Online Learning, Asymptotic Optimality, Game Theory
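To see why the upper and lower bounds in the abstract above match when $K$ is large relative to $N$, it may help to take their ratio. The short derivation below ignores the constants hidden in the $\mathrm{O}$ and $\Omega$ notation, which the abstract does not specify, and assumes $K \ge N$. It is a sanity check under those assumptions rather than a statement taken from the paper.

```latex
\[
\frac{NK\log T_{i}/\Delta^{2}}{N(K-N+1)\log T_{i}/\Delta^{2}}
  = \frac{K}{K-N+1}
  = 1 + \frac{N-1}{K-N+1}.
\]
% For K \ge 2N-2 we have K-N+1 \ge N-1, so the ratio is at most 2;
% as K/N \to \infty the ratio tends to 1.
```

So the gap between the two bounds is at most a constant factor once $K \ge 2N-2$, and it vanishes as $K/N$ grows.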

232. ❌ Diminishing Returns in Expanding Generative Models and Godel-Tarski-Lob Limits

Authors: Angshul Majumdar Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19687v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	5.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	8.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

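Rationale: The paper studies the limits of capability growth in expanding generative systems (including Transformer-based models). It is highly relevant to "Scaling Laws AND Data Quality" (8) because it analyzes diminishing marginal returns as model capacity expands, and moderately relevant to "Large Language Models OR LLMs OR Foundation Models" (5) because it cites transformer-based models as example systems. The remaining keywords concern specific techniques (MoE, SFT, RAG, etc.) or application areas (such as AI for Science) that the paper does not address, so they all score 0.

!!! tip deepseek-chat TL;DR

The paper studies the limits of capability growth in expanding generative models (including Transformer models). It proves that the marginal improvement in solvable tasks must converge to zero as system capacity increases, and shows that sufficiently expressive reasoning systems always retain unresolved logical tasks.

Abstract (Chinese translation)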

现代生成建模系统正通过扩展模型容量、训练数据和计算资源而持续改进。尽管实证研究已记录了包括生成对抗网络、变分自编码器、基于Transformer的模型以及扩散模型在内的多种架构中的此类扩展规律，但生成系统在扩展过程中能力增长的理论极限仍鲜为人知。
本文提出一个通用的任务空间框架，用于分析扩展中的生成推理系统。每个系统会诱导出全局任务空间的一个子集，该子集代表系统能够成功解决的任务，而系统能力通过固定任务分布下该已解决任务集的概率质量来衡量。在此框架内，我们证明了一个结构性结论：在温和假设下，随着系统容量的增加，已解决任务的边际改进必然收敛于零。因此，扩展的生成系统可能持续获得能力，但新可解任务的概率质量必然渐近递减。
我们进一步基于受算法概率启发的复杂度加权假设类，提出一种预测理论上的细化框架，从而在预测场景中得出边际改进的定量界限。最后，我们考察逻辑推理任务，并证明数理逻辑中的经典结论——包括罗瑟不完备性定理、塔斯基不可定义定理以及勒布定理——意味着在足够富有表达力的推理系统中，始终存在无法解决的逻辑任务。
这些结论共同为扩展生成系统的渐近行为提供了数学视角，表明长期能力增长既受任务覆盖范围边际效益递减的制约，也受内部推理的基本逻辑局限性的约束。

Abstract

Modern generative modelling systems are increasingly improved by expanding model capacity, training data, and computational resources. While empirical studies have documented such scaling behaviour across architectures including generative adversarial networks, variational autoencoders, transformer-based models, and diffusion models, the theoretical limits of capability growth in expanding generative systems remain poorly understood. In this paper we develop a general task-space framework for analysing expanding generative reasoning systems. Each system induces a subset of a global task space representing the tasks it can successfully solve, and system capability is measured by the probability mass of this solved-task set under a fixed task distribution. Within this framework we prove a structural result showing that, under mild assumptions, the marginal improvement in solved tasks must converge to zero as system capacity increases. Thus expanding generative systems may continue to gain capability, but the probability mass of newly solvable tasks necessarily diminishes asymptotically. We further provide a prediction-theoretic refinement based on complexity-weighted hypothesis classes inspired by algorithmic probability, yielding quantitative bounds on marginal improvement in prediction settings. Finally, we examine logical reasoning tasks and show that classical results from mathematical logic – including Rosser incompleteness, Tarski’s undefinability theorem, and Löb’s theorem – imply the persistence of unresolved logical tasks within sufficiently expressive reasoning systems. Together these results provide a mathematical perspective on the asymptotic behaviour of expanding generative systems, showing that long-run capability growth is constrained both by diminishing marginal improvements in task coverage and by fundamental logical limitations on internal reasoning.

Keywords: generative models, scaling laws, diminishing returns, task-space framework, logical limitations, transformer-based models, capability growth, asymptotic behavior
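
The diminishing-returns claim in the abstract can be illustrated with a short derivation. This is a sketch under an assumption that is not taken from the paper, whose "mild assumptions" the abstract does not spell out: the solved-task sets are taken to be nested, $S_1 \subseteq S_2 \subseteq \dots$, and $\mu$ denotes the fixed task distribution.

```latex
c_n := \mu(S_n), \qquad 0 \le c_1 \le c_2 \le \dots \le 1
\;\Longrightarrow\; c_n \to c_\infty \le 1,
\qquad
\sum_{n=1}^{\infty} \left(c_{n+1} - c_n\right) = c_\infty - c_1 \le 1
\;\Longrightarrow\; c_{n+1} - c_n \to 0.
```

A bounded, non-decreasing sequence converges, and the terms of a convergent series tend to zero, so the probability mass of newly solved tasks must vanish even though $c_n$ may keep increasing.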

233. ❌ Ontology-Based Knowledge Modeling and Uncertainty-Aware Outdoor Air Quality Assessment Using Weighted Interval Type-2 Fuzzy Logic

Authors: Md Inzmam, Ritesh Chandra, Sadhana Tiwari, Sonali Agarwal, Triloki Pant Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19683v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	8.0/10	0.0

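Rationale: The paper focuses on air quality assessment using ontologies and fuzzy logic, at the intersection of environmental science and AI. It does not involve large models, deep learning principles, or the specific techniques named in the keywords (LLMs, MoE, RLHF, etc.). "Mechanistic Interpretability OR Explainable AI" receives 5 because the paper mentions explainable semantic reasoning, although that is not its core focus. "AI for Science OR Bioinformatics OR Cheminformatics" receives 8 because the paper applies AI to an environmental science problem. All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The study proposes a hybrid framework based on an ontology and Weighted Interval Type-2 Fuzzy Logic to handle uncertainty in Air Quality Index assessment, improving classification reliability and uncertainty handling over traditional approaches.

Abstract (Chinese translation)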

室外空气污染是环境与公众健康领域的重大关切问题，在城市化快速推进的区域尤为突出。印度空气质量指数（IND-AQI）由中央污染控制委员会（CPCB）制定，是基于细颗粒物（PM2.5）、可吸入颗粒物（PM10）、二氧化氮（NO2）、二氧化硫（SO2）、臭氧（O3）、一氧化碳（CO）和氨气（NH3）等污染物的标准化空气质量报告体系。然而，传统的AQI计算采用明确阈值与确定性聚合规则，难以有效处理不确定性与类别间的过渡问题。为应对这些局限，本研究提出一种基于混合本体的不确定性感知框架，该框架将加权区间二型模糊逻辑与语义知识建模相结合。区间二型模糊集用于刻画AQI类别边界附近的不确定性，而污染物重要性权重则通过区间二型模糊层次分析法（IT2-FAHP）确定，以反映其相对健康影响。此外，研究构建了基于OWL的空气质量本体，该本体扩展了语义传感器网络（SSN）本体，用于表征污染物、监测站点、AQI类别、监管标准及环境治理行动。通过SWRL规则实现语义推理，并借助SPARQL查询进行验证，以推断AQI类别、健康风险及建议的缓解措施。基于CPCB空气质量数据集的实验评估表明，与传统明确阈值及一型模糊方法相比，所提框架提升了AQI分类的可靠性与不确定性处理能力，同时为空气质量监测系统提供了可解释的语义推理与智能决策支持。

Abstract

Outdoor air pollution is a major concern for the environment and public health, especially in areas where urbanization is taking place rapidly. The Indian Air Quality Index (IND-AQI), developed by the Central Pollution Control Board (CPCB), is a standardized reporting system for air quality based on pollutants such as PM2.5, PM10, nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), carbon monoxide (CO), and ammonia (NH3). However, the traditional calculation of the AQI uses crisp thresholds and deterministic aggregation rules, which are not suitable for handling uncertainty and transitions between classes. To address these limitations, this study proposes a hybrid ontology-based uncertainty-aware framework integrating Weighted Interval Type-2 Fuzzy Logic with semantic knowledge modeling. Interval Type-2 fuzzy sets are used to model uncertainty near AQI class boundaries, while pollutant importance weights are determined using Interval Type-2 Fuzzy Analytic Hierarchy Process (IT2-FAHP) to reflect their relative health impacts. In addition, an OWL-based air quality ontology extending the Semantic Sensor Network (SSN) ontology is developed to represent pollutants, monitoring stations, AQI categories, regulatory standards, and environmental governance actions. Semantic reasoning is implemented using SWRL rules and validated through SPARQL queries to infer AQI categories, health risks, and recommended mitigation actions. Experimental evaluation using CPCB air quality datasets demonstrates that the proposed framework improves AQI classification reliability and uncertainty handling compared with traditional crisp and Type-1 fuzzy approaches, while enabling explainable semantic reasoning and intelligent decision support for air quality monitoring systems.

Keywords: Ontology-based knowledge modeling, Interval Type-2 Fuzzy Logic, Air Quality Index, Uncertainty handling, Semantic reasoning, OWL ontology, SSN ontology, Environmental monitoring
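
The idea of an interval type-2 fuzzy set near a class boundary can be sketched as a pair of lower and upper membership functions. The sketch below is only an illustration: the triangular shape, the `spread` parameter, and the breakpoints 40/60/80 are hypothetical and are not taken from the paper or from the IND-AQI standard.

```python
from typing import Tuple


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Type-1 triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def it2_membership(x: float, a: float, b: float, c: float,
                   spread: float) -> Tuple[float, float]:
    """Return (lower, upper) membership grades for an interval type-2 set.

    The upper function widens the triangle by `spread` and the lower one
    narrows it, so the gap between them models boundary uncertainty.
    """
    if not (a + spread < b < c - spread):
        raise ValueError("spread is too large for the given breakpoints")
    lower = triangular(x, a + spread, b, c - spread)
    upper = triangular(x, a - spread, b, c + spread)
    return lower, upper


# Hypothetical category with feet at 40 and 80 and peak at 60.
lower, upper = it2_membership(50.0, 40.0, 60.0, 80.0, spread=5.0)
print(f"membership interval at 50: [{lower:.3f}, {upper:.3f}]")
```

A crisp threshold would assign the reading to exactly one class, whereas the interval reports a range of membership grades that widens near the class boundaries.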

234. ❌ Scale-Dependent Radial Geometry and Metric Mismatch in Wasserstein Propagation for Reverse Diffusion

Authors: Zicheng Lyu, Zengfeng Huang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19670v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

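Rationale: The paper develops geometric analysis and error-propagation theory for the reverse diffusion process of diffusion models. It belongs to mathematics and theoretical machine learning and centers on Wasserstein distance, SDEs, and geometric analysis. It does not involve large language models, deep learning technical principles, AI-for-science applications, or any of the specific techniques named in the keywords (MoE, RLHF, RAG, quantization, etc.), so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the geometric mismatch caused by Gaussian smoothing in reverse diffusion, proposes a metric conversion method based on a radial profile, and obtains non-asymptotic Wasserstein distance guarantees.

Abstract (Chinese translation)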

现有对逆向扩散的分析通常将采样误差沿整个逆向轨迹在支撑 $W_2$ 的欧几里得几何中传播。然而在弱对数凹性条件下，高斯平滑可首先在大间距处产生收缩，而小间距仍保持非耗散特性。因此首个可用的收缩是径向的而非欧几里得的，这导致早期发生收缩的几何结构与最终误差被度量的几何结构之间存在度量失配。我们通过学习得到的逆向漂移项的显式径向下轮廓来形式化这种失配：其远场极限给出收缩余量，近场极限给出主导直接 $W_2$ 传播的欧几里得载荷，而允许的切换时间由剩余平滑窗口上收缩余量的正值特性所刻画。我们通过单次切换的路由论证来利用此结构：在切换前，反射耦合在适应径向轮廓的凹传输度量中产生收缩；切换时，我们在 $p$ 阶矩预算下将该度量一次性转换回 $W_2$，随后在欧几里得几何中将转换后的差异在剩余短时间窗口内传播。对于在 $L^2$ 分数误差控制、分数误差的单边利普希茨条件以及标准适定性与耦合假设下学习得到的逆向随机微分方程离散化，我们获得了显式的非渐近端到端 $W_2$ 保证、标量化的切换选择目标，以及仿射尾凹函数类内转换指数的尖锐结构极限。

Abstract

Existing analyses of reverse diffusion often propagate sampling error in the Euclidean geometry underlying $W_2$ along the entire reverse trajectory. Under weak log-concavity, however, Gaussian smoothing can create contraction first at large separations while short separations remain non-dissipative. The first usable contraction is therefore radial rather than Euclidean, creating a metric mismatch between the geometry that contracts early and the geometry in which the terminal error is measured. We formalize this mismatch through an explicit radial lower profile for the learned reverse drift. Its far-field limit gives a contraction reserve, its near-field limit gives the Euclidean load governing direct $W_2$ propagation, and admissible switch times are characterized by positivity of the reserve on the remaining smoothing window. We exploit this structure with a one-switch routing argument. Before the switch, reflection coupling yields contraction in a concave transport metric adapted to the radial profile. At the switch, we convert once from this metric back to $W_2$ under a $p$-moment budget, and then propagate the converted discrepancy over the remaining short window in Euclidean geometry. For discretizations of the learned reverse SDE under $L^2$ score-error control, a one-sided Lipschitz condition of score error, and standard well-posedness and coupling hypotheses, we obtain explicit non-asymptotic end-to-end $W_2$ guarantees, a scalar switch-selection objective, and a sharp structural limit on the conversion exponent within the affine-tail concave class.

Keywords: reverse diffusion, Wasserstein distance, metric mismatch, radial geometry, contraction analysis, SDE discretization, non-asymptotic guarantees, Gaussian smoothing

235. ❌ Model Selection and Parameter Estimation of Multi-dimensional Gaussian Mixture Model

Authors: Xinyu Liu, Hai Zhang Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19657v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

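Rationale: The paper studies model selection and parameter estimation for multi-dimensional Gaussian Mixture Models (GMMs), which belongs to classical statistical machine learning. It does not involve large language models, deep learning, AI for Science, or any other technique or application covered by the keywords, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies model selection (determining the number of mixture components) and parameter estimation for multi-dimensional Gaussian mixture models. It proposes a thresholding-based estimation algorithm built on Fourier measurements, proves that its sample complexity matches the information-theoretic lower bound, and develops a gradient-based optimization method that achieves the optimal convergence rate for the parameters.

Abstract (Chinese translation)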

本文研究了多维高斯混合模型的学习问题，重点聚焦于模型阶数选择与混合分布的高效估计。我们首先建立了可靠模型选择所需临界样本复杂度的信息论下界。具体而言，我们证明区分一个$k$分量混合模型与更简单模型所需的样本量需满足$\Omega(\Delta^{-(4k-4)})$的缩放规律。随后，我们提出一种基于阈值判定的估计算法，该算法通过评估由随机傅里叶测量向量构建的经验协方差矩阵的谱间隙来实现模型选择。这一无参数估计器的时间复杂度为$\mathcal{O}(k^2 n)$，与样本量呈线性关系。我们证明该方法的样本复杂度与所建立的下界相匹配，从而验证了其在分量分离距离$\Delta$意义上的极小极大最优性。
在获得估计模型阶数的条件下，我们进一步提出基于梯度最小化的参数估计方法。为有效应对非凸目标函数的优化问题，我们采用数据驱动的分数初始化策略以保证快速收敛。理论证明该方法在估计分量均值时达到$\mathcal{O}_p(n^{-1/2})$的最优参数收敛速率。针对环境维度超过混合分量数（即$d > k$）的高维场景，我们引入主成分分析进行降维以提升算法效率。数值实验表明，基于傅里叶变换的算法框架在估计精度与计算时间上均优于传统的期望最大化方法。

Abstract

In this paper, we study the problem of learning multi-dimensional Gaussian Mixture Models (GMMs), with a specific focus on model order selection and efficient mixing distribution estimation. We first establish an information-theoretic lower bound on the critical sample complexity required for reliable model selection. More specifically, we show that distinguishing a $k$-component mixture from a simpler model necessitates a sample size scaling of $\Omega(\Delta^{-(4k-4)})$. We then propose a thresholding-based estimation algorithm that evaluates the spectral gap of an empirical covariance matrix constructed from random Fourier measurement vectors. This parameter-free estimator operates with an efficient time complexity of $\mathcal{O}(k^2 n)$, scaling linearly with the sample size. We demonstrate that the sample complexity of our method matches the established lower bound, confirming its minimax optimality with respect to the component separation distance $\Delta$. Conditioned on the estimated model order, we subsequently introduce a gradient-based minimization method for parameter estimation. To effectively navigate the non-convex objective landscape, we employ a data-driven, score-based initialization strategy that guarantees rapid convergence. We prove that this method achieves the optimal parametric convergence rate of $\mathcal{O}_p(n^{-1/2})$ for estimating the component means. To enhance the algorithm’s efficiency in high-dimensional regimes where the ambient dimension exceeds the number of mixture components (i.e., $d > k$), we integrate principal component analysis (PCA) for dimension reduction. Numerical experiments demonstrate that our Fourier-based algorithmic framework outperforms conventional Expectation-Maximization (EM) methods in both estimation accuracy and computational time.

Keywords: Gaussian Mixture Models, model selection, parameter estimation, spectral gap, Fourier measurements, gradient-based optimization, PCA dimension reduction, minimax optimality
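
A little arithmetic shows how quickly the lower bound from the abstract grows with the number of components. This reads only the stated rate $\Omega(\Delta^{-(4k-4)})$; the constant factor is unknown and is dropped here.

```latex
n \gtrsim \Delta^{-(4k-4)}:
\qquad k = 2 \;\Rightarrow\; n \gtrsim \Delta^{-4},
\qquad k = 3 \;\Rightarrow\; n \gtrsim \Delta^{-8},
\qquad \frac{(\Delta/2)^{-(4k-4)}}{\Delta^{-(4k-4)}} = 2^{4k-4}.
```

With $\Delta = 0.1$ the rate is of order $10^{4}$ for $k = 2$ and $10^{8}$ for $k = 3$, and halving the separation multiplies it by $16$ and $256$ respectively.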

236. ❌ Ensembles-based Feature Guided Analysis

Authors: Federico Formica, Stefano Gregis, Andrea Rota, Aurora Francesca Zanenga, Mark Lawford, Claudio Menghi Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19653v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	10.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

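Rationale: The paper focuses on explainability techniques for Deep Neural Networks (DNNs) and proposes Ensembles-based Feature Guided Analysis (EFGA) to raise the recall of extracted rules. Its core topic is DNN explainability, which matches "Mechanistic Interpretability OR Explainable AI" closely (10), since that keyword directly covers explaining internal model behavior and explainable AI techniques. The paper does not address large language models (LLMs), training techniques (such as pre-training or fine-tuning), inference optimization, alignment, agent systems, or scientific AI applications. Those keywords target large-model techniques and their applications, whereas this work studies explainability for general DNNs, so every keyword other than the explainability one scores 0.

!!! tip deepseek-chat TL;DR

The paper proposes Ensembles-based Feature Guided Analysis (EFGA), which aggregates rules to raise the recall of explainability rules for deep neural networks. On the MNIST and LSC datasets it improves recall substantially (up to +33.15%) with a negligible reduction in test precision (at most 0.89%).

Abstract (Chinese translation)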

近期深度神经网络（DNN）的应用需要能够解释其行为的技术。现有解决方案，如特征引导分析（Feature Guided Analysis, FGA），通过提取其内部行为规则（例如提供与神经元激活相关的解释）来实现这一目标。文献结果表明，这些规则具有较高的精确度（即能正确预测特定类别的特征），但其召回率（即规则适用的情境数量）则较为有限。为缓解这一问题，本文提出了基于集成学习的特征引导分析（Ensembles-based Feature Guided Analysis, EFGA）。EFGA将FGA提取的规则组合成集成模型，通过聚合不同规则来提升其适用性，具体取决于聚合准则——即指导如何将规则组合成集成模型的策略。尽管我们的解决方案具有可扩展性，用户可开发不同的聚合准则，但本研究考虑了三种不同的聚合准则。我们评估了准则选择如何影响EFGA在两个基准数据集（即MNIST和LSC数据集）上的有效性，发现不同的聚合准则在精确度与召回率之间提供了不同的权衡方案。随后，我们将EFGA与FGA进行比较。在此实验中，我们选择了一种能在精确度与召回率之间取得合理平衡的聚合准则。实验结果表明，与FGA相比，EFGA在训练召回率（MNIST上提升28.51%，LSC上提升33.15%）和测试召回率（MNIST上提升25.76%，LSC上提升30.81%）方面均有显著提高，而测试精确度的下降可忽略不计（MNIST上降低0.89%，LSC上降低0.69%）。

Abstract

Recent Deep Neural Network (DNN) applications ask for techniques that can explain their behavior. Existing solutions, such as Feature Guided Analysis (FGA), extract rules on their internal behaviors, e.g., by providing explanations related to neuron activation. Results from the literature show that these rules have considerable precision (i.e., they correctly predict certain classes of features), but the recall (i.e., the number of situations these rules apply to) is more limited. To mitigate this problem, this paper presents Ensembles-based Feature Guided Analysis (EFGA). EFGA combines rules extracted by FGA into ensembles. Ensembles aggregate different rules to increase their applicability depending on an aggregation criterion, a policy that dictates how to combine rules into ensembles. Although our solution is extensible, and different aggregation criteria can be developed by users, in this work, we considered three different aggregation criteria. We evaluated how the choice of the criterion influences the effectiveness of EFGA on two benchmarks (i.e., the MNIST and LSC datasets), and found that different aggregation criteria offer alternative trade-offs between precision and recall. We then compare EFGA with FGA. For this experiment, we selected an aggregation criterion that provides a reasonable trade-off between precision and recall. Our results show that EFGA has higher train recall (+28.51% on MNIST, +33.15% on LSC), and test recall (+25.76% on MNIST, +30.81% on LSC) than FGA, with a negligible reduction on the test precision (-0.89% on MNIST, -0.69% on LSC).

Keywords: Ensembles-based Feature Guided Analysis, Deep Neural Networks, Explainability, Rule Extraction, Precision-Recall Trade-off, MNIST, LSC, Aggregation Criteria
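
The precision/recall trade-off described in the abstract can be illustrated with a toy ensemble. This is a minimal sketch, assuming a logical-OR aggregation criterion and threshold rules over a single scalar feature; the abstract does not specify the paper's three criteria, and the names `precision_recall` and `any_of` and the data are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Rule = Callable[[float], bool]


def precision_recall(rule: Rule, xs: Sequence[float],
                     labels: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of a rule that predicts the positive class when it fires."""
    fired = [rule(x) for x in xs]
    tp = sum(1 for f, y in zip(fired, labels) if f and y)
    fp = sum(1 for f, y in zip(fired, labels) if f and not y)
    positives = sum(1 for y in labels if y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / positives if positives else 0.0
    return precision, recall


def any_of(rules: Sequence[Rule]) -> Rule:
    """Assumed aggregation criterion: the ensemble fires if any member rule fires."""
    return lambda x: any(rule(x) for rule in rules)


# Toy data: one scalar feature per sample and a boolean class label.
xs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = [True, True, False, False, True, True]

rule_low: Rule = lambda x: x < 0.25    # precise, but covers only half the positives
rule_high: Rule = lambda x: x > 0.55   # covers the other half, with one false positive

for name, rule in [("low", rule_low), ("high", rule_high),
                   ("ensemble", any_of([rule_low, rule_high]))]:
    p, r = precision_recall(rule, xs, labels)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy case the ensemble reaches every positive sample, while its precision lands between those of the two member rules, which is the kind of trade-off an aggregation criterion has to manage.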

237. ❌ Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

Authors: Siddharth Chandak, Anuj Yadav, Ayfer Ozgur, Nicholas Bambos Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19648v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

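Rationale: The paper presents a finite-time analysis of Stochastic Approximation (SA) under heavy-tailed and long-range dependent noise, which is theoretical work in mathematical optimization and stochastic processes. The scoring keywords all concern large models, deep learning principles, or AI applications, none of which the paper addresses, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper studies the finite-time convergence of stochastic approximation algorithms under heavy-tailed and long-range dependent noise, establishes the first explicit convergence rates under these two non-classical noise models, and applies the framework to stochastic gradient descent and gradient play.

Abstract (Chinese translation)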

随机逼近（Stochastic Approximation, SA）是一种基础的迭代框架，在强化学习和优化领域具有广泛应用。经典分析通常依赖于鞅差或具有有界二阶矩的马尔可夫噪声，但在包括金融和通信在内的许多实际场景中，常会遇到重尾和长程依赖（Long-Range Dependent, LRD）噪声。本文研究了在这些非经典噪声模型下，利用SA寻找强单调算子根的问题。我们首次建立了两种设定下的有限时间矩界，给出了显式收敛速率，以量化重尾性和时间依赖性的影响。我们的分析采用了一种噪声平均论证方法，该方法在不改变迭代过程的前提下，对噪声的影响进行了正则化处理。最后，我们将此通用框架应用于随机梯度下降（Stochastic Gradient Descent, SGD）和梯度博弈，并通过数值实验验证了我们的有限时间分析。

Abstract

Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.

Keywords: Stochastic Approximation, Heavy-tailed Noise, Long-range Dependent Noise, Finite-time Analysis, Convergence Rates, Stochastic Gradient Descent, Gradient Play, Strongly Monotone Operator
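
The basic iteration the abstract refers to can be sketched as follows. This is a minimal sketch and not the paper's analysis: the operator $F(x) = x - 2$, the step size $1/(k+1)$, and the symmetrized Pareto noise are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
from typing import Callable, Iterable, List


def stochastic_approximation(operator: Callable[[float], float],
                             x0: float,
                             noise: Iterable[float]) -> float:
    """Run x_{k+1} = x_k - a_k * (F(x_k) + xi_k) with assumed steps a_k = 1/(k+1)."""
    x = x0
    for k, xi in enumerate(noise):
        step = 1.0 / (k + 1)
        x = x - step * (operator(x) + xi)
    return x


def heavy_tailed_noise(n: int, alpha: float, seed: int) -> List[float]:
    """Symmetrized Pareto samples: zero mean for alpha > 1, infinite variance for alpha <= 2."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha) for _ in range(n)]


def shifted_identity(x: float) -> float:
    """F(x) = x - 2, a strongly monotone operator with root x* = 2."""
    return x - 2.0


noise = heavy_tailed_noise(n=100_000, alpha=1.5, seed=0)
estimate = stochastic_approximation(shifted_identity, x0=0.0, noise=noise)
print(f"estimate of the root: {estimate:.4f}")
```

For this particular operator and step size, the final iterate equals 2 minus the sample mean of the noise, so the error is governed entirely by how fast heavy-tailed averages concentrate.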

238. ❌ RiboSphere: Learning Unified and Efficient Representations of RNA Structures

Authors: Zhou Zhang, Hanqun Cao, Cheng Tan, Fang Wu, Pheng Ann Heng, Tianfan Fu Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19636v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

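Rationale: RiboSphere focuses on RNA structure modeling, which falls within AI for Science (specifically bioinformatics), so it is highly relevant to "AI for Science OR Bioinformatics OR Cheminformatics" (10). The paper transfers pretrained representations to downstream tasks, which gives it some relevance to "Pre-training OR Continual Pre-training OR Domain Adaptation" (5). The other keywords mainly concern LLM-specific techniques, training methods, inference optimization, and agent systems. This work instead learns geometric representations of RNA structure with a geometric transformer and flow matching and is not based on LLMs or related techniques, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the challenges of RNA structure modeling by combining vector quantization with flow matching to learn discrete geometric representations. It achieves high-fidelity structure reconstruction and transfers effectively to inverse folding and RNA–ligand binding prediction in data-scarce settings.

Abstract (Chinese translation)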

精确的RNA结构建模仍然面临挑战，这主要源于RNA骨架的高度灵活性、非规范相互作用的普遍存在以及实验确定的3D结构相对稀缺。我们引入了RiboSphere，这是一个通过将矢量量化与流匹配相结合来学习RNA离散几何表示的框架。我们的设计灵感源于RNA结构的模块化组织特性：复杂的折叠由重复出现的结构基序组合而成。RiboSphere使用一个几何变换器编码器来生成SE(3)不变（旋转/平移不变）特征，这些特征通过有限标量量化（Finite Scalar Quantization, FSQ）被离散化为一个有限的潜在编码词汇表。以这些离散编码为条件，一个流匹配解码器重建原子坐标，从而实现高保真度的结构生成。我们发现，学习到的编码索引富集了特定的RNA基序，这表明模型捕获的是基序层面的组合结构，而非仅仅充当纯粹的压缩瓶颈。在多项基准测试中，RiboSphere在结构重建方面表现出色（RMSD 1.25 Å，TM-score 0.84），其预训练的离散表示能有效迁移应用于逆折叠和RNA-配体结合预测任务，并在数据稀缺的情况下展现出稳健的泛化能力。

Abstract

Accurate RNA structure modeling remains difficult because RNA backbones are highly flexible, non-canonical interactions are prevalent, and experimentally determined 3D structures are comparatively scarce. We introduce RiboSphere, a framework that learns discrete geometric representations of RNA by combining vector quantization with flow matching. Our design is motivated by the modular organization of RNA architecture: complex folds are composed from recurring structural motifs. RiboSphere uses a geometric transformer encoder to produce SE(3)-invariant (rotation/translation-invariant) features, which are discretized with finite scalar quantization (FSQ) into a finite vocabulary of latent codes. Conditioned on these discrete codes, a flow-matching decoder reconstructs atomic coordinates, enabling high-fidelity structure generation. We find that the learned code indices are enriched for specific RNA motifs, suggesting that the model captures motif-level compositional structure rather than acting as a purely compressive bottleneck. Across benchmarks, RiboSphere achieves strong performance in structure reconstruction (RMSD 1.25 Å, TM-score 0.84), and its pretrained discrete representations transfer effectively to inverse folding and RNA–ligand binding prediction, with robust generalization in data-scarce regimes.

Keywords: RNA structure modeling, geometric representations, vector quantization, flow matching, geometric transformer, SE(3)-invariant features, structure reconstruction, inverse folding
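
The discretization step named in the abstract, finite scalar quantization (FSQ), can be sketched in a few lines. This is a simplified illustration that assumes an odd number of levels per dimension and tanh bounding; it omits the encoder, the straight-through gradient used in training, and whatever level configuration the paper actually uses. The function name and the example levels are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fsq_quantize(z: Sequence[float], levels: Sequence[int]) -> Tuple[List[int], int]:
    """Quantize each latent dimension to one of `levels[i]` values and return
    the per-dimension digits together with a single code index.

    Assumes every entry of `levels` is odd, so the grid is symmetric around 0.
    """
    if len(z) != len(levels):
        raise ValueError("z and levels must have the same length")
    digits: List[int] = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch only handles odd level counts")
        half = (num_levels - 1) // 2
        # Bound the value to (-half, half) with tanh, then snap to the integer grid.
        quantized = round(half * math.tanh(value))
        digits.append(quantized + half)  # shift to the range 0..num_levels-1
    # Mixed-radix encoding: the first dimension is the least significant digit.
    index, base = 0, 1
    for digit, num_levels in zip(digits, levels):
        index += digit * base
        base *= num_levels
    return digits, index


digits, index = fsq_quantize([0.0, 10.0, -10.0], levels=[5, 5, 5])
print(digits, index)  # vocabulary size here is 5 * 5 * 5 = 125 codes
```

Because the codebook is an implicit grid, the vocabulary size is simply the product of the level counts and no codebook vectors need to be learned.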

239. ❌ Alternating Diffusion for Proximal Sampling with Zeroth Order Queries

Authors: Hirohane Takagi, Atsushi Nitanda Journal/Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19633v1

Score: 0.0 / 26.6 ❌

Scoring details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

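Scoring rationale: The paper studies a new approximate proximal sampling method that performs diffusion-style sampling using only zeroth-order information, which places it in probabilistic sampling and computational statistics. It is unrelated to every scoring keyword (all of which concern large models, deep learning techniques, and their applications): it involves no language models, training techniques, inference optimization, alignment methods, agent systems, or AI-for-science applications.

!!! tip deepseek-chat TL;DR

The paper proposes a new approximate proximal sampler that uses only zeroth-order information of the potential function. By modeling the intermediate particle distribution as a Gaussian mixture and simulating the dynamics directly, it avoids rejection sampling, inherits exponential convergence in theory, and achieves fast convergence with a deterministic runtime in experiments.

Abstract (Chinese translation)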

本研究提出了一种仅利用势函数零阶信息的新型近似近端采样器。先前理论分析表明，近端采样对应于热流前向与后向迭代的交替执行。传统方法通过拒绝采样实现后向步骤，而本工作直接模拟该动力学过程。与基于扩散的采样方法（需通过学习模型或调用辅助采样器估计分数函数）不同，本方法将中间粒子分布建模为高斯混合模型，从而从可直接采样的分布中导出蒙特卡洛分数估计器。理论上，当分数估计误差得到充分控制时，在目标分布满足等周条件下，本方法继承了近端采样的指数收敛特性。在实际应用中，该算法避免了拒绝采样，允许灵活的步长设置，并以确定性运行时预算执行。数值实验表明，通过多粒子间的相互作用及并行计算的优化，本方法能快速收敛至目标分布。

Abstract

This work introduces a new approximate proximal sampler that operates solely with zeroth-order information of the potential function. Prior theoretical analyses have revealed that proximal sampling corresponds to alternating forward and backward iterations of the heat flow. The backward step was originally implemented by rejection sampling, whereas we directly simulate the dynamics. Unlike diffusion-based sampling methods that estimate scores via learned models or by invoking auxiliary samplers, our method treats the intermediate particle distribution as a Gaussian mixture, thereby yielding a Monte Carlo score estimator from directly samplable distributions. Theoretically, when the score estimation error is sufficiently controlled, our method inherits the exponential convergence of proximal sampling under isoperimetric conditions on the target distribution. In practice, the algorithm avoids rejection sampling, permits flexible step sizes, and runs with a deterministic runtime budget. Numerical experiments demonstrate that our approach converges rapidly to the target distribution, driven by interactions among multiple particles and by exploiting parallel computation.

Keywords: proximal sampling, zeroth-order queries, diffusion-based sampling, Monte Carlo score estimator, Gaussian mixture, exponential convergence, deterministic runtime, parallel computation
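
The abstract does not spell out the paper's Monte Carlo score estimator, so the sketch below shows only the textbook identity it builds on: when a particle set is smoothed by Gaussian noise of variance `t`, the result is an equal-weight Gaussian mixture whose score has a closed form. The function name and arguments are illustrative assumptions, not the authors' code.

```python
import math

def gaussian_mixture_score(y, centers, t):
    """Score (gradient of the log-density) at y of (1/n) * sum_i N(centers[i], t * I)."""
    sq_dists = [sum((yk - ck) ** 2 for yk, ck in zip(y, c)) for c in centers]
    logits = [-d / (2.0 * t) for d in sq_dists]
    top = max(logits)                                   # subtract the max for stability
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    weights = [u / total for u in unnorm]               # responsibility of each particle
    # The score is the responsibility-weighted pull toward each center, scaled by 1/t.
    return [
        sum(w * (c[k] - y[k]) for w, c in zip(weights, centers)) / t
        for k in range(len(y))
    ]

score = gaussian_mixture_score([0.0], [[1.0]], 2.0)     # single particle at 1.0
```

Evaluating this expression needs only the particle positions, which is consistent with the abstract's point that no learned score model or auxiliary sampler is required.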

240. ❌ On the role of memorization in learned priors for geophysical inverse problems

Authors: Ali Siahkoohi, Davide Sabeddu Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19629v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

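Scoring rationale: The paper studies deep generative models (diffusion models) applied to geophysical inverse problems, focusing on memorization when training data are scarce and its effect on the posterior distribution. It is unrelated to most keywords (large-model techniques, training methods, inference optimization, agents, and so on) and relates only partly to "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 5), because the work applies AI to the geosciences but does not involve bioinformatics or cheminformatics.

!!! tip deepseek-chat TL;DR

The paper studies the memorization that learned priors based on deep generative models can exhibit in geophysical inverse problems when training data are scarce. It derives a closed-form Gaussian mixture posterior under a memorized diffusion-model prior and demonstrates the effect of memorization on posterior sampling through full waveform inversion experiments.

Abstract (Chinese translation)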

基于深度生成模型的学习先验为地震反演提供了数据驱动的正则化方法，但其训练需要具有代表性的地下模型数据集——这一资源在地球科学应用中本质上是稀缺的。由于大多数生成模型的训练目标可归结为在有限数据集上的最大似然估计，任何此类模型都可能收敛到经验分布——实质上是在记忆训练样本，而非学习潜在的地质分布。我们证明，在此类记忆化先验下的后验可简化为一个重加权的经验分布，即在存储的训练样本中进行似然加权的查找。具体对于扩散模型，记忆化会产生一个闭式形式的高斯混合先验，而将前向算子围绕每个训练样本线性化，则可得到一个高斯混合后验，其各分量的宽度和偏移由局部雅可比矩阵决定。我们在一个程式化的反问题上验证了这些预测，并通过全波形反演中的扩散后验采样展示了记忆化带来的后果。

Abstract

Learned priors based on deep generative models offer data-driven regularization for seismic inversion, but training them requires a dataset of representative subsurface models – a resource that is inherently scarce in geoscience applications. Since the training objective of most generative models can be cast as maximum likelihood on a finite dataset, any such model risks converging to the empirical distribution – effectively memorizing the training examples rather than learning the underlying geological distribution. We show that the posterior under such a memorized prior reduces to a reweighted empirical distribution – i.e., a likelihood-weighted lookup among the stored training examples. For diffusion models specifically, memorization yields a Gaussian mixture prior in closed form, and linearizing the forward operator around each training example gives a Gaussian mixture posterior whose components have widths and shifts governed by the local Jacobian. We validate these predictions on a stylized inverse problem and demonstrate the consequences of memorization through diffusion posterior sampling for full waveform inversion.

Keywords: deep generative models, seismic inversion, memorization, diffusion models, geophysical inverse problems, learned priors, full waveform inversion, posterior sampling
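
The abstract's claim that a memorized prior turns the posterior into "a likelihood-weighted lookup among the stored training examples" can be illustrated with a short sketch. It assumes additive Gaussian noise with standard deviation `sigma` and a user-supplied `forward` operator; both are assumptions made for illustration and are not taken from the paper.

```python
import math

def memorized_posterior_weights(y, training_examples, forward, sigma):
    """Posterior weight of each stored example when the prior is the empirical distribution."""
    log_liks = []
    for x in training_examples:
        predicted = forward(x)
        misfit = sum((a - b) ** 2 for a, b in zip(y, predicted))
        log_liks.append(-misfit / (2.0 * sigma ** 2))   # Gaussian log-likelihood, up to a constant
    top = max(log_liks)                                  # subtract the max for stability
    unnorm = [math.exp(l - top) for l in log_liks]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Toy problem: the forward operator doubles a scalar model.
examples = [[0.0], [1.0], [2.0]]
weights = memorized_posterior_weights([2.0], examples, lambda x: [2.0 * x[0]], sigma=0.5)
```

In this toy case nearly all posterior mass lands on the stored example whose simulated data match the observation, and no model outside the training set can receive any mass.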

241. ❌ Continual Learning for Food Category Classification Dataset: Enhancing Model Adaptability and Performance

Authors: Piyush Kaushik Bhattacharyya, Devansh Tomar, Shubham Mishra, Divyanshu Rai, Yug Pratap Singh, Harsh Yadav, Krutika Verma, Vishal Meena, N Sangita Achary Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19624v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	5.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

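Scoring rationale: The paper proposes a continual learning framework for text-guided food classification, aimed at the difficulty conventional machine learning models have in recognizing categories absent from the training set. It is an applied machine learning study with no direct link to most keywords, especially those on large-model techniques. Only two keywords are weakly relevant: 1) 'Pre-training OR Continual Pre-training OR Domain Adaptation': the paper involves continual learning, a technique for adapting a model to new domains or tasks, but does not explicitly mention pre-training or domain adaptation, so it is scored 5. 2) 'AI for Science OR Bioinformatics OR Cheminformatics': the paper targets food classification, dietary monitoring, and personalized nutrition planning, an application of AI in science and health, but not core bioinformatics or cheminformatics, so it is scored 5. The remaining keywords concern large-model techniques, reasoning, alignment, optimization, and similar topics, and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a continual learning framework for text-guided food classification that lets a model learn new food categories incrementally without retraining from scratch, improving adaptability and performance in dietary monitoring and personalized nutrition planning applications.

Abstract (Chinese translation)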

传统机器学习流程通常难以识别原始训练集中未出现的类别。由于固定数据集很少能涵盖一个领域的全部多样性，这一缺陷往往会降低分类准确性。为解决该问题，我们提出一种用于文本引导食物分类的持续学习框架。与需要从头重新训练的方法不同，我们的方法支持增量更新，能够在整合新类别时不损害已有知识。例如，一个基于西餐训练的分类模型，后续可以学习识别印度豆米煎饼（dosa）或韩国泡菜（kimchi）等菜肴。尽管仍需进一步优化，该设计展现了在自适应食物识别领域的潜力，可应用于饮食监测与个性化营养规划。

Abstract

Conventional machine learning pipelines often struggle to recognize categories absent from the original training set. This gap typically reduces accuracy, as fixed datasets rarely capture the full diversity of a domain. To address this, we propose a continual learning framework for text-guided food classification. Unlike approaches that require retraining from scratch, our method enables incremental updates, allowing new categories to be integrated without degrading prior knowledge. For example, a model trained on Western cuisines could later learn to classify dishes such as dosa or kimchi. Although further refinements are needed, this design shows promise for adaptive food recognition, with applications in dietary monitoring and personalized nutrition planning.

Keywords: continual learning, food classification, text-guided, incremental updates, model adaptability, dietary monitoring, personalized nutrition

242. ❌ On Performance Guarantees for Federated Learning with Personalized Constraints

Authors: Mohammadjavad Ebrahimi, Daniel Burbano, Farzad Yousefian Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19617v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

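Scoring rationale: The paper studies personalized constrained optimization in federated learning, proposes the PC-FedAvg method, analyzes its communication complexity, and validates it on the MNIST and CIFAR-10 datasets. All scoring keywords concern large models, deep learning technique principles, or AI-for-science applications, whereas this paper focuses on an optimization algorithm framework for conventional federated learning and does not touch large-model techniques, training methods, inference optimization, AI agents, or AI-for-science applications. All keyword relevance scores are therefore 0.

!!! tip deepseek-chat TL;DR

The paper studies federated optimization with personalized constraints, proposes the PC-FedAvg method, proves convergence guarantees for both suboptimality and feasibility, and validates its performance on standard datasets.

Abstract (Chinese translation)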

联邦学习（Federated Learning, FL）作为一种通信高效的算法框架，已在多智能体分布式学习领域崭露头角。标准的联邦学习框架主要处理无约束或全局约束问题，然而许多实际场景涉及异构的资源或模型约束，导致优化问题具有智能体特定的可行集。本文研究一种个性化约束联邦优化问题，其中每个智能体关联一个凸局部目标函数和一个私有约束集。我们提出PC-FedAvg方法，该方法通过多块局部决策向量使每个智能体维护对其他智能体变量的交叉估计。每个智能体在本地更新所有块，仅对自身块中的不可行性施加惩罚。此外，交叉估计机制使得个性化过程无需在智能体之间达成共识或共享约束信息。我们在理论上证明了该方法在次优性方面达到$\mathcal{O}(ε^{-2})$的通信复杂度收敛速率，在智能体级不可行性方面达到$\mathcal{O}(ε^{-1})$的收敛速率。基于MNIST和CIFAR-10数据集的初步实验验证了我们的理论结论。

Abstract

Federated learning (FL) has emerged as a communication-efficient algorithmic framework for distributed learning across multiple agents. While standard FL formulations capture unconstrained or globally constrained problems, many practical settings involve heterogeneous resource or model constraints, leading to optimization problems with agent-specific feasible sets. Here, we study a personalized constrained federated optimization problem in which each agent is associated with a convex local objective and a private constraint set. We propose PC-FedAvg, a method in which each agent maintains cross-estimates of the other agents’ variables through a multi-block local decision vector. Each agent updates all blocks locally, penalizing infeasibility only in its own block. Moreover, the cross-estimate mechanism enables personalization without requiring consensus or sharing constraint information among agents. We establish communication-complexity rates of $\mathcal{O}(ε^{-2})$ for suboptimality and $\mathcal{O}(ε^{-1})$ for agent-wise infeasibility. Preliminary experiments on the MNIST and CIFAR-10 datasets validate our theoretical findings.

Keywords: Federated Learning, Personalized Constraints, Distributed Optimization, Communication Complexity, Convex Optimization, Multi-agent Systems, PC-FedAvg, Feasibility Guarantees

243. ❌ Demonstrations, CoT, and Prompting: A Theoretical Analysis of ICL

Authors: Xuhan Tong, Yuchen Zeng, Jiawei Zhang Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19611v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	8.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	10.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	15.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

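Scoring rationale: The paper's core subject is In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting, so it is highly relevant to 'In-context Learning OR Many-shot Learning' (scored 15), which is the paper's central topic. It explicitly studies LLMs and CoT, so 'Large Language Models OR LLMs OR Foundation Models' and 'Chain of Thought OR CoT Reasoning OR Multi-step Reasoning' each score 10. The paper notes that pretraining underlies ICL, so 'Pre-training OR Continual Pre-training OR Domain Adaptation' scores 8. Other keywords such as MoE, SLMs, SFT, RAG, and quantization are not mentioned in the abstract, are unrelated to the paper, and score 0.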
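
The paper above has several high relevance values yet a total of 0.0 against the 26.6 threshold. A small sketch shows how the two numbers may relate. The threshold formula (5.0 + 0.8 × total weight) and the per-keyword cap of 15 come from the report's configuration; the rule that each keyword contributes weight × relevance is an assumption, since the report does not state how the Score column is computed.

```python
def passing_score(total_weight, base=5.0, factor=0.8):
    """Threshold as stated in the report configuration: 5.0 + 0.8 * total weight."""
    return base + factor * total_weight

def total_score(rows, max_per_keyword=15.0):
    """Sum per-keyword scores. ASSUMPTION: each score is weight * relevance, capped at the maximum."""
    return sum(min(weight * relevance, max_per_keyword) for weight, relevance in rows)

threshold = passing_score(27.0)                       # 27 keywords with a configured weight of 1.0
# Non-zero rows of the table above: (weight, relevance)
rows = [(0.0, 10.0), (0.0, 8.0), (0.0, 10.0), (0.0, 15.0)]
total = total_score(rows)
passed = total >= threshold
```

Under this assumed rule, the 0.0 weights shown in the table force the total to 0.0 whatever the relevance values are; with weights of 1.0, the value listed in the configuration, the same rows would sum to 43.0.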

!!! tip deepseek-chat TL;DR

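Through theoretical analysis, the paper studies how In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting affect the generalization of pretrained large language models. It derives an upper bound on the ICL test loss showing that demonstration quality, the model's intrinsic ICL capability, and the degree of distribution shift jointly determine performance, and that CoT improves learning through task decomposition.

Abstract (Chinese translation)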

上下文学习（In-Context Learning, ICL）使预训练大语言模型能够基于少量输入-输出示例进行条件化处理，从而适应下游任务，而无需更新任何参数。尽管已有许多理论尝试解释ICL的工作原理，但大多数研究要么依赖于较强的架构或数据假设，要么未能捕捉关键实践因素（如示例选择、思维链（Chain-of-Thought, CoT）提示、示例数量及提示模板）的影响。我们通过基于温和假设建立ICL的理论分析来弥补这一空白，该分析将这些设计选择与泛化行为联系起来。我们推导出ICL测试损失的上界，表明其性能受以下因素支配：（一）所选示例的质量，通过ICL损失沿连接测试提示与预训练样本路径的Lipschitz常数量化；（二）预训练模型固有的ICL能力；以及（三）分布偏移的程度。在同一框架下，我们将CoT提示分析为一种诱导任务分解的方法，并证明当每个子步骤的示例选择得当且生成的子任务更易学习时，CoT是有益的。最后，我们刻画了ICL性能对提示模板的敏感性如何随示例数量变化。综合而言，我们的研究表明：预训练赋予模型泛化至未见任务的能力，而CoT使模型能够将简单子任务组合为更复杂的任务，示例与指令则使其能够检索相似或复杂的任务（包括那些可组合为更复杂任务的任务），共同支持对未见任务的泛化。所有理论见解均通过实验得到验证。

Abstract

In-Context Learning (ICL) enables pretrained LLMs to adapt to downstream tasks by conditioning on a small set of input-output demonstrations, without any parameter updates. Although there have been many theoretical efforts to explain how ICL works, most either rely on strong architectural or data assumptions, or fail to capture the impact of key practical factors such as demonstration selection, Chain-of-Thought (CoT) prompting, the number of demonstrations, and prompt templates. We address this gap by establishing a theoretical analysis of ICL under mild assumptions that links these design choices to generalization behavior. We derive an upper bound on the ICL test loss, showing that performance is governed by (i) the quality of selected demonstrations, quantified by Lipschitz constants of the ICL loss along paths connecting test prompts to pretraining samples, (ii) an intrinsic ICL capability of the pretrained model, and (iii) the degree of distribution shift. Within the same framework, we analyze CoT prompting as inducing a task decomposition and show that it is beneficial when demonstrations are well chosen at each substep and the resulting subtasks are easier to learn. Finally, we characterize how ICL performance sensitivity to prompt templates varies with the number of demonstrations. Together, our study shows that pretraining equips the model with the ability to generalize beyond observed tasks, while CoT enables the model to compose simpler subtasks into more complex ones, and demonstrations and instructions enable it to retrieve similar or complex tasks, including those that can be composed into more complex ones, jointly supporting generalization to unseen tasks. All theoretical insights are corroborated by experiments.

Keywords: In-Context Learning, Chain-of-Thought, Large Language Models, Theoretical Analysis, Generalization, Demonstration Selection, Prompt Templates, Pretraining

244. ❌ Wearable Foundation Models Should Go Beyond Static Encoders

Authors: Yu Yvonne Wu, Yuwei Zhang, Hyungjun Yoon, Ting Dang, Dimitris Spathis, Tong Xia, Qiang Yang, Jing Han, Dong Ma, Sung-Ju Lee, Cecilia Mascolo Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19564v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	10.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	10.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

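Scoring rationale: The paper centers on wearable foundation models (WFMs) for health monitoring, an application of large models to a scientific domain (healthcare). Highly relevant keywords: 1) 'Large Language Models OR LLMs OR Foundation Models' (10): the paper explicitly discusses foundation models; 2) 'Context Window Extension OR Long Context LLMs' (10): the paper emphasizes long-context inference and long-term modeling; 3) 'LLM Agents OR Autonomous Agents OR Agentic Workflow' (10): the paper proposes agentic inference systems; 4) 'AI for Science OR Bioinformatics OR Cheminformatics' (10): the paper is an application of AI in biomedicine. Other keywords such as MoE, SFT, and RAG are not mentioned in the abstract and score 0.

!!! tip deepseek-chat TL;DR

The paper argues that current wearable foundation models rely mainly on static encoders for short-term health monitoring and are poorly suited to modeling long-term chronic conditions. It calls for a move toward longitudinal, anticipatory health reasoning built on three foundational shifts: structurally rich data, longitudinal-aware modeling, and agentic inference systems.

Abstract (Chinese translation)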

可穿戴基础模型（Wearable foundation models, WFMs）通过利用经济型常开设备采集的大规模数据进行训练，已在短期、定义明确的健康监测任务中展现出强大性能，包括活动识别、健身追踪和心血管信号评估。然而，现有的大多数WFMs主要通过静态编码器将短时时间窗口映射到预定义标签，侧重于回顾性预测，而非对动态变化的个人历史、情境及未来风险轨迹进行推理。因此，这些模型难以适用于建模持续数周、数月或数年的慢性、进展性或偶发性健康状况。为此，我们认为WFMs必须超越静态编码器，并应明确设计用于纵向、前瞻性的健康推理。我们提出了实现这一转变所需的三个基础性变革：（1）结构丰富的数据，其超越孤立数据集或结果导向的采集方式，转向整合多模态、长期的个人轨迹及情境元数据，并最好由开放、可互操作的数据生态系统支持；（2）纵向感知的多模态建模，其优先考虑长上下文推理、时间抽象与个性化，而非横断面或群体层面的预测；（3）具主动性的推理系统，其超越静态预测，支持在不确定性下的规划、决策及基于临床依据的干预。这些变革共同将可穿戴健康监测从回顾性信号解读，重新定位为持续、前瞻且与人协同的健康支持体系。

Abstract

Wearable foundation models (WFMs), trained on large volumes of data collected by affordable, always-on devices, have demonstrated strong performance on short-term, well-defined health monitoring tasks, including activity recognition, fitness tracking, and cardiovascular signal assessment. However, most existing WFMs primarily map short temporal windows to predefined labels via static encoders, emphasizing retrospective prediction rather than reasoning over evolving personal history, context, and future risk trajectories. As a result, they are poorly suited for modeling chronic, progressive, or episodic health conditions that unfold over weeks, months or years. Hence, we argue that WFMs must move beyond static encoders and be explicitly designed for longitudinal, anticipatory health reasoning. We identify three foundational shifts required to enable this transition: (1) Structurally rich data, which goes beyond isolated datasets or outcome-conditioned collection to integrated multimodal, long-term personal trajectories, and contextual metadata, ideally supported by open and interoperable data ecosystems; (2) Longitudinal-aware multimodal modeling, which prioritizes long-context inference, temporal abstraction, and personalization over cross-sectional or population-level prediction; and (3) Agentic inference systems, which move beyond static prediction to support planning, decision-making, and clinically grounded intervention under uncertainty. Together, these shifts reframe wearable health monitoring from retrospective signal interpretation toward continuous, anticipatory, and human-aligned health support.

Keywords: Wearable Foundation Models, Longitudinal Health Reasoning, Agentic Inference Systems, Multimodal Modeling, Personalized Health Monitoring, Chronic Health Conditions, Anticipatory Health Support, Contextual Metadata

245. ❌ Neural Uncertainty Principle: A Unified View of Adversarial Fragility and LLM Hallucination

Authors: Dong-Xiao Zhang, Hu Lou, Jun-Jie Zhang, Jun Zhu, Deyu Meng Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19562v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	10.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	10.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	5.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

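Scoring rationale: The paper's core subject is hallucination in large language models (LLMs), and it proposes a unified Neural Uncertainty Principle (NUP) to explain both hallucination and adversarial fragility, so it is highly relevant to 'Large Language Models OR LLMs OR Foundation Models' and 'Hallucination Mitigation OR Factuality OR Truthfulness' (10 each). Its explanatory framework, built on geometry and uncertainty, relates partly to explainable AI (5). The paper does not cover other specific large-model training techniques (MoE, SFT, RLHF, PEFT, etc.), inference optimization (quantization, speculative decoding), application paradigms (RAG, agents), or domain-specific scientific applications, so those keywords score 0.

!!! tip deepseek-chat TL;DR

The paper shows that adversarial fragility in vision models and hallucination in large language models share a common geometric origin: the input and its loss gradient are conjugate observables subject to an irreducible uncertainty bound (the Neural Uncertainty Principle). Building on this theory, it proposes methods that improve robustness without adversarial training (ConjMask, LogitReg) and a probe that detects hallucination risk before decoding.

Abstract (Chinese translation)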

视觉模型中的对抗脆弱性与大语言模型中的幻觉现象通常被视为两个独立问题，各自通过针对特定模态的修补方案进行处理。本研究首次揭示二者具有共同的几何起源：输入及其损失梯度构成一对共轭可观测量，受制于一个不可约的不确定性边界。通过在损失诱导的状态下形式化神经不确定性原理（Neural Uncertainty Principle, NUP），我们发现，在接近边界的区域中，进一步的压缩必然伴随着敏感度分散性的增加（即对抗脆弱性），而较弱的提示-梯度耦合则会导致生成过程约束不足（即幻觉）。关键在于，这一边界受到输入-梯度相关通道的调节，该通道可通过专门设计的单次反向探测进行捕捉。在视觉任务中，掩蔽高度耦合的输入成分可在无需昂贵对抗训练的情况下提升鲁棒性；在语言任务中，同一预填充阶段的探测可在生成任何答案标记之前检测幻觉风险。因此，NUP将两种看似分离的故障分类转化为共享的不确定性预算视角，并为可靠性分析提供了原则性的理论透镜。在此NUP理论指导下，我们提出ConjMask（掩蔽高贡献输入成分）和LogitReg（逻辑值侧正则化）方法，以在不依赖对抗训练的情况下提升鲁棒性，并将该探测用作大语言模型的免解码风险信号，实现幻觉检测与提示选择。NUP由此为诊断和缓解感知与生成任务中的边界异常提供了一个统一且实用的框架。

Abstract

Adversarial vulnerability in vision and hallucination in large language models are conventionally viewed as separate problems, each addressed with modality-specific patches. This study first reveals that they share a common geometric origin: the input and its loss gradient are conjugate observables subject to an irreducible uncertainty bound. Formalizing a Neural Uncertainty Principle (NUP) under a loss-induced state, we find that in near-bound regimes, further compression must be accompanied by increased sensitivity dispersion (adversarial fragility), while weak prompt-gradient coupling leaves generation under-constrained (hallucination). Crucially, this bound is modulated by an input-gradient correlation channel, captured by a specifically designed single-backward probe. In vision, masking highly coupled components improves robustness without costly adversarial training; in language, the same prefill-stage probe detects hallucination risk before generating any answer tokens. NUP thus turns two seemingly separate failure taxonomies into a shared uncertainty-budget view and provides a principled lens for reliability analysis. Guided by this NUP theory, we propose ConjMask (masking high-contribution input components) and LogitReg (logit-side regularization) to improve robustness without adversarial training, and use the probe as a decoding-free risk signal for LLMs, enabling hallucination detection and prompt selection. NUP thus provides a unified, practical framework for diagnosing and mitigating boundary anomalies across perception and generation tasks.

Keywords: Neural Uncertainty Principle, LLM Hallucination, Adversarial Fragility, Uncertainty Bound, Hallucination Mitigation, Robustness, Input-gradient Correlation, Decoding-free Detection

246. ❌ An Adaptive Machine Learning Framework for Fluid Flow in Dual-Network Porous Media

Authors: V. S. Maduri, K. B. Nakshatrala Journal/Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19561v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

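Scoring rationale: The paper studies a physics-informed neural network (PINN) framework for modeling fluid flow in dual-porosity media, which belongs to scientific computing and computational fluid dynamics. It involves no large language models, no innovation in deep learning technique principles, and no biomedical AI applications. Although it is an application of AI to science (physical modeling), it uses no large-model techniques and does not involve bioinformatics or cheminformatics. Accordingly, the 'AI for Science' keyword scores 5 (partly relevant) as a scientific computing application, and all other keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

The paper presents a physics-informed neural network framework for forward and inverse modeling of dual-porosity media systems, addressing fluid flow prediction and parameter identification in complex geometries.

Abstract (Chinese translation)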

多孔材料——无论是天然的还是人工设计的——常呈现双孔隙网络结构，该结构控制着致密页岩中矿物勘探与碳氢化合物开采等过程。双孔隙度/渗透率（DPP）数学模型描述了不可压缩流体流经两个相互作用的孔隙网络并伴随网络间质量交换的过程。尽管数值方法已取得显著进展，但仍需能够实现快速预测、数据同化和可靠反演分析的计算框架。为此，我们提出了一种基于物理信息神经网络（PINN）的框架，用于DPP系统的正演与反演建模。该方法将混合形式的控制方程及边界条件直接编码至损失函数中，并采用自适应加权策略以平衡各项贡献。该框架的关键特征包括自适应权重调整、动态配置点选取，以及使用共享主干神经架构以高效捕捉双孔隙网络的耦合行为。该方法本质上是无网格的，因此非常适用于多孔介质中常见的复杂几何结构。它能精确捕捉层状区域中解场的不连续性，而不会引入经典有限元格式中常见的伪振荡现象。重要的是，该框架非常适合反演分析，能够在关键物理量（如DPP模型中的传质系数）难以直接测量的场景下实现稳健的参数识别。此外，本文提供了系统的收敛性分析，以严格评估该方法的稳定性、准确性和可靠性。通过一系列代表性数值实验，验证了该方法的有效性和计算优势。

Abstract

Porous materials – natural or engineered – often exhibit dual pore-network structures that govern processes such as mineral exploration and hydrocarbon recovery from tight shales. Double porosity/permeability (DPP) mathematical models describe incompressible fluid flow through two interacting pore networks with inter-network mass exchange. Despite significant advances in numerical methods, there remains a need for computational frameworks that enable rapid forecasting, data assimilation, and reliable inverse analysis. To address this, we present a physics-informed neural network (PINN) framework for forward and inverse modeling of DPP systems. The proposed approach encodes the governing equations in mixed form, along with boundary conditions, directly into the loss function, with adaptive weighting strategies to balance their contributions. Key features of the framework include adaptive weight tuning, dynamic collocation point selection, and the use of shared trunk neural architectures to efficiently capture the coupled behavior of the dual pore networks. It is inherently mesh-free, making it well-suited for complex geometries typical of porous media. It accurately captures discontinuities in solution fields across layered domains without introducing spurious oscillations commonly observed in classical finite element formulations. Importantly, the framework is well-suited for inverse analysis, enabling robust parameter identification in scenarios where key physical quantities – such as the mass transfer coefficient in DPP models – are difficult to measure directly. In addition, a systematic convergence analysis is provided to rigorously assess the stability, accuracy, and reliability of the method. The effectiveness and computational advantages of the approach are demonstrated through a series of representative numerical experiments.

Keywords: physics-informed neural network, dual porosity/permeability, fluid flow, inverse modeling, adaptive weighting, mesh-free method, porous media, parameter identification

247. ❌ Learning to Bet for Horizon-Aware Anytime-Valid Testing

Authors: Ege Onur Taga, Samet Oymak, Shubhanshu Shekhar Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19551v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

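Scoring rationale: The paper studies a finite-horizon optimal control problem arising in statistical hypothesis testing. It uses the betting/e-process framework and deep reinforcement learning (DQN) to learn betting policies for valid hypothesis testing under a strict deadline. The content belongs entirely to statistics, optimal control, and reinforcement learning, and has no direct connection to any of the listed keywords on large models, deep learning principles, or AI for science. The paper does not mention language models, model training, inference optimization, AI agents, or scientific AI applications.

!!! tip deepseek-chat TL;DR

The paper studies how to learn betting policies with deep reinforcement learning (DQN) under a strict deadline, in order to build horizon-aware, anytime-valid hypothesis tests and confidence sequences, and reports state-of-the-art results in limited-horizon experiments.

Abstract (Chinese translation)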

我们针对严格截止期限$N$下的有界均值问题，开发了具有时限感知的任意时间有效检验与置信序列。基于投注/e-过程（e-process）框架，我们将时限感知投注建模为状态空间$(t, \log W_t)$下的有限时域最优控制问题，其中$t$表示时间，$W_t$为检验鞅值。我们首先证明在状态空间的某些内部区域，显著偏离凯利投注的策略可被严格证明是次优的，而凯利投注策略能以高概率达到阈值。随后，我们提出了充分条件以表明：在该区域之外，若投注者进度落后，采用比凯利更激进的投注策略可能更优；若进度超前，则更保守的策略可能更佳。综合这些结果，我们在$(t, \log W_t)$平面上构建了一个简单的相图，划分出凯利投注、分数凯利投注和激进投注各自适用的区域。基于此相图的指导，我们引入了一种深度强化学习方法，该方法采用通用深度Q网络智能体，通过合成经验学习单一策略，并将历史观测值的简单统计量映射到不同时域和原假设取值下的投注决策。在有限时域实验中，所学习的DQN策略取得了当前最优的结果。

Abstract

We develop horizon-aware anytime-valid tests and confidence sequences for bounded means under a strict deadline $N$. Using the betting/e-process framework, we cast horizon-aware betting as a finite-horizon optimal control problem with state space $(t, \log W_t)$, where $t$ is the time and $W_t$ is the test martingale value. We first show that in certain interior regions of the state space, policies that deviate significantly from Kelly betting are provably suboptimal, while Kelly betting reaches the threshold with high probability. We then identify sufficient conditions showing that outside this region, more aggressive betting than Kelly can be better if the bettor is behind schedule, and less aggressive can be better if the bettor is ahead. Taken together these results suggest a simple phase diagram in the $(t, \log W_t)$ plane, delineating regions where Kelly, fractional Kelly, and aggressive betting may be preferable. Guided by this phase diagram, we introduce a Deep Reinforcement Learning approach based on a universal Deep Q-Network (DQN) agent that learns a single policy from synthetic experience and maps simple statistics of past observations to bets across horizons and null values. In limited-horizon experiments, the learned DQN policy yields state-of-the-art results.

Keywords: anytime-valid testing, confidence sequences, betting framework, optimal control, deep reinforcement learning, DQN, finite-horizon, hypothesis testing
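
The betting framework named in the abstract can be illustrated with a minimal sketch. It tests the null hypothesis that observations in [0, 1] have mean `m` by multiplying a wealth process by `1 + lam * (x - m)` at each step and rejecting once the wealth reaches `1 / alpha`. The fixed bet `lam`, the function name `betting_test`, and the data are illustrative assumptions; this is a plain fixed-fraction bettor, not the paper's horizon-aware DQN policy.

```python
import math


def betting_test(xs, m, lam, alpha=0.05):
    """Fixed-fraction betting test of H0: E[x] = m for x in [0, 1].

    Returns (rejected, stopping_time, final_log_wealth).
    """
    # The bet must keep every factor 1 + lam * (x - m) strictly positive
    # for all x in [0, 1] (hypothetical guard, assumes 0 < m < 1).
    assert -1.0 / (1.0 - m) < lam < 1.0 / m
    log_wealth = 0.0
    threshold = math.log(1.0 / alpha)
    for t, x in enumerate(xs, start=1):
        log_wealth += math.log(1.0 + lam * (x - m))
        if log_wealth >= threshold:
            # Wealth crossed 1/alpha: reject H0 and stop early.
            return True, t, log_wealth
    return False, len(xs), log_wealth


# Data far from the null: each step multiplies the wealth by 1.5.
print(betting_test([1.0] * 10, m=0.5, lam=1.0))
# Data consistent with the null mean: the wealth never exceeds 1.
print(betting_test([0.0, 1.0] * 50, m=0.5, lam=1.0))
```

In the first call the test stops at step 8, since 1.5 ** 7 is below 20 and 1.5 ** 8 is above it. A horizon-aware policy in the sense of the abstract would let `lam` depend on the remaining time and the current log-wealth instead of keeping it fixed.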

248. ❌ Verifiable Error Bounds for Physics-Informed Neural Network Solutions of Lyapunov and Hamilton-Jacobi-Bellman Equations

Authors: Jun Liu Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19545v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

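Scoring rationale: The paper studies verifiable error bounds for physics-informed neural networks (PINNs) applied to Lyapunov and Hamilton-Jacobi-Bellman partial differential equations, which is an application of AI to scientific computing. All of the keywords relate directly to large language models (LLMs) and associated techniques (training, alignment, inference optimization, agents, and so on), whereas this paper does not involve LLMs at all and only applies conventional neural networks to a specific scientific problem. The only marginally relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, since PINNs are an application of AI to science (here, computational mathematics and control theory). However, the paper does not touch bioinformatics or cheminformatics, and its core is mathematical proof rather than a typical AI-for-Science application, so it scores 5 (some connection). All other keywords are unrelated to the paper and score 0.

!!! tip deepseek-chat TL;DR

The paper addresses the lack of rigorous error guarantees when physics-informed neural networks (PINNs) are used to solve Lyapunov and Hamilton-Jacobi-Bellman PDEs. It proposes verifiable error bounds and shows that residual bounds yield relative error bounds on the solution, a posteriori estimates, and the optimality gap of the induced control policy.

Abstract (Chinese translation)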

非线性系统分析与控制中的许多核心问题可转化为求解偏微分方程（PDE），例如李雅普诺夫方程和哈密顿-雅可比-贝尔曼（Hamilton-Jacobi-Bellman，HJB）方程。物理信息神经网络（Physics-Informed Neural Networks，PINNs）作为一种无网格方法，在近似求解此类方程方面展现出潜力，但在现有研究中，通常缺乏严格的理论保证以证明较小的PDE残差必然对应较小的解误差。本文针对李雅普诺夫方程和HJB方程的近似解，建立了可验证的误差界，并特别聚焦于基于PINN的近似方法。对于这两类PDE，我们证明可验证的残差界能够导出相对于真实解的相对误差界，并基于近似解给出可计算的后验估计。对于HJB方程，该方法还能在紧致子水平集上为最优值函数提供经认证的上下界，并量化所推导反馈策略的最优性间隙。我们进一步证明，单侧残差界已足以保证近似解本身构成有效的李雅普诺夫函数或控制李雅普诺夫函数。最后，我们通过数值算例对理论结果进行了验证。

Abstract

Many core problems in nonlinear systems analysis and control can be recast as solving partial differential equations (PDEs) such as Lyapunov and Hamilton-Jacobi-Bellman (HJB) equations. Physics-informed neural networks (PINNs) have emerged as a promising mesh-free approach for approximating their solutions, but in most existing works there is no rigorous guarantee that a small PDE residual implies a small solution error. This paper develops verifiable error bounds for approximate solutions of Lyapunov and HJB equations, with particular emphasis on PINN-based approximations. For both the Lyapunov and HJB PDEs, we show that a verifiable residual bound yields relative error bounds with respect to the true solutions as well as computable a posteriori estimates in terms of the approximate solutions. For the HJB equation, this also yields certified upper and lower bounds on the optimal value function on compact sublevel sets and quantifies the optimality gap of the induced feedback policy. We further show that one-sided residual bounds already imply that the approximation itself defines a valid Lyapunov or control Lyapunov function. We illustrate the results with numerical examples.

Keywords: Physics-informed neural networks, Lyapunov equations, Hamilton-Jacobi-Bellman equations, Error bounds, Partial differential equations, Nonlinear systems, Control theory, Verifiable guarantees
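
The abstract's claim that a residual bound implies a relative error bound can be checked by hand on a toy problem. The sketch below assumes the scalar system x' = -x and the Lyapunov equation V'(x) f(x) = -omega(x) with omega(x) = x^2, whose exact solution is V(x) = x^2 / 2. The candidate `c * x**2`, the grid, and all function names are illustrative assumptions, and the paper's actual bounds and conditions may differ.

```python
def f(x):
    """Vector field of the toy system x' = -x."""
    return -x


def omega(x):
    """Right-hand side of the Lyapunov equation V'(x) f(x) = -omega(x)."""
    return x * x


def v_true(x):
    """Exact solution of the toy Lyapunov equation."""
    return 0.5 * x * x


def make_candidate(c):
    """Approximate solution c * x^2 and its derivative (stand-in for a PINN)."""
    return (lambda x: c * x * x), (lambda x: 2.0 * c * x)


def residual_ratio(dv_hat, xs):
    """Largest residual |V_hat' f + omega| measured relative to omega."""
    return max(abs(dv_hat(x) * f(x) + omega(x)) / omega(x) for x in xs)


def relative_error(v_hat, xs):
    """Largest relative error |V_hat - V| / V on the grid."""
    return max(abs(v_hat(x) - v_true(x)) / v_true(x) for x in xs)


grid = [0.1 * k for k in range(1, 21)]  # excludes x = 0, where V vanishes
v_hat, dv_hat = make_candidate(0.55)
eps = residual_ratio(dv_hat, grid)
err = relative_error(v_hat, grid)
print(eps, err)
```

For this toy case both quantities equal |1 - 2c| up to rounding, so the relative residual bound and the relative solution error coincide. The residual is computable without knowing the true solution, which is what makes such a bound verifiable.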

249. ❌ Scalable Cross-Facility Federated Learning for Scientific Foundation Models on Multiple Supercomputers

Authors: Yijiang Li, Zilinghan Li, Kyle Chard, Ian Foster, Todd Munson, Ravi Madduri, Kibaek Kim Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19544v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	8.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	8.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

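Scoring rationale: The paper mainly studies a federated learning framework across high-performance computing facilities for training scientific foundation models. Relevance to “Large Language Models OR LLMs OR Foundation Models” is 8 because the paper fine-tunes a large language model. Relevance to “Post-training OR Supervised Fine-tuning OR SFT” is 8 because the paper validates its scientific applicability by fine-tuning an LLM. Relevance to “AI for Science OR Bioinformatics OR Cheminformatics” is 10 because the paper explicitly targets scientific applications and validates on a chemistry dataset. Other keywords such as MoE, SLMs, Scaling Laws, and RLHF are unrelated to the paper's core content and score 0.

!!! tip deepseek-chat TL;DR

The paper proposes a federated learning framework spanning high-performance computing facilities, addressing the data-privacy and resource challenges of training large models for scientific applications, and validates its scientific utility by fine-tuning a large language model on a chemistry dataset.

Abstract (Chinese translation)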

面向科学应用的人工智能日益需要在因隐私限制、数据主权或数据生成体量巨大而无法集中化的数据上训练大型模型。联邦学习通过在不集中原始数据的情况下实现协同训练来解决这一问题，但科学应用所需的模型规模要求大量计算资源，这些资源通常由高性能计算设施提供。在高性能计算设施间部署联邦学习实验带来了超越云或企业环境的挑战。我们提出了一种面向异构高性能计算环境的综合性跨设施联邦学习框架，该框架基于高级隐私保护联邦学习框架，并利用Globus Compute与Transfer进行编排，在美国能源部四台领导级超级计算机上进行了评估。我们证明了跨高性能计算设施的联邦学习实验是实际可行的，分析了影响训练性能的关键异构性来源，并表明在现实的高性能计算调度条件下，算法选择至关重要。我们通过在化学指令数据集上微调一个大语言模型验证了其科学适用性，并指出面向调度器的算法设计是未来部署面临的关键开放挑战。

Abstract

Artificial Intelligence for scientific applications increasingly requires training large models on data that cannot be centralized due to privacy constraints, data sovereignty, or the sheer volume of data generated. Federated learning (FL) addresses this by enabling collaborative training without centralizing raw data, but scientific applications demand model scales that require extensive computing resources, typically offered at High Performance Computing (HPC) facilities. Deploying FL experiments across HPC facilities introduces challenges beyond cloud or enterprise settings. We present a comprehensive cross-facility FL framework for heterogeneous HPC environments, built on the Advanced Privacy-Preserving Federated Learning (APPFL) framework with Globus Compute and Transfer orchestration, and evaluate it across four U.S. Department of Energy (DOE) leadership-class supercomputers. We demonstrate that FL experiments across HPC facilities are practically achievable, characterize key sources of heterogeneity impacting the training performance, and show that algorithmic choices matter significantly under realistic HPC scheduling conditions. We validate the scientific applicability by fine-tuning a large language model on a chemistry instruction dataset, and identify scheduler-aware algorithm design as a critical open challenge for future deployments.

Keywords: Federated Learning, High Performance Computing, Scientific Foundation Models, Cross-facility Training, Large Language Models, Chemistry Instruction Dataset, Model Fine-tuning, HPC Scheduling
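
The abstract does not say which aggregation rule the framework uses. As an assumption, the sketch below shows sample-weighted federated averaging (FedAvg), a common baseline, on plain Python lists to illustrate how a server can combine client models without seeing raw data. The function name `fedavg` and the client numbers are hypothetical, and nothing here reflects the APPFL or Globus APIs.

```python
def fedavg(client_updates):
    """Sample-weighted average of client parameter vectors.

    client_updates: list of (num_samples, params), where params is a
    list of floats of the same length for every client.
    """
    total_samples = sum(n for n, _ in client_updates)
    dim = len(client_updates[0][1])
    aggregated = [0.0] * dim
    for n, params in client_updates:
        weight = n / total_samples  # clients with more data count more
        for i, p in enumerate(params):
            aggregated[i] += weight * p
    return aggregated


# Two hypothetical facilities holding 1 and 3 local samples.
updates = [
    (1, [0.0, 2.0]),
    (3, [4.0, 2.0]),
]
print(fedavg(updates))  # → [3.0, 2.0]
```

A synchronous rule like this waits for every client in each round, which is one reason batch-queue delays at HPC facilities can make the choice of algorithm matter, as the abstract notes.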

250. ❌ Multimodal branched transport infers anatomically aligned brain reaction maps

Authors: Cristian Mendico Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19761v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

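Scoring rationale: The paper studies the propagation architecture behind brain reaction maps, using multimodal data (BOLD responses, electrophysiology, tractography) and variational optimization to infer a branched transport structure. It belongs to neuroscience and computational modeling. All of the keywords relate directly to large models, deep learning principles, or AI applications, but the paper involves no large models, deep learning, or AI techniques or algorithms. Only the last keyword, “AI for Science OR Bioinformatics OR Cheminformatics”, scores 5 (some connection) because the paper is a scientific computing application (neuroscience). The remaining keywords are entirely unrelated and score 0.

!!! tip deepseek-chat TL;DR

By combining multimodal neuroimaging data with variational optimization, the paper infers the branched stimulus-to-reaction propagation architecture of the brain, shows how anisotropic costs reshape routing backbones, and quantifies the trade-off between geometric efficiency and dynamical controllability.

Abstract (Chinese translation)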

外部刺激如何转化为分布式反应模式，在传播架构层面仍未得到解决。现有的大规模控制模型可在预设网络上量化转移成本，但未能从源活动与目标活动中推断出路由图本身。本研究结合任务相关的血氧水平依赖响应、源重建电生理学以及纤维束成像衍生的各向异性数据，估算了刺激与反应指标，定义了解剖传输成本，并通过变分优化推断出分支化传播架构。与标准传输模型不同，分支化传输倾向于在信号重新分配前将其汇聚至共享的神经高速通道。我们进一步将随机图诱导动力学附加至推断出的路由图中，并量化了几何效率与动态可控性之间的权衡关系。研究表明：多模态数据可生成解剖对齐的脑反应图谱；相较于各向同性基线，各向异性成本会定性重塑路由主干结构；而几何-动力学混合优化揭示了不同分支机制间非平凡的排序逆转现象。

Abstract

How external stimulation is transformed into distributed reaction patterns remains unresolved at the level of propagation architecture. Existing large-scale control models quantify transition costs on prescribed networks but do not infer the routing map itself from source and target activity. Here we combine task-related blood-oxygen-level-dependent responses, source-reconstructed electrophysiology and tractography-derived anisotropy to estimate stimulation and reaction measures, define an anatomical transport cost, and infer a branched propagation architecture by variational optimisation. Unlike standard transport formulations, branched transport favours aggregation of signal into shared neural highways before redistribution. We further attach a stochastic graph-induced dynamics to the inferred map and quantify the trade-off between geometric efficiency and dynamical controllability. We show that multimodal data generate anatomically aligned brain reaction maps, that anisotropic costs qualitatively reshape routing backbones relative to isotropic baselines, and that hybrid geometric–dynamical optimisation reveals non-trivial rank reversals across branching regimes.

Keywords: branched transport, brain reaction maps, multimodal data, variational optimization, anisotropic costs, propagation architecture, neural highways, geometric-dynamical optimization

251. ❌ Branched Optimal Transport for Stimulus to Reaction Brain Mapping

Authors: Cristian Mendico Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19751v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

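Scoring rationale: The paper studies stimulus-to-reaction brain mapping in systems neuroscience and proposes a variational framework based on branched optimal transport to infer the propagation network architecture in the brain. Its content is focused entirely on mathematical modeling, optimal transport theory, neuroscience, and computational neuroscience, and involves no large language models, deep learning, AI techniques, or related applications. All of the scoring keywords relate directly to large models, deep learning techniques, and their applications, whereas the paper's topic, methods, terminology, and content have no connection to them, so every keyword scores 0.

!!! tip deepseek-chat TL;DR

Addressing the systems-neuroscience question of how an external stimulus propagates through the brain to produce a reaction, the paper proposes a variational framework based on branched optimal transport. It builds a brain reaction map by inferring a graph/current connecting the stimulation source to the reaction target, and proves existence of minimizers in both discrete and continuous formulations.

Abstract (Chinese translation)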

系统神经科学的一个核心问题是确定外部刺激如何通过大脑传播以产生反应。现有的确定性与随机控制模型能够在预设网络上量化脑状态间的转移成本，但并未将传输网络本身视为未知对象。本文提出一个变分框架，其推断对象是连接刺激源测度与反应目标测度的图/流。该模型被构建为各向异性分支最优输运问题，其中通量成本的凹性促进聚集与分支现象。最优流的支撑集定义了刺激到反应的路径架构，可解释为大脑反应图谱。我们证明了离散与连续形式下极小元的存在性，并引入一种混合随机扩展，将分支输运与诱导图动力学上路径空间的Kullback-Leibler控制成本相结合。该方法提供了一种数学机制，用于推断传播架构而非在固定基底上控制轨迹。

Abstract

A central problem in systems neuroscience is to determine how an external stimulation is propagated through the brain so as to produce a reaction. Current deterministic and stochastic control models quantify transition costs between brain states on a prescribed network, but do not treat the transport network itself as an unknown. Here we propose a variational framework in which the inferred object is a graph/current connecting a stimulation source measure to a reaction target measure. The model is posed as an anisotropic branched optimal transport problem, where concavity of the flux cost promotes aggregation and branching. The support of an optimal current defines a stimulus-to-reaction routing architecture, interpreted as a brain reaction map. We prove existence of minimizers in discrete and continuous formulations and introduce a hybrid stochastic extension combining ramified transport with a path-space Kullback–Leibler control cost on the induced graph dynamics. This approach provides a mathematical mechanism for inferring propagation architectures rather than controlling trajectories on fixed substrates.

Keywords: branched optimal transport, stimulus to reaction mapping, systems neuroscience, variational framework, brain reaction map, propagation architectures, graph dynamics, Kullback-Leibler control
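
The abstract's statement that a concave flux cost promotes aggregation and branching can be seen in a small numerical sketch. It assumes the classical isotropic branched-transport edge cost `mass ** alpha * length` with 0 < alpha < 1 (the paper's anisotropic cost is not reproduced here), and compares routing two unit sources to one sink directly (a V shape) against merging them at a junction first (a Y shape). The coordinates and function names are illustrative assumptions.

```python
import math


def edge_cost(p, q, mass, alpha):
    """Cost of carrying `mass` along the straight segment from p to q."""
    return (mass ** alpha) * math.dist(p, q)


def v_shape_cost(sources, sink, alpha):
    """Each source is routed to the sink on its own edge."""
    return sum(edge_cost(s, sink, m, alpha) for s, m in sources)


def y_shape_cost(sources, junction, sink, alpha):
    """Sources merge at a junction, then share one trunk to the sink."""
    total_mass = sum(m for _, m in sources)
    branches = sum(edge_cost(s, junction, m, alpha) for s, m in sources)
    return branches + edge_cost(junction, sink, total_mass, alpha)


sources = [((-1.0, 2.0), 1.0), ((1.0, 2.0), 1.0)]  # (position, mass)
sink = (0.0, 0.0)
junction = (0.0, 1.0)

for alpha in (0.5, 1.0):
    v = v_shape_cost(sources, sink, alpha)
    y = y_shape_cost(sources, junction, sink, alpha)
    print(alpha, round(v, 3), round(y, 3))
```

With alpha = 0.5 the shared trunk costs 2 ** 0.5 per unit length instead of 2, so the Y shape is cheaper. With alpha = 1 the cost is linear in the mass and the direct V shape wins, which is the sense in which concavity favors shared pathways.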

252. ❌ Stochastic Averaging and Statistical Inference of Glycolytic Pathway

Authors: Arnab Ganguly, Hye-Won Kang Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19577v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

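Scoring rationale: The paper focuses on stochastic averaging and statistical inference for the glycolytic pathway and belongs to computational biology/bioinformatics. It involves mathematical and statistical methods such as stochastic processes, ordinary differential equation models, and parameter estimation, and is entirely unrelated to the vast majority of keywords (large models, fine-tuning, inference acceleration, alignment, and so on). The only potentially relevant keyword is “AI for Science OR Bioinformatics OR Cheminformatics”, because the paper falls under bioinformatics/computational biology. However, the paper uses no deep learning or large-model techniques and relies on classical probability theory and statistics, so it scores only 5 (some connection).

!!! tip deepseek-chat TL;DR

The study establishes a rigorous probabilistic framework for deriving a reduced deterministic ODE model of the glycolytic pathway from a stochastic reaction network, and proves statistical consistency of parameter estimation when only the slow variables are observed.

Abstract (Chinese translation)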

许多生物过程呈现出振荡行为。其中，糖酵解振荡因其生化反应网络已被充分表征而得到广泛研究。然而，这些网络的复杂性需要借助低维常微分方程模型来识别核心机制并进行稳定性分析。尽管先前的研究提出了简化的常微分方程模型，但这些模型通常是从确定性描述中引入的，而非基于更能准确反映随机时间点发生的离散反应事件的底层随机动力学。本文建立了一个严格的概率框架，用于从随机表述中推导糖酵解途径的简化Othmer-Aldridge模型。完整系统被建模为具有不同时间和丰度尺度的多尺度连续时间马尔可夫链。在适当的尺度化机制和特定结构条件下，我们证明了慢变分量的动力学可由一个二维常微分方程近似描述。由于网络的复杂性及其组分间的强耦合性，该证明在技术上具有挑战性。我们进一步考虑了当观测仅限于慢变量——果糖-6-磷酸（fructose-6-phosphate）和ADP——时的参数估计问题。简化系统产生了一个仅依赖于这些变量的可处理的损失函数。我们证明了当数据来源于完整随机反应网络时，所得估计量具有统计一致性。这些结果共同提供了一个数学上严谨的框架，将随机生化反应网络、简化确定性动力学以及统计可靠的参数估计联系起来。

Abstract

Many biological processes exhibit oscillatory behavior. Among these, glycolytic oscillations have been extensively studied due to their well-characterized biochemical reaction networks. However, the complexity of these networks necessitates low-dimensional ordinary differential equation (ODE) models to identify core mechanisms and perform stability analysis. While previous studies proposed reduced ODE models, these were typically introduced from deterministic descriptions rather than the underlying stochastic dynamics, which more accurately represent discrete reaction events occurring at random times. In this paper, we develop a rigorous probabilistic framework for deriving a reduced Othmer-Aldridge model of the glycolytic pathway from its stochastic formulation. The full system is modeled as a multiscale continuous-time Markov chain with different time and abundance scales. Under an appropriate scaling regime and specific structural conditions, we prove that the dynamics of the slow components are approximated by a two-dimensional ODE. The proof is technically involved due to the network’s complexity and strong coupling between its components. We further consider the problem of parameter estimation when observations are limited to the slow species: fructose-6-phosphate and ADP. The reduced system yields a tractable loss function depending solely on these variables. We prove that the resulting estimators are statistically consistent when the data originate from the full stochastic reaction network. Together, these results provide a mathematically rigorous framework linking stochastic biochemical reaction networks, reduced deterministic dynamics, and statistically reliable parameter estimation.

Keywords: glycolytic oscillations, stochastic dynamics, ordinary differential equation (ODE) models, parameter estimation, multiscale continuous-time Markov chain, statistical consistency, reduced model, biochemical reaction networks
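
The central step in the abstract, approximating a continuous-time Markov chain reaction network by an ODE, can be illustrated on a far simpler system than the glycolytic network. The sketch below simulates a one-species birth-death network with Gillespie's algorithm and compares the scaled count with the solution of the limiting ODE x' = k_in - k_out * x. The network, the rates, the scale `n_scale`, and the function names are all illustrative assumptions, and no multiscale averaging is involved.

```python
import math
import random


def gillespie_birth_death(n_scale, k_in, k_out, t_end, seed=0):
    """Simulate births at rate k_in * n_scale and deaths at rate k_out * count.

    Returns the scaled count (count / n_scale) at time t_end, starting from 0.
    """
    rng = random.Random(seed)
    t, count = 0.0, 0
    while True:
        birth_rate = k_in * n_scale
        death_rate = k_out * count
        total_rate = birth_rate + death_rate
        # Exponential waiting time until the next reaction event.
        t += rng.expovariate(total_rate)
        if t > t_end:
            return count / n_scale
        # Pick which reaction fires, proportionally to its rate.
        if rng.random() * total_rate < birth_rate:
            count += 1
        else:
            count -= 1


def ode_solution(k_in, k_out, t):
    """Solution of x' = k_in - k_out * x with x(0) = 0."""
    return (k_in / k_out) * (1.0 - math.exp(-k_out * t))


x_stochastic = gillespie_birth_death(n_scale=1000, k_in=1.0, k_out=1.0, t_end=5.0)
x_ode = ode_solution(1.0, 1.0, 5.0)
print(x_stochastic, x_ode)
```

As `n_scale` grows, the fluctuations of the scaled count around the ODE solution shrink roughly like one over the square root of `n_scale`. The paper's result concerns a much harder multiscale setting in which fast species must be averaged out before such a limit holds.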

253. ❌ Prediction and Experimental Verification of Electrolyte Solvation Structure from an OMol25-Trained Interatomic Potential

Authors: Nitesh Kumar, Jianwei Lai, Casey S. Mezerkor, Jiaqi Wang, Kamila M. Wiaderek, J. David Bazak, Samuel M. Blau, Ethan J. Crumlin Source: arXiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20183v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

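Scoring rationale: The paper focuses on applying machine learning interatomic potentials (MLIPs) to battery electrolyte simulation, which belongs to AI for Science and is highly relevant to the keyword “AI for Science OR Bioinformatics OR Cheminformatics” (10). However, the paper does not involve large language models (LLMs), innovations in deep learning principles, or any other scoring keyword (such as MoE, SFT, or RAG), so all other keywords score 0.

!!! tip deepseek-chat TL;DR

The study evaluates how accurately machine learning interatomic potentials (MLIPs) pre-trained on the OMol25 dataset predict the solvation structure of Na-ion battery electrolytes, and finds through experimental validation that they predict densities and X-ray structure factors more accurately than models trained only on inorganic materials data.

Abstract (Chinese translation)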

基于大规模化学多样性数据集训练的机器学习原子间势（MLIPs）正在革新计算化学领域，其能以接近密度泛函理论（DFT）的精度、比DFT快万倍以上的速度进行电池电解质的分子动力学模拟。尽管以往适用于电解质研究的MLIP训练数据集主要基于无机材料构建，但Open Molecules 2025（OMol25）数据集提供了具有广泛元素覆盖度的大规模分子DFT MLIP训练数据，并专门采样了数千万个电解质构型。本研究结合计算建模与实验验证，系统评估了基于材料数据预训练的大规模MLIP与基于OMol25预训练的MLIP，在不同物理化学条件和组成范围内准确解析钠离子电池电解质纳米尺度结构组织及离子溶剂化特性的能力。我们发现，与仅基于无机材料数据训练的最先进模型相比，经OMol25训练的通用原子模型（UMA-OMol）对实验测得的密度和X射线结构因子的预测吻合度显著更优。利用UMA-OMol，我们进一步分析了溶剂化结构随阳离子种类、阴离子化学性质、盐浓度及溶剂拓扑结构变化的系统趋势。我们观察到，体系温度升高会加剧溶剂化环境的不均一性，扰动阳离子-溶剂相互作用，并促进接触离子对（CIPs）的形成。此外，基于甘醇二甲醚的电解质中溶剂拓扑结构的细微变化，会引起离子关联和溶剂化结构的显著改变。本文展示的实验吻合度与微观机理解析表明，经OMol25训练的MLIPs为实现预测性、高通量的电解质模拟提供了一条实用路径。

Abstract

Machine learning interatomic potentials (MLIPs) trained on large, chemically diverse datasets are revolutionizing computational chemistry, enabling molecular dynamics simulations of battery electrolytes with near-DFT accuracy over 10,000 times faster than DFT. While previous MLIP training datasets with suitable elemental coverage for electrolytes have been based on inorganic materials, the Open Molecules 2025 (OMol25) dataset provides large-scale molecular DFT MLIP training data with broad elemental coverage and specifically samples tens of millions of electrolyte configurations. Here, we integrate computational modeling with experimental validation to systematically assess the ability of large-scale MLIPs pre-trained on materials data or on OMol25 to accurately resolve nanoscale structural organization and ion-solvation characteristics in Na-ion battery electrolytes across diverse physicochemical conditions and compositional regimes. We find that the OMol25-trained Universal Model of Atoms (UMA-OMol) predicts experimentally measured densities and X-ray structure factors in substantially better agreement compared to state-of-the-art models trained only on inorganic materials data. Using UMA-OMol, we further analyze systematic trends in solvation structure as a function of cation identity, anion chemistry, salt concentration, and solvent topology. We observe that increasing system temperature amplifies the heterogeneity within the solvation environment, perturbing cation-solvent interactions and promoting the formation of contact ion pairs (CIPs). Moreover, subtle variations in the solvent topology of glyme-based electrolytes cause pronounced changes in ion-correlations and solvation structure. The experimental agreement and microscopic insights shown here position OMol25-trained MLIPs as a practical route to predictive, high-throughput electrolyte simulations.

Keywords: machine learning interatomic potentials, MLIPs, electrolyte solvation structure, OMol25 dataset, Na-ion battery, molecular dynamics simulations, experimental validation, Universal Model of Atoms

254. ❌ Occupancy Extrapolation: Reaching Many Excited Electronic States from Ground State Calculations

Authors: Yichen Fan, Weitao Yang Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.20055v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

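Scoring rationale: The paper concerns density functional theory (DFT) methods in computational chemistry, specifically the ΔSCF approach to computing electronic excited-state energies. Its core contribution is a new method, occupancy extrapolation (OE), which draws on Landau Fermi liquid theory and uses a Taylor expansion to compute excited-state energies efficiently. Every keyword except the last concerns large language models (LLMs), deep learning techniques, model training, inference optimization, alignment, agent systems, and similar topics, none of which relate to the paper's computational-chemistry focus. The last keyword, "AI for Science OR Bioinformatics OR Cheminformatics", receives a relevance of 5/10: the rationale places the paper within the broad scope of "AI for Science" (treating computational chemistry as a subfield of AI for science), but the paper involves no specific bioinformatics or cheminformatics application and improves a traditional quantum-chemistry method rather than building on large models or deep learning.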
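The arithmetic behind the `0.0 / 26.6` verdict can be sketched as follows. The pass-mark formula (5.0 + 0.8 × total weight, with a total weight of 27.0) comes from the report's scoring settings. The rule that a keyword's score equals its weight times its relevance is only an assumption that happens to match the rows in the table, and the function names are hypothetical.

```python
# Sketch of the report's scoring arithmetic; helper names are hypothetical.

def passing_score(total_weight: float, base: float = 5.0, factor: float = 0.8) -> float:
    """Pass mark from the report's settings: base + factor * total weight."""
    return base + factor * total_weight


def paper_score(rows: list[tuple[float, float]]) -> float:
    """Sum per-keyword scores, ASSUMING score = weight * relevance.

    Each row is (weight, relevance out of 10). The real pipeline may scale
    relevance differently; this only reproduces the rows shown in the table.
    """
    return sum(weight * relevance for weight, relevance in rows)


# 26 keywords at relevance 0.0 plus one at 5.0, all carrying the weight 0.0
# that the table prints for this paper.
rows = [(0.0, 0.0)] * 26 + [(0.0, 5.0)]
threshold = passing_score(27.0)
total = paper_score(rows)
passed = total >= threshold
print(f"{total:.1f} / {threshold:.1f} -> {'pass' if passed else 'fail'}")
```

Under this assumption, a nonzero relevance contributes nothing when the printed weight is 0.0, which would explain why a 5.0/10 relevance still yields a score of 0.0.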

!!! tip deepseek-chat TL;DR

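The paper proposes an occupancy extrapolation method, based on density functional theory and Landau Fermi liquid theory, for efficiently and accurately predicting many electronic excited-state energies from ground-state calculations.

Abstract translation (Chinese)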

ΔSCF密度泛函理论方法将体系能量定义为轨道占据数的函数。受朗道费米液体理论启发，我们发展了一种占据数外推法，该方法通过对参考态占据数涨落进行能量泰勒展开来获取激发态能量。占据数外推法保留了ΔSCF的物理本质，同时为激发能提供了准粒子能量及其广义屏蔽相互作用之和的物理解释。该方法以O(N³)计算成本获得精确的价态、里德堡态和电荷转移激发能，避免了为每个激发态进行独立自洽场计算，并能够基于基态计算实现高效的大规模激发态模拟。

Abstract

The $Δ$SCF DFT approach defines the system energy as a function of orbital occupancy. Inspired by Landau Fermi liquid theory, we develop an occupancy extrapolation (OE) method that captures excited-state energies via a Taylor expansion of the energy with respect to occupation fluctuation from a reference state. OE retains the physics of $Δ$SCF while offering a physical interpretation of excitation energies as sums of quasiparticle energies and their generalized screened interactions. It yields accurate valence, Rydberg, and charge-transfer excitation energies at $O(N^3)$ cost, avoids separate SCF calculations for each excited state, and enables efficient large-scale excited-state simulations from ground-state calculations.

Keywords: ΔSCF DFT, occupancy extrapolation, excited-state energies, Landau Fermi liquid theory, Taylor expansion, quasiparticle energies, charge-transfer excitation, O(N^3) cost
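The expansion described in the abstract can be written schematically as below. This is a generic Taylor expansion in the occupation numbers, not the paper's working equations: the symbols $n_i^{0}$, $\delta n_i$, $\varepsilon_i$, and $\kappa_{ij}$ are notation assumed here for illustration, and truncating at second order is likewise an assumption.

```latex
E[\{n_i\}] \;\approx\; E[\{n_i^{0}\}]
  + \sum_i \varepsilon_i\,\delta n_i
  + \frac{1}{2}\sum_{i,j} \kappa_{ij}\,\delta n_i\,\delta n_j,
\qquad
\delta n_i = n_i - n_i^{0},\quad
\varepsilon_i = \left.\frac{\partial E}{\partial n_i}\right|_{n^{0}},\quad
\kappa_{ij} = \left.\frac{\partial^{2} E}{\partial n_i\,\partial n_j}\right|_{n^{0}}

% Single excitation i -> a: \delta n_i = -1, \delta n_a = +1
\Delta E_{i \to a} \;\approx\; \varepsilon_a - \varepsilon_i
  + \frac{1}{2}\left(\kappa_{aa} + \kappa_{ii}\right) - \kappa_{ia}
```

In this reading, the first-order coefficients play the role of the quasiparticle energies and the second-order coefficients the role of the generalized screened interactions mentioned in the abstract.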

255. ❌ Coupled cluster theory for positron binding in anions and polyatomic molecules

Authors: Rosario R. Riso, Jan Haakon M. Trabski, Federico Rossi, Dermot Green, Henrik Koch Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19948v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

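Scoring rationale: The paper studies the positron coupled cluster method (POS-CCSD) for computing positron binding energies in molecules, which places it in quantum-chemistry computation. All keywords concern large models, deep learning, or the principles and applications of AI, none of which the paper addresses. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", because the paper belongs to scientific computing (computational chemistry). However, it uses no AI or machine-learning method and relies on traditional quantum-chemistry theory, so it receives a relevance of only 5/10 (some connection). All other keywords are entirely unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The paper proposes a positron coupled cluster singles and doubles (POS-CCSD) method for computing positron binding energies in molecules and benchmarks it on atomic anions and polyatomic molecules. The results show that electron correlation is critical for describing these complex systems.

Abstract translation (Chinese)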

本文提出正电子耦合簇单双激发（POS-CCSD）方法，用于计算分子的正电子结合能。该框架以平等方式处理电子与正电子，并包含高达双电子-单正电子同时激发的耦合效应。我们通过计算原子阴离子及若干极性与非极性多原子体系的正电子结合能，将结果与独立理论研究及现有实验数据进行对比，从而验证该方法。对于H$^{-}$体系，完全收敛的计算结果与量子蒙特卡洛及多参考组态相互作用方法高度吻合。由于电子与正电子轨道基组规模对结合能收敛速度的影响，本研究尚未实现与实验数据的定量一致。然而，POS-CCSD结果凸显了电子关联效应在电子-正电子体系描述中的关键作用，这对于平衡描述此类复杂系统至关重要。此外，我们还考察了LiH分子中正电子附着后的核弛豫效应。

Abstract

We present the positron coupled cluster singles and doubles (POS-CCSD) method to calculate positron binding energies in molecules. This framework treats electrons and positrons on an equal footing and includes up to simultaneous double-electron-single-positron excitations. We benchmark the approach by computing binding energies for atomic anions and several polar and non-polar polyatomic systems, comparing the results with independent theoretical studies and, where available, experimental data. The fully converged results for H$^{-}$ are in excellent agreement with quantum Monte Carlo and multi-reference configuration interaction results. Quantitative agreement with experiments is not reached in the present study due to the slow convergence of the binding energy with respect to the size of the orbital bases for the electrons and the positron. However, the POS-CCSD results underscore the critical role of electron correlation in the description of electron-positron systems required for a balanced description of these complex systems. In addition, we examine nuclear relaxation effects following positron attachment in LiH.

Keywords: positron binding, coupled cluster theory, POS-CCSD, electron-positron systems, binding energies, quantum chemistry, molecular systems, electron correlation

256. ❌ Data-Efficient Active Learning Discovery of Transition Metal Photosensitizers for Type I Photodynamic Therapy

Authors: Alessio Fallani, Pi A. B. Haase, Julianne F. F. Eckert, Luukas Nikkanen, Sherri A. McFarland, Martina Stella, Fabijan Pavošević Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19912v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	10.0/10	0.0

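Scoring rationale: The paper develops a data-efficient active learning framework for discovering transition-metal photosensitizers for Type I photodynamic therapy. It combines a chemically structured design space, DFT calculations, and pretrained atomistic representations. Its core is computational chemistry and materials discovery, not large-model or deep-learning technology. Accordingly, apart from "AI for Science OR Bioinformatics OR Cheminformatics" (relevance 10/10, because the study applies AI to chemistry and science), all other keywords, which mainly concern large-model techniques, training methods, inference optimization, alignment, and agents, are entirely unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The study develops a data-efficient active learning framework that combines pretrained atomistic representations with quantum-chemical calculations to efficiently discover photosensitizers for Type I photodynamic therapy among more than 2.1 million transition-metal complexes, and it reveals key design principles.

Abstract translation (Chinese)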

过渡金属配合物（TMCs）是一类在I型光动力疗法（PDT）中极具前景的光敏剂，其电子转移过程可在缺氧条件下产生活性氧。然而，在TMCs广阔的化学空间中，筛选出同时具备所需基态与激发态氧化还原能量的候选化合物仍具挑战。本研究开发了一种数据高效主动学习（AL）框架，用于发现I型活性TMC光敏剂。该框架将超过210万个Ru(II)、Os(II)和Ir(III)配合物构成的化学结构化设计空间，与目标密度泛函理论（DFT）计算及预训练的原子表示相结合。仅通过300次量子化学计算评估，该方法便高效地富集了位于机理定义的最优氧化还原区域内的候选化合物。对可行配合物的分析揭示了金属特性、配体骨架、取代基模式及物理化学性质与I型光反应性之间的化学设计原则，包括对Os(II)基配合物的显著偏好、电子不对称的配体环境，以及给电子与吸电子取代基的组合。更广泛而言，本文提出的策略为合理设计过渡金属光催化剂提供了一条可扩展、机理导向的路径，其应用可涵盖生物医学、太阳能转换及光氧化还原化学等领域。

Abstract

Transition-metal complexes (TMCs) are promising photosensitizers for Type I photodynamic therapy (PDT), where electron-transfer processes can generate reactive oxygen species under hypoxic conditions. Yet identifying candidates with the required ground- and excited-state redox energetics remains challenging across the vast chemical space of TMCs. Here, we develop a data-efficient active learning (AL) framework for the discovery of Type I active TMC photosensitizers by combining a chemically structured design space of over 2.1 million Ru(II), Os(II), and Ir(III) complexes with targeted DFT calculations and pretrained atomistic representations. With only 300 quantum-chemical evaluations, the approach efficiently enriches candidates within a mechanistically defined optimal redox region. Analysis of the viable complexes reveals chemical design principles linking metal identity, ligand framework, substituent pattern, and physicochemical properties to Type I photoreactivity, including a pronounced preference for Os(II)-based complexes and electronically asymmetric ligand environments along with combination of electronic donating and accepting substituents. More broadly, the strategy presented herein provides a scalable, mechanism-guided route for the rational design of transition-metal photocatalysts for applications spanning biomedicine, solar energy conversion, and photoredox chemistry.

Keywords: active learning, transition-metal complexes, photosensitizers, Type I photodynamic therapy, density functional theory, atomistic representations, redox energetics, chemical design principles
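The select, evaluate, and refit loop that the abstract describes can be illustrated with a toy active-learning sketch. Everything below is a stand-in: the one-dimensional design space, the `expensive_oracle` function (standing in for a DFT calculation), the nearest-neighbour surrogate, and the target window are hypothetical and are not the paper's descriptors, model, or acquisition rule.

```python
import random


def expensive_oracle(x: float) -> float:
    """Hypothetical stand-in for one costly quantum-chemical evaluation."""
    return (x - 0.7) ** 2  # smaller is better; the optimum sits at x = 0.7


def predict(x: float, labelled: dict[float, float]) -> tuple[float, float]:
    """1-nearest-neighbour surrogate.

    Returns (predicted value, distance to the nearest evaluated point) so that
    candidates can be ranked by prediction first and proximity second.
    """
    nearest = min(labelled, key=lambda seen: abs(seen - x))
    return labelled[nearest], abs(nearest - x)


def active_learning(pool: list[float], budget: int, batch: int, seed: int = 0) -> dict[float, float]:
    """Evaluate at most `budget` candidates, chosen greedily by the surrogate."""
    rng = random.Random(seed)
    # Seed round: a small random batch gives the surrogate something to use.
    labelled = {x: expensive_oracle(x) for x in rng.sample(pool, min(batch, budget))}
    while len(labelled) < budget:
        unlabelled = [x for x in pool if x not in labelled]
        if not unlabelled:
            break
        # Acquisition: take the candidates the surrogate rates as most promising.
        ranked = sorted(unlabelled, key=lambda x: predict(x, labelled))
        for x in ranked[: min(batch, budget - len(labelled))]:
            labelled[x] = expensive_oracle(x)  # spend one "expensive" call
    return labelled


pool = [i / 1000 for i in range(1001)]
evaluated = active_learning(pool, budget=30, batch=5)
hits = sum(1 for value in evaluated.values() if value < 0.01)
print(f"evaluated {len(evaluated)} of {len(pool)} candidates, {hits} inside the target window")
```

This acquisition rule is purely exploitative. Practical active-learning schemes usually also weigh model uncertainty or diversity so that the search does not stall around the first good candidate.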

257. ❌ Electromagnetic coupling between subradiant plasmons and dye molecular excitons analyzed by spectral changes in ultrafast surface-enhanced fluorescence

Authors: Tamitake Itoh, Yuko S. Yamamoto Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19869v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	0.0/10	0.0

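Scoring rationale: The paper studies the physical phenomenon of electromagnetic coupling in nanophotonics, using silver nanoparticle dimers and dye molecules for experimental analysis; it belongs to physical chemistry and nanotechnology. All scoring keywords concern large models, deep learning, and related techniques, which are entirely unrelated to this experimental physics study, so every keyword receives a relevance of 0.

!!! tip deepseek-chat TL;DR

The paper develops a method based on ultrafast surface-enhanced fluorescence to evaluate the electromagnetic coupling between subradiant plasmons and dye molecular excitons, and it explains the static and transient spectral properties with a coupled oscillator model.

Abstract translation (Chinese)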

分子激子与等离激元之间的电磁耦合通常通过瑞利散射或消光光谱进行研究。然而，评估涉及亚辐射等离激元的电磁耦合具有挑战性，因为该共振模式在远场光谱中并不明显显现。本研究开发了一种利用超快表面增强荧光（ultrafast SEF）衍生的电磁增强因子（FR）来评估此类耦合的方法。该表面增强荧光在表面增强共振拉曼散射（SERRS）光谱中表现为宽背景信号，我们通过测量纳米间隙内含染料分子的银纳米粒子二聚体获得了相关数据。结果表明，亚辐射共振的FR光谱峰出现在瑞利散射光谱的凹陷位置附近。此外，在超快表面增强荧光与表面增强共振拉曼散射的淬灭过程中，这些FR峰均呈现蓝移现象。我们采用由辐射等离激元、亚辐射等离激元及分子激子组成的耦合振子模型分析了这些静态与瞬时光谱特性。静态特性可通过增加辐射等离激元共振的线宽得以复现，而瞬态特性则通过降低激子与两种等离激元振子间的电磁耦合能量来再现。这些发现表明，该方法为评估亚辐射等离激元与分子激子间的电磁耦合提供了有力工具。

Abstract

Electromagnetic (EM) coupling between molecular exciton and plasmon has been studied using Rayleigh scattering or extinction spectroscopy. However, evaluating EM coupling involving subradiant plasmon is challenging because this resonance does not manifest clearly in far-field spectra. In this study, we developed a method to evaluate such coupling using EM enhancement factors (FR) derived from ultrafast surface-enhanced fluorescence (ultrafast SEF). This SEF, which appears as a broad background in surface-enhanced resonant Raman scattering (SERRS) spectra, was measured using silver nanoparticle dimers containing dye molecules within their nanogaps. Our results show that the spectral peaks of FR for subradiant resonances appear near the dips in Rayleigh scattering spectra. Furthermore, these FR peaks exhibit blue-shifts during the quenching processes of both ultrafast SEF and SERRS. We examined these static and temporal spectral properties using a coupled oscillator model composed of radiant plasmons, subradiant plasmons, and molecular excitons. The static properties were reproduced by increasing the linewidths of the radiant plasmon resonance, while the temporal properties were captured by decreasing the EM coupling energies between the exciton and both plasmon oscillators. These findings indicate that this methodology is a powerful tool for evaluating EM coupling between subradiant plasmons and molecular excitons.

Keywords: electromagnetic coupling, subradiant plasmons, molecular excitons, ultrafast surface-enhanced fluorescence, silver nanoparticle dimers, coupled oscillator model, spectral analysis, nanogap
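A three-oscillator model of the kind named in the abstract can be sketched as a 3×3 eigenvalue problem. This is a minimal illustration, not the paper's model: it assumes lossless oscillators (no linewidths) and no direct coupling between the two plasmons, and all energies and coupling strengths are made-up values in eV. The function names are hypothetical.

```python
import math


def eigenvalues_sym3(a):
    """Eigenvalues of a real symmetric 3x3 matrix, in descending order.

    Uses the closed-form trigonometric solution for symmetric 3x3 matrices.
    """
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    phi = math.acos(r) / 3.0
    high = q + 2.0 * p * math.cos(phi)
    low = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [high, 3.0 * q - high - low, low]


def hybrid_mode_energies(e_rad, e_sub, e_exc, g_rad, g_sub):
    """Mode energies (eV) of an exciton coupled to two plasmon oscillators.

    ASSUMPTIONS: no linewidths and no direct plasmon-plasmon coupling.
    """
    hamiltonian = [
        [e_rad, 0.0, g_rad],
        [0.0, e_sub, g_sub],
        [g_rad, g_sub, e_exc],
    ]
    return eigenvalues_sym3(hamiltonian)


# Illustrative, made-up parameters in eV; not values from the paper.
strong = hybrid_mode_energies(2.3, 2.0, 2.1, g_rad=0.20, g_sub=0.10)
weak = hybrid_mode_energies(2.3, 2.0, 2.1, g_rad=0.05, g_sub=0.025)
print("stronger coupling:", [round(e, 3) for e in strong])
print("weaker coupling:  ", [round(e, 3) for e in weak])
```

Lowering the coupling energies pulls the hybrid modes back toward the uncoupled energies, which is the kind of adjustment the abstract reports for the transient spectra. Reproducing the linewidth-dependent static spectra would require complex (lossy) energies, which this sketch omits.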

258. ❌ First-principle study of the influence of hydroxyapatite on magnesium surfaces

Authors: Anthony Veit Berg, Ablai Forster, Tim Hansson, Alexandra J. Jernstedt, Emmy Salminen, Elsebeth Schröder Source: arxiv Published: 2026-03-20 arXiv link: http://arxiv.org/abs/2603.19823v1

Score: 0.0 / 26.6 ❌

Score details

Keyword	Weight	Relevance	Score
Large Language Models OR LLMs OR Foundation Models	0.0	0.0/10	0.0
Mixture of Experts OR MoE OR Sparse Models	0.0	0.0/10	0.0
Small Language Models OR SLMs OR On-device AI	0.0	0.0/10	0.0
Scaling Laws AND Data Quality	0.0	0.0/10	0.0
Pre-training OR Continual Pre-training OR Domain Adaptation	0.0	0.0/10	0.0
Post-training OR Supervised Fine-tuning OR SFT	0.0	0.0/10	0.0
Instruction Tuning OR Alignment OR Value Alignment	0.0	0.0/10	0.0
RLHF OR RLAIF OR Direct Preference Optimization OR DPO	0.0	0.0/10	0.0
PEFT OR LoRA OR Parameter-efficient Fine-tuning	0.0	0.0/10	0.0
Retrieval-Augmented Generation OR RAG OR Retrieval-Generation	0.0	0.0/10	0.0
Context Window Extension OR Long Context LLMs	0.0	0.0/10	0.0
KV Cache Compression OR Linear Attention OR FlashAttention	0.0	0.0/10	0.0
Chain of Thought OR CoT Reasoning OR Multi-step Reasoning	0.0	0.0/10	0.0
System 2 Thinking OR Slow Thinking OR In-depth Reasoning	0.0	0.0/10	0.0
Monte Carlo Tree Search OR MCTS AND LLM	0.0	0.0/10	0.0
Self-Correction OR Self-Improvement OR Self-Reflection	0.0	0.0/10	0.0
LLM Agents OR Autonomous Agents OR Agentic Workflow	0.0	0.0/10	0.0
Tool Use OR Function Calling OR API Tool Use	0.0	0.0/10	0.0
Multi-agent Systems OR Agent Coordination	0.0	0.0/10	0.0
Quantization OR Model Compression OR Low-bit Weights	0.0	0.0/10	0.0
Speculative Decoding OR Inference Acceleration	0.0	0.0/10	0.0
Hallucination Mitigation OR Factuality OR Truthfulness	0.0	0.0/10	0.0
Mechanistic Interpretability OR Explainable AI	0.0	0.0/10	0.0
World Models AND General World Models	0.0	0.0/10	0.0
Model Merging OR Model Soups OR Weight Averaging	0.0	0.0/10	0.0
In-context Learning OR Many-shot Learning	0.0	0.0/10	0.0
AI for Science OR Bioinformatics OR Cheminformatics	0.0	5.0/10	0.0

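Scoring rationale: The paper uses density functional theory to study the adsorption of hydroxyapatite on magnesium surfaces; it belongs to materials science and computational chemistry and is entirely unrelated to deep learning or large-model technology. The only possibly relevant keyword is "AI for Science OR Bioinformatics OR Cheminformatics", since the study is computational materials science and can be viewed as a scientific-computing application. However, it uses no AI method, only traditional first-principles calculations, so it receives a relevance of 5/10 (some connection). All other keywords, which concern large-model techniques, training methods, inference optimization, AI agents, and the like, are unrelated and receive 0.

!!! tip deepseek-chat TL;DR

The study uses first-principles calculations to examine the adsorption of hydroxyapatite on pure and Ca- or Zn-doped magnesium surfaces. It finds that doping improves the adsorption energy and that, in a particular configuration, the Ca dopant atom migrates into the hydroxyapatite layer; it also characterizes the accompanying changes in electron density.

Abstract translation (Chinese)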

采用密度泛函理论研究了镁（Mg）表面羟基磷灰石（HA）的吸附行为，以助于理解HA涂层及合金化对镁基可降解植入体表面的影响。我们测算了单层HA在纯Mg(0001)表面以及稀疏钙（Ca）或锌（Zn）掺杂Mg(0001)表面的吸附能与结构变化，发现除HA相对于掺杂原子处于少数特定位置外，Zn和Ca掺杂均能增强吸附作用。所有吸附构型（无论是纯镁还是掺杂镁表面）均表现出表面与HA层的形变。对于Ca掺杂，我们发现当HA处于特定吸附构型时，掺杂的Ca原子会脱离镁表面并迁移至HA层中，同时在镁表面顶层留下一个镁空位。电子密度变化图显示，电子在Ca掺杂原子及其邻近的镁原子周围聚集，而Zn掺杂体系中这一现象较弱。总体而言，我们的结果表明掺杂元素的选择以及HA的相对位置会影响HA与镁表面的相互作用，并同时影响吸附能、原子结构与电子结构。

Abstract

Hydroxyapatite (HA) on a magnesium (Mg) surface is studied using density functional theory, to help understand the effect of HA coating and alloying in the surfaces of Mg-based biodegradable implants. We determine the adsorption energies and structural changes of a single layer of HA on pure Mg(0001) and on sparsely calcium (Ca) or zinc (Zn) doped Mg(0001) and find that both Zn and Ca doping improves the adsorption, except in a few positions of HA relative to the dopant position. All adsorption configurations, whether with pure or doped Mg surfaces, show deformation of the surface and HA layer. For Ca doping, we found that for a certain adsorption configuration, the dopant Ca atom moves out of the Mg surface and into the HA layer, leaving behind a Mg vacancy in the top layer of the Mg surface. Plots of electron density changes show that electrons accumulate around the Ca dopant and the neighboring Mg atoms, while in Zn doping this is less pronounced. Overall, our results demonstrate that the dopant choice and relative position of HA influence the interaction between HA and Mg-surfaces, and affect both adsorption energies and atomic and electronic structures.

Keywords: hydroxyapatite, magnesium surface, density functional theory, adsorption energy, calcium doping, zinc doping, electron density, biodegradable implants

Token usage statistics

Total: 798,631 tokens (input 534,193 / output 264,438)

Model	Input	Output	Total
deepseek-chat	460,621	254,557	715,178
glm-4.7	73,572	9,881	83,453
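As a quick consistency check, the per-model rows can be summed and compared with the reported totals. The snippet below only re-adds the figures from the table; the variable names are illustrative.

```python
# Figures copied from the table above: model -> (input tokens, output tokens).
usage = {
    "deepseek-chat": (460_621, 254_557),
    "glm-4.7": (73_572, 9_881),
}

total_in = sum(tokens_in for tokens_in, _ in usage.values())
total_out = sum(tokens_out for _, tokens_out in usage.values())
total = total_in + total_out

for model, (tokens_in, tokens_out) in usage.items():
    combined = tokens_in + tokens_out
    print(f"{model}: {combined:,} tokens ({combined / total:.1%} of total)")
print(f"total: {total:,} tokens (input {total_in:,} / output {total_out:,})")
```

The sums match the reported totals of 534,193 input tokens, 264,438 output tokens, and 798,631 tokens overall.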

📋 All papers

1. ✅ Evaluating Evidence Grounding Under User Pressure in Instruction-Tuned Language Models

2. ✅ WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

3. ✅ PowerLens: Taming LLM Agents for Safe and Personalized Mobile Power Management

4. ✅ Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning

5. ✅ SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia

6. ✅ All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution

7. ✅ TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?

8. ✅ DataProphet: Demystifying Supervision Data Generalization in Multimodal LLMs

9. ❌ PoC: Performance-oriented Context Compression for Large Language Models via Performance Prediction

10. ❌ LLM-Enhanced Semantic Data Integration of Electronic Component Qualifications in the Aerospace Domain

11. ❌ Enhancing Alignment for Unified Multimodal Models via Semantically-Grounded Supervision

12. ❌ Structured Latent Dynamics in Wireless CSI via Homomorphic World Models

13. ❌ Evolving Embodied Intelligence: Graph Neural Network–Driven Co-Design of Morphology and Control in Soft Robotics

14. ❌ VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

15. ❌ LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation

16. ❌ From Masks to Pixels and Meaning: A New Taxonomy, Benchmark, and Metrics for VLM Image Tampering

17. ❌ Improving Generalization on Cybersecurity Tasks with Multi-Modal Contrastive Learning

18. ❌ Adaptive Greedy Frame Selection for Long Video Understanding

19. ❌ AI Agents Can Already Autonomously Perform Experimental High Energy Physics

20. ❌ Learning Dynamic Belief Graphs for Theory-of-mind Reasoning

21. ❌ The Robot’s Inner Critic: Self-Refinement of Social Behaviors through VLM-based Replanning

22. ❌ Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

23. ❌ Semantic Token Clustering for Efficient Uncertainty Quantification in Large Language Models

24. ❌ Design-OS: A Specification-Driven Framework for Engineering System Design with a Control-Systems Design Case

25. ❌ Enhancing Hyperspace Analogue to Language (HAL) Representations via Attention-Based Pooling for Text Classification

26. ❌ An Agentic Multi-Agent Architecture for Cybersecurity Risk Management

27. ❌ Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

28. ❌ Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

29. ❌ Var-JEPA: A Variational Formulation of the Joint-Embedding Predictive Architecture – Bridging Predictive and Generative Self-Supervised Learning

30. ❌ Demonstration of Adapt4Me: An Uncertainty-Aware Authoring Environment for Personalizing Automatic Speech Recognition to Non-normative Speech

31. ❌ The $\mathbf{Y}$-Combinator for LLMs: Solving Long-Context Rot with $λ$-Calculus

32. ❌ Spectral Alignment in Forward-Backward Representations via Temporal Abstraction

33. ❌ Pitfalls in Evaluating Interpretability Agents

34. ❌ An Empirical Study of SFT-DPO Interaction and Parameterization in Small Language Models

35. ❌ Fine-tuning Timeseries Predictors Using Reinforcement Learning

36. ❌ Agentic Harness for Real-World Compilers

37. ❌ The End of Rented Discovery: How AI Search Redistributes Power Between Hotels and Intermediaries

38. ❌ Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

39. ❌ DIAL-KG: Schema-Free Incremental Knowledge Graph Construction via Dynamic Schema Induction and Evolution-Intent Assessment

40. ❌ LoASR-Bench: Evaluating Large Speech Language Models on Low-Resource Automatic Speech Recognition Across Language Families

41. ❌ CoverageBench: Evaluating Information Coverage across Tasks and Domains

42. ❌ Orchestrating Human-AI Software Delivery: A Retrospective Longitudinal Field Study of Three Software Modernization Programs

43. ❌ Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

44. ❌ Physics-Informed Long-Range Coulomb Correction for Machine-learning Hamiltonians

45. ❌ Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

46. ❌ Promoting Critical Thinking With Domain-Specific Generative AI Provocations

47. ❌ X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving

48. ❌ Trojan’s Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

49. ❌ On the Ability of Transformers to Verify Plans

50. ❌ Graph2TS: Structure-Controlled Time Series Generation via Quantile-Graph VAEs

51. ❌ HiPath: Hierarchical Vision-Language Alignment for Structured Pathology Report Prediction

52. ❌ RAM: Recover Any 3D Human Motion in-the-Wild

53. ❌ Span-Level Machine Translation Meta-Evaluation

54. ❌ Learning Like Humans: Analogical Concept Learning for Generalized Category Discovery

55. ❌ Revealing Domain-Spatiality Patterns for Configuration Tuning: Domain Knowledge Meets Fitness Landscapes

56. ❌ Utility-Guided Agent Orchestration for Efficient LLM Tool Use

57. ❌ Integrating Meta-Features with Knowledge Graph Embeddings for Meta-Learning

58. ❌ What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

59. ❌ Failure Modes for Deep Learning-Based Online Mapping: How to Measure and Address Them

60. ❌ Semantic Delta: An Interpretable Signal Differentiating Human and LLMs Dialogue

61. ❌ Gesture2Speech: How Far Can Hand Movements Shape Expressive Speech?

62. ❌ FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse and Prover-Effective Autoformalization